Start Here: Statistical Mechanics for Neural Networks and AI
Your Pathway through the Blog-Maze:
What to read, and what order to read things in, if you’re trying to teach yourself the rudiments of statistical mechanics – just enough to get a sense of what’s going on in the REAL deep learning papers.
As we all know, there are two basic realms of deep learning neural networks. There’s the kind that only requires (some, limited) knowledge of backpropagation. That’s first-semester undergraduate calculus, and almost everyone coming into this field can handle it.
The second realm of deep learning and neural networks requires – as a bare, basic minimum – first-semester graduate-school statistical physics, commonly referred to as statistical mechanics or statistical thermodynamics (depending on whether you are a physicist or a physical chemist; the core material is the same). Very, very few people have that under their belts when they walk into the deep learning / AI area.
That means that as soon as you try to teach yourself the requisite statistical mechanics, you’re very likely going to find yourself in Donner Pass.
Donner Pass, for those of you new to my posts, looks like this.
Here’s another Donner Pass example.
The immediately-previous figure shows, as Eqn. 4, the energy equation – very similar to what we saw in the previous figure. It also shows, as Eqn. 3, something called a partition function.
The challenge – for most people attempting to self-educate in neural networks and deep learning – is that this partition function is central to statistical mechanics, and this is not an easy subject to teach oneself. Further, the textbooks on this subject are typically very abstract; they’re only accessible to those who are used to thinking in terms of high-level mathematics. Add to that, these texts often address a range of subjects that are really not needed by someone who wants just enough to understand deep learning and related materials.
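Since the figures themselves may not come through here, this is the standard textbook form of the energy-and-partition-function pair – a hedged sketch, because each paper tweaks the notation a bit (many write β = 1/(k_B T), and the deep learning papers often just set the temperature to one):

```latex
% Boltzmann (Gibbs) distribution: probability of finding the system in state i,
% where E_i is the energy of that state.
p_i = \frac{e^{-E_i / k_B T}}{Z},
\qquad
Z = \sum_j e^{-E_j / k_B T}
\quad \text{(the partition function)}
```

The partition function Z is “just” the normalizing sum in the denominator – but everything interesting in statistical mechanics gets pulled out of it.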
Thus, the typical person – encountering the partition function, energy equations, and related physics-based material in the deep learning literature – will start massively searching for something – anything – that will explain what’s going on.
Pretty soon, the intrepid adventurer will be surrounded by pages of physics-based papers, both the classic statistical mechanics literature and also the many deep learning / AI / machine learning papers. The notation will be a little different in each one. The explanations of how certain things came into being – such as where the partition function comes from, and why it’s important – will be obscured. The papers, blogposts, book chapters, and Quora posts mount up – and our unfortunate adventurer is lost in the white-out of too much information and no true north. He (or she) gives up, and is frozen in the wasteland of never getting to the desired understanding.
A Bit of Backstory – Statistical Mechanics and Neural Networks
Partition functions are the heart and soul of statistical mechanics / statistical thermodynamics. Invented in large part by Ludwig Eduard Boltzmann, statistical mechanics is based on the notion that a system of very small particles, indistinguishable from each other except via their individual energies, can be described via a set of equations. Further, the macroscopic thermodynamic properties of this system – how it behaves when pressure, volume, and temperature are changed – can be derived from the equations describing the microscopic aspect of the system; the equations that deal with the energies of those small particles.
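To make “macroscopic from microscopic” a little more concrete: once you have the partition function Z (the sum over all the microscopic states, sketched above), the familiar thermodynamic quantities fall out of it by differentiation. In standard notation, with β = 1/(k_B T):

```latex
\langle E \rangle = -\frac{\partial \ln Z}{\partial \beta}
\qquad
F = -k_B T \ln Z
\qquad
S = \frac{\langle E \rangle - F}{T}
% average energy, Helmholtz free energy, and entropy -- all from the one function Z
```

That is the whole magic trick: one function, built from the microscopic energies, hands you the macroscopic behavior.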
Boltzmann’s work was a huge breakthrough; very much on a par with Einstein’s discovery of relativity. It brought into being a new way of understanding the physical properties of systems. However, not everyone believed in his work, and he spent the latter years of his life defending his ideas. (He also died by his own hand in 1906. Boltzmann’s death by suicide, and likewise the suicides of some other notables in the field of statistical mechanics, have been used as a tongue-in-cheek cautionary tale, warning off those who would make this their field of study.)
In 1982, John Hopfield used the underlying notions of statistical mechanics to create what was essentially an analogy – a system composed of multiple binary units, where each joint configuration of the units has a well-defined energy. This gave rise to what is now referred to as the Hopfield neural network.
The Hopfield neural network was a brilliant concoction. However, it had severe memory problems – it could only learn to reconstruct a relatively small number of the patterns that one might attempt to store in it.
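If you’d like to see those memory problems for yourself, here’s a minimal Hopfield sketch in Python (NumPy only) – my own toy illustration, not code from any of the original papers. It uses Hebbian storage of ±1 patterns and asynchronous recall; the classic result is that capacity tops out around 0.14 N patterns for N units, and recall falls apart if you push past that.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian storage: W is the (zero-diagonal) sum of outer products of +/-1 patterns."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)              # no self-connections
    return W / n

def recall(W, state, n_sweeps=10, rng=None):
    """Asynchronous updates: each unit flips toward lower energy."""
    rng = rng if rng is not None else np.random.default_rng(0)
    state = state.copy()
    for _ in range(n_sweeps):
        for i in rng.permutation(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

# Store a few random +/-1 patterns, then recover one from a noisy cue.
rng = np.random.default_rng(42)
N, n_patterns = 100, 5                  # well under the ~0.14 * N capacity limit
patterns = rng.choice([-1, 1], size=(n_patterns, N))
W = train_hopfield(patterns)

noisy = patterns[0].copy()
flip = rng.choice(N, size=15, replace=False)    # corrupt 15 of the 100 bits
noisy[flip] *= -1

recovered = recall(W, noisy, rng=rng)
print("bits matching the original:", np.sum(recovered == patterns[0]), "out of", N)
```

Try bumping n_patterns up toward 20 or 30 and watch the recall quality collapse – that’s the memory problem in action.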
Despite the Hopfield network’s memory limitations, the genius idea behind it was still of significance. Geoffrey Hinton, Terrence Sejnowski, and David Ackley played with the notion of using statistical physics as the basis for learning in a neural network, and came up with what they called a Boltzmann machine, trained using a process built around simulated annealing. Hinton and Sejnowski published this invention in the breakthrough two-volume set Parallel Distributed Processing, released in 1986. The Boltzmann machine was thus released to the world at the same time that the backpropagation algorithm for stochastic gradient descent became widely known.
During the late 1980s, the backpropagation learning rule (which is how most people referred to it; not everyone was precise about identifying backprop as simply a means of computing the gradients used in stochastic gradient descent) was hugely popular. It could be coded in just a few lines. It could be understood by anyone who’d gotten through first-semester calculus.
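To back up the “just a few lines” claim, here’s a hedged sketch (mine, not any canonical implementation) of backprop for a little 2-4-1 sigmoid network in NumPy – the chain rule and nothing more:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn XOR with a 2-4-1 sigmoid network.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for step in range(20000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: the chain rule applied to the squared error
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))    # should come out close to [[0], [1], [1], [0]]
```

No physics required – which is exactly why it caught on the way it did.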
The Boltzmann machine, however, was much more intimidating. While a Boltzmann machine neural network looked like (and, once trained, operated like) a Multilayer Perceptron (MLP; the kind usually trained with backprop), it was much more complex both to understand and to code.
The world revolved around backpropagation.
Then, inevitably, people came to know the limitations of a simple three-layer system: one containing only an input layer, a hidden layer, and an output layer. Attempts to build multilayer systems with more than one hidden layer often resulted in crushing defeats. (Imagine trying to build a skyscraper with only stairs to get between the levels, and no elevator.)
We entered the neural networks winter, which lasted from about the mid-1990s through about the mid-2000s.
Hinton, by this time, was safely secured in a tenured position at the University of Toronto. His funding may have been wobbly, but he was free to pursue his research – which he did. At various times, he found equivalently genius-level graduate students to work with him.
Bit by bit, Hinton and students kept chipping away.
In 2006, decades of work came together in a brief – but stunning – four-page article in Science. (Those of you who know, know that publishing in Science is like mounting the stairway to heaven. It takes a while to get there.) The work described in this article used stacks of Hinton’s refined invention, the restricted Boltzmann machine (RBM), to pre-train a deep network one layer at a time, and then fine-tuned the whole network with backpropagation. They showed extraordinary performance using their network as an autoencoder.
An autoencoder does what the original Hopfield network set out to do. That is, it can learn a variety of patterns. (For the Hinton et al. experiments, these were sets of images.) Then, given a partial or noisy image, it could recreate the originally-learned image. It was able to do this with a large number of stored patterns, overcoming the original limitation of the Hopfield network from nearly a quarter-century before.
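For the curious: the RBM is exactly where the partition function shows up in that paper. A hedged sketch of the standard binary-RBM equations (notation varies from paper to paper) – with visible units v, hidden units h, weights W, and biases a and b:

```latex
E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h}
                            - \mathbf{v}^{\top} W \mathbf{h},
\qquad
p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z},
\qquad
Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}
```

Same Boltzmann form as before. The “restricted” part just means there are no visible-to-visible or hidden-to-hidden connections, which is what makes the training tractable.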
The age of deep learning was born.
All of a sudden, a field that was dying a slow death received a life-restoring transfusion.
Also, anyone who didn’t know physics was hosed.
Where We Are Now, and How These Blogs Fit In
Since Hinton and Salakhutdinov published their 2006 paper in Science, the success of deep learning neural networks has escalated, and thousands (upon thousands) of people are trying to enter the AI and deep learning field. They do just fine with the backpropagation part of things.
They are completely hosed when it comes to the statistical physics-based part of deep learning.
Even the 2016 Deep Learning book by Goodfellow, Bengio, and Courville, which has Chapter 18 devoted to “Confronting the Partition Function,” is still a bit too intimidating for someone who has struggled to remember enough of their first-semester calculus to refresh their memory of the chain rule and to work (haltingly and grudgingly) through the backpropagation derivation.
This has led to what I call the Great Divide in AI: those who can’t master the requisite stat mech are forever stuck at Fort Laramie in Wyoming. Many who attempt to traverse the Rocky Mountains of statistical mechanics get lost in a white-out of too much information and not enough direction. (Donner Pass – and I know that the geography gets a little mushed-up at this point, but the meaning is still clear.)
What’s needed is a guide over the Rocky Mountains / Sierra Nevadas – something that will get a person from Fort Laramie through the requisite statistical mechanics and safely able to work with the deep learning processes on the other side.
I’ve been working on that, in bits and pieces, for the last 2 1/2 years.
For the last two years, I’ve been working on a book that would pull this together.
But the book is at least a year off, maybe longer. I may be able to publish some crucial chapters soon, but in the meantime, I’ve had students asking for a structured approach to reading the various pre-book blog posts.
Thus, this post will be the first effort to organize the collection of posts into a structured reading list.
What to Read, in What Order, and How Much Work Is Involved
First – if you’re new to statistical mechanics, and are looking for what amounts to a traveler’s guide through the Rocky Mountains, please ignore anything that I’ve written that deals with the Cluster Variation Method (CVM). You’ll know what these posts are, since they all mention “Cluster Variation Method” or “CVM” in their title. The CVM is an advanced form of statistical mechanics, and it makes no sense at all to try to understand it unless and until you have the basics of stat mech down just fine.
Further, the whole CVM notion (despite some recent computational results of mine) is very much out-there. It’s not yet in use for any practical applications. It won’t be practical for a while, and I’m the only person (to my knowledge) who is doing research on this topic right now. So … for at least the next year or so, this won’t be of value to anyone. I’m just putting myself through the drill of sharing research results, leading up to formal publications.
Second, there are a number of fluffy-wuffy posts. These were typically written when I was trying to keep to a weekly blogpost-publication schedule. (That was a demand I made on myself, and the whole thing fell apart when I spent the better part of 2018 doing AI course development for Northwestern University. Not many posts during those months.) You can safely ignore the fluffy-wuffy ones. (There are a few worth reading. I’ll put them into a separate reading list.)
That leaves the fewer than one-third of the posts that are genuinely useful for teaching yourself statistical mechanics. Again, our goal is to know just enough to be able to kind-of, sort-of navigate through one of the classic deep learning papers.
Your Most-Important First Step
If you’re serious about learning some rudimentary statistical mechanics, please go to my book page. Scroll to the end. You’ll see an “Opt-In” form.
Opt-In.
It will give you an eight-day tutorial email sequence. Some of those emails will be fluffy. Some are hard-core, and contain some really good slidedecks. (If I do say so myself.) Taken together, they are the very best introduction to microstates that I can come up with.
And if you want to understand the partition function, you first need to wrap your head around microstates.
And until you understand the partition function, everything else is a trip through fairy-land.
So Opt-In, and go through the eight days of checking your email. (Look in your alternate folders – “Promotions” or “Junk” – until you find them, and move them to your “Primary” folder.)
They will get you started.
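If you’d like a preview of where the microstates material is headed, here’s a tiny worked example (my own, not taken from the email series): count the microstates of a toy system and build its partition function directly from them.

```python
from itertools import product
from math import comb, exp

# Toy system: 4 binary units, each either "off" (energy 0) or "on" (energy eps).
eps, T = 1.0, 1.0          # energy per "on" unit, and temperature (with k_B = 1)

# A microstate is one complete assignment of on/off to every unit.
microstates = list(product([0, 1], repeat=4))
print("total microstates:", len(microstates))              # 2^4 = 16

# Number of microstates with exactly two units on: "4 choose 2" = 6.
print("microstates with two units on:", comb(4, 2))

# The partition function is a sum over ALL microstates, weighted by energy.
Z = sum(exp(-eps * sum(s) / T) for s in microstates)
print("partition function Z =", round(Z, 4))

# Probability of one particular two-on microstate:
print("p(a specific two-on microstate) =", round(exp(-2 * eps / T) / Z, 4))
```

That is the whole conceptual chain: enumerate the microstates, weight them by their energies, and the partition function is just the sum of those weights.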
Your First Three Blogposts – and Why They’re Important
Here are the first three blogposts that will be useful to you, in order:
- Seven Statistical Mechanical – Bayesian Equations that You Need to Know,
- Seven Essential Machine Learning Equations – A Cribsheet, and
- A Tale of Two Probabilities
Briefly, here’s what each of them does for you.
The First One
This July 2017 post – Seven Statistical Mechanical – Bayesian Equations that You Need to Know – was my first effort to organize what we needed to learn. About halfway through this post is what I think of as the crucial map for the statistical mechanics (and Bayesian theory) that’s essential to AI / deep learning / machine learning. This “map” identifies seven crucial equations, and it’s like looking at a map of the mountain range before we start going through it.
The Second One
The second recommended post is Seven Essential Machine Learning Equations – A Cribsheet. This builds on the first one, and gives you (yes, once again) the Opt-In to my book email series, which is all about microstates. And – MOST IMPORTANT! – it tells you about the Précis!
I had totally forgotten about the Précis. It’s sort of an early chapter draft for the statistical mechanics introduction portion of the book. I’ll have to go re-read it to figure out what I’ve got in there, but I do remember that it builds on the microstates eight-day email tutorial.
The Third One
The third blogpost that I suggest for your reading series is A Tale of Two Probabilities. This builds on the previous two, and introduces something really important: there are two different kinds of probabilities that are foundational to deep learning / machine learning theory.
This is one of the weirdest, insanity-inducing notions in deep learning / machine learning. It’s as though we are living in two different dimensions simultaneously, and then trying to bring these two dimensions into alignment and co-existence with one another.
If we thought it was tough to understand how statistical mechanics (which, as you may know by now, deals with probabilities – but in a physics sense) formed an underpinning for deep learning, then it is totally weird to understand how Bayesian probabilities also form an underpinning.
This is a lot like some of those early alchemical texts trying to fuse the elements of water and fire. Totally mind-boggling, altered-state kind of thinking. And unless we can step back from the equations, and wrap our heads around the enormity of what we’re doing, then we’ll be dominated by the equations, instead of us dominating them.
A must-read.
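To make the two kinds of probabilities concrete, here’s a hedged side-by-side in standard notation (the blogpost itself goes much deeper): the physics-flavored probability of a state, and the Bayesian probability of a hypothesis given data.

```latex
\text{Statistical mechanics:}\quad
p_i = \frac{e^{-E_i / k_B T}}{\sum_j e^{-E_j / k_B T}}
\qquad\qquad
\text{Bayes' rule:}\quad
P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h'} P(d \mid h')\, P(h')}
```

Each one has its own normalizing sum in the denominator – the partition function on the physics side, the evidence on the Bayesian side – and learning to see those two denominators as cousins is half the battle.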
And a Few More
For those of you with extra time on your hands, and a high pain tolerance, I list a couple of other blogs in my recommendations at the end. Feel free.
After we’ve gotten a basic handle on the partition function, our next steps are entropy and free energy. I start in on those two topics in the additional recommended blogs.
One of the most mind-boggling things (and there are several, as always …) is that we often get an equation that has just part of the free energy for a system. We get the enthalpy (which people often present as the “energy” of a system; it’s the enthalpy, and not the total free energy). Or we get the entropy of a system, as when we compute a Shannon entropy or do some other sort of entropy calculation.
However, neither enthalpy (sometimes presented as “energy”) nor entropy lives alone. They’re always connected, and they’re always in this dialectic that leads to free energy minimization.
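In equation form, the relationship that keeps coming back is the free energy. A hedged sketch, in the physical chemist’s notation used above (the deep learning literature usually writes the Helmholtz version, F = E − TS, often with the temperature set to one):

```latex
G = H - T S
% G : free energy  -- the quantity that actually gets minimized
% H : enthalpy     -- the "energy" term that most papers show
% S : entropy      -- Shannon-style or thermodynamic
% T : temperature  -- often absorbed, or simply set to 1, in the ML literature
```

Minimizing the energy term alone and maximizing the entropy alone pull in opposite directions; free energy minimization is the balance point between the two.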
Another important thing to grasp.
So it seems as though every time we get clear on something, there’s another topic – just as important, just as compelling – that appears murky and confusing.
The blogposts, and the forthcoming book, should help us all. (Ultimately, I’m the one who will benefit the most. You know the old adage: the best way to learn something is to teach it. Well, there’s nothing more demanding, as “teaching” goes, than writing a book on the subject.)
Live free or die, my friend –
AJ Maren
Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War
Key References
Useful Books and Papers
Dr. A.J.’s Note: These are highly-regarded papers and book chapters. However, it will be nearly impossible to read them until you’ve got some statistical mechanics solidly under your belt. So … have a quick glance, if you must. Then, settle down and study the stat mech. You’ll know that you’ve learned enough when you can return to these book chapter(s)/papers and know roughly what they’re talking about.
- LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M.A., and Huang, F.J. (2006). A Tutorial on Energy-Based Learning, in G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, and B. Taskar (eds), Predicting Structured Data (Cambridge, MA: MIT Press). pdf
- LeCun, Y. (2008, July 16). Energy-Based Models and Deep Learning. Abstract for seminar. pdf.
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning (Cambridge, MA: MIT Press). URL: www.deeplearningbook.org. Dr. A.J.’s Note: The link is to the entire book. The chapter that I mentioned earlier is Chapter 18; you can use the Table of Contents that this link gives you to jump to that chapter. Chapter 16 is also important and useful, from a statistical mechanics perspective.
Useful Technical Blog Posts
Dr. A.J.’s Note: A bit of a grab-bag. I’ve come across these while working on this post and the current book chapter. They’re good, but I’m not saying that they’re the best starting places – they just looked good when I found them.
- Altosaar, Jaan. How does physics connect with machine learning?
- Martin, C. (2017, July 4). What is free energy: Hinton, Helmholtz, and Legendre. Calculated Content (blogpost series). blogpost. Dr. A.J.’s Note: Charles Martin, who writes these Calculated Content posts, is brilliant. He’s also in that same class of people who are writing about physics for physicists. If we’re successful, one of the outcomes of reading my blogs and book chapters will be that you can then actually read some of Dr. Martin’s posts.
Previous Related Posts
Start here (the same three blogposts listed in the post above):
- Seven Statistical Mechanical – Bayesian Equations that You Need to Know,
- Seven Essential Machine Learning Equations – A Cribsheet, and
- A Tale of Two Probabilities
If you have more energy and a high tolerance for pain: