A Tale of Two Probabilities
Probabilities: Statistical Mechanics and Bayesian
Machine learning fuses several different lines of thought, including statistical mechanics, Bayesian probability theory, and neural networks.
There are two different ways of thinking about probability in machine learning; one comes from statistical mechanics, and the other from Bayesian logic. Both are important. They are also very different.
While these two ways of thinking about probability are usually kept very separate, they come together in some of the more advanced machine learning topics, such as variational Bayes approximations. This post briefly covers the two approaches to probability and identifies their distinct roles in machine learning.
Probabilities in Statistical Mechanics
Statistical mechanics describes a universe in which the components are simple units, and each unit inhabits a specific energy state. In the simplest possible version, the energy for each unit is defined by a single parameter. (Meaning: one energy state is set to zero, and the other is defined by that parameter; this is the simplest-possible two-state system.)
This is the basic statistical mechanics probability equation:
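$$ p_j \;=\; \frac{1}{Z}\, e^{-\beta E_j}, \qquad Z = \sum_j e^{-\beta E_j} $$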
The probability of observing the system in a certain microstate j (a distinct configuration of units across the energy states, with total energy E(j)) is proportional to the exponential of the negative of that energy.
Deciphering the Statistical Mechanics Probability Equation
There are two multiplying factors in this equation. One is the small Greek letter beta (β), which multiplies the energy inside the exponent.
Beta is itself a combination of two terms: Boltzmann’s constant (k-sub-B) and the temperature T (in kelvins, meaning absolute temperature, so 0 degrees centigrade is about 273 K); specifically, beta = 1/(k_B T). In most machine learning applications, we will either arbitrarily set beta to 1, or divide an entire equation through by beta, thus creating “reduced” terms, which will be a topic for later.
The other multiplying factor is 1/Z, where Z (the normalizing factor) is the partition function. It sums over all the possible configurations into which the various units can be distributed.
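To make this concrete, here is a minimal Python sketch (my own toy example, with an arbitrary energy value; not from the Cribsheet) of the simplest-possible two-state system, with beta set to 1 as we will typically do:

```python
import math

# Simplest-possible two-state system: one level at energy 0, the other at epsilon.
# For a single unit, each energy level is its own microstate.
beta = 1.0                    # we arbitrarily set beta to 1
energies = [0.0, 0.5]         # [ground level, upper level]; epsilon = 0.5 is made up

# Boltzmann factors, one per state
boltzmann_factors = [math.exp(-beta * E) for E in energies]

# Z, the partition function, normalizes the factors into probabilities
Z = sum(boltzmann_factors)
probabilities = [factor / Z for factor in boltzmann_factors]

print("Z =", Z)
print("p(ground), p(upper) =", probabilities)  # lower energy -> higher probability
```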
Microstates and Why They’re Important
There’s a subtlety in this statistical mechanics probability equation; it’s the index j.
The index j is not referring to which unit we have in the system, nor is it referencing the different energy levels.
Rather, we’re talking about microstates – distinct configurations that the system can inhabit.
Imagine that every unit in the system has its own RFID tag. Imagine that the energy levels are like the floors in a building. We can use the RFID tags to know whether a given unit is on one floor or another; that is, whether it sits at one energy level or at another.
We say that one microstate is distinct from another if a unit shows up at a different level in the two microstates. That is, if a unit at the upper floor switches places with a unit at a lower floor, that’s a distinctly different microstate. If the units on any given floor just mill around – so it looks as though they’ve traded places with each other, but they’re still on the same floor – those are the same microstate.
To get our statistical mechanics probabilities, we need to sum over all the microstates. This summation creates a normalizing factor, which we call Z (the partition function). (And yes, there is much more detail in the Cribsheet; still to come.)
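Here is a small sketch (again, my own toy code, with made-up energies) that makes the microstate idea explicit: three RFID-tagged (distinguishable) units, two floors (energy levels), and Z computed by summing over every distinct configuration:

```python
import math
from itertools import product

beta = 1.0
level_energies = [0.0, 0.5]   # energies of the lower and upper "floors" (made up)
num_units = 3                 # three RFID-tagged (distinguishable) units

# A microstate assigns each tagged unit to a floor; with 2 floors and 3 units,
# there are 2**3 = 8 distinct microstates.
microstates = list(product(range(len(level_energies)), repeat=num_units))

def total_energy(microstate):
    """Total energy E_j of one microstate: the sum of each unit's level energy."""
    return sum(level_energies[level] for level in microstate)

# The partition function Z is the sum, over microstates j, of exp(-beta * E_j)
Z = sum(math.exp(-beta * total_energy(m)) for m in microstates)

# Probability of each individual microstate j
for m in microstates:
    p_j = math.exp(-beta * total_energy(m)) / Z
    print(m, round(p_j, 4))
```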
How the Statistical Mechanics Probability Helps Us
One of the most crucial statistical mechanics insights is that there will be relatively more units in low-energy states, and fewer units in high-energy states. I have more discussion on all of these terms in the Cribsheet, and there will be a greatly expanded discussion in my forthcoming book, Statistical Mechanics, Neural Networks, and Machine Learning.
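In equation form (standard statistical mechanics, included here for orientation), the relative populations of two energy levels E_i and E_j follow the ratio of their Boltzmann factors:

$$ \frac{N_i}{N_j} \;=\; e^{-\beta\,(E_i - E_j)} $$

so the larger the energy gap, the more sparsely the higher level is populated.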
One of the greatest uses of the statistical mechanics probability function is that it will feed directly into key concepts, such as entropy and free energy. Both of these are very important for machine learning.
Bayesian Probabilities
Comparing the Bayesian perspective on probability to the statistical mechanics one is not just like having two different views of the universe; it’s more like having two entirely different universes – each controlled by very different rules.
In the statistical mechanics universe, the only thing that characterizes a given unit is its energy state; that entire universe is simply a collection of particles or units, each in its particular energy state.
In contrast, in the Bayesian world, any conceivable number of things can exist – the things that we can describe are limited only by our imaginations. We can talk, for example, about the probability (likelihood) that we will have rain on a given afternoon in August, in a particular city. (Or, given the news of this week, we can talk about the likelihood of a particular hurricane hitting a particular city, with a certain level of damage.) We can talk about the probability of a given team winning its next match. We can talk about the likelihood of a patch of forest having a certain kind of tree.
For our purposes (and I’ll be expanding on this more in the Cribsheet, and also in the book), let’s take as an example the probability that a certain person will be a man (instead of a woman), given a certain height.
The key to all Bayesian probability work is that we must have a sufficiently rich set of data – spanning all the possible cases – in order to pre-determine probability values. So, for the example just given (probability of a person being a man if the height is given), we need to have sampled the population of both men and women to determine their height. We would then assemble a data table across all heights that we observed.
We would write our question using Bayesian probability notation as
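$$ p(y \mid x) \;=\; \frac{p(y, x)}{p(x)} $$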
The Left-Hand-Side (LHS) of this equation expresses the conditional dependence, or a posteriori probability: the probability that we will find a certain result (y) given a certain data value (x). (The vertical slash means “conditioned on” or “dependent on.”) The Right-Hand-Side (RHS) has two parts. p(y,x) is the joint probability distribution; it is the distribution of all the measurements taken together (height together with gender, in our example). p(x) is the simple (marginal) probability of a given height occurring in that data set. (We divide by p(x) because we’re asking about gender identification as a function of height.)
In our example of gender-and-height dependence, y stands for the outcome that the person is a man, and x stands for the person’s height.
This is a conditional probability equation; we are seeking a probability or likelihood that the person is a man, conditioned on a specific bit of information, the height.
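As a sketch of what this looks like in practice (with made-up counts, purely for illustration; not real survey data), we can estimate the conditional probability directly from a data table of height-and-gender counts:

```python
# Toy, made-up counts: how many men and women we observed in each height bin (cm).
counts = {
    #  height bin : (men, women)
    "150-160": (2, 18),
    "160-170": (10, 25),
    "170-180": (30, 12),
    "180-190": (20, 3),
}

total = sum(men + women for men, women in counts.values())

def p_man_given_height(height_bin):
    """p(y|x) = p(y, x) / p(x): joint probability over marginal probability."""
    men, women = counts[height_bin]
    p_joint = men / total               # p(man, height_bin)
    p_height = (men + women) / total    # p(height_bin)
    return p_joint / p_height

for h in counts:
    print(h, round(p_man_given_height(h), 3))
```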
There are obvious challenges with this; one is that we need to do a lot of data gathering. Another is that we often need more than one input variable (or independent variable). (Meaning, x can easily become (x, w, v, …); we usually need more than just height to determine a person’s gender.)
Despite Bayesian probability’s demand that there be an exhaustive treasure-trove of data to feed into the equations, the whole approach is very popular – largely because it is mathematically elegant and useful. Still, the data demand is one of the biggest challenges that data scientists and machine learning specialists face when they seek to implement a Bayesian approach.
Sometimes, there is no way in which we can assemble all the possible data sets; there will always be something new and unexpected. That’s why machine learning nowadays puts a great deal of emphasis on inference as well as the original learning. (I recently wrote an article on this for Product Design & Development; it has just been published online, and the link is in the References below.)
How to Know Which Probability Is Which
Usually, from context. And from the form of the equation, and the equations surrounding it.
Machine learning papers don’t normally shift from one “universe” of thinking to another without giving you at least some warning.
More discussion in my Cribsheet, which is ALMOST ready, and you’ll get a separate email announcing it in just a day or so. (You ARE opted in using the Opt-In form in the right sidebar, aren’t you? Yes, you are. THANK YOU! This will help you know when each new good thing is released.)
Probability in Machine Learning Inference Methods
Have a look, please, at the seven essential equations once again.
See those last two equations, the machine learning ones? (Numbers 6 & 7; Kullback-Leibler divergence and variational Bayes, respectively.) They’re all about probabilities.
Yes, those two equations show probabilities of the Bayesian sort. That doesn’t let us off the hook for learning the other type as well.
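For instance, the Kullback-Leibler divergence compares two probability distributions; here is a minimal Python sketch (my own illustration, with made-up distributions) for the discrete case:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * ln(p_i / q_i) for discrete distributions p and q."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Two made-up discrete distributions over three outcomes (illustrative only)
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(kl_divergence(p, q))  # greater than zero; zero only when p and q match
print(kl_divergence(p, p))  # 0.0
```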
Over the past decade or so, machine learning has quietly shifted its foundation; it is now all about energy minimization and expectation maximization. These are codewords, as you’ll see when you get your hands on the Cribsheet, for statistical mechanics-based approaches.
Energy minimization? It’s about minimizing the free energy, which depends 100% on statistical mechanics. (Yes, more in the Cribsheet. Even more in the book; that’s why I’m writing it – so we can ALL get clear.)
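As a small numerical illustration (my own sketch, with made-up energies; not the Cribsheet’s derivation): of all the candidate probability distributions over a set of energy states, the Boltzmann distribution is the one that minimizes the free energy, F = average energy minus temperature times entropy.

```python
import math

beta = 1.0
energies = [0.0, 0.5]   # made-up two-state system, as before

def free_energy(p):
    """F = <E> - (1/beta) * S, where S = -sum p ln p (entropy in natural units)."""
    avg_energy = sum(p_i * E for p_i, E in zip(p, energies))
    entropy = -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)
    return avg_energy - entropy / beta

# The Boltzmann distribution for these two levels
Z = sum(math.exp(-beta * E) for E in energies)
boltzmann = [math.exp(-beta * E) / Z for E in energies]

# Compare against a couple of other candidate distributions
for p in ([0.5, 0.5], [0.9, 0.1], boltzmann):
    print([round(x, 3) for x in p], "F =", round(free_energy(p), 4))
# The Boltzmann distribution gives the smallest F; its value equals -ln(Z)/beta.
```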
Expectation maximization? That means figuring out which (hidden) energy states are most likely, and updating the model accordingly; this, too, leans heavily on statistical mechanics.
If we’re going to do the real, hard-core machine learning, there’s no getting away from or around statistical mechanics.
That’s why we’re taking it straight on.
So we’re going to start with the basics. Probabilities. Microstates. Partition functions.
The math will be just as rigorous as if you were taking an upper-level or graduate class in physical chemistry.
The only difference: we’ll focus ONLY on that which we need to know for machine learning, safely ignoring topics such as the heat capacity, dielectric constant of solids, etc.
At the end, though, you should be able to read some pretty heavyweight papers.
At the end of the Cribsheet, you should at least be able to identify the mountains in your line of sight, meaning – you should at least be able to read an equation and know roughly what it is about, even if you can’t solve it or derive it or use it just yet.
You’ll get the word in just a day or so. Haven’t had a between-academic-quarters vacation yet, and I won’t until this Cribsheet is done, so I’m motivated.
See you soon! – AJM
Live free or die, my friend –
AJ Maren
Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War
P.S. – About That Cribsheet
Yesterday, putting what I thought were the final touches on the Cribsheet, I had a come-uppance. I mean, serious egg on my face. Embarrassing.
You see, I was just a little … off … on my understanding of the summation index in the partition function (which is where we sum up all the Boltzmann factors to get a normalizing factor). Meaning, I was a little off in interpreting the probability equation itself.
The index j referred to microstates, which I had completely forgotten.
I spent a long day digging myself out of that hole; finding references, reading, cross-checking, and starting a new PPT slidedeck that had examples (very detailed, painstaking examples, because I sure was going to get it right THIS time!).
Come midnight, I was up again, working for a few more hours on that slidedeck.
Early morning, and I am oh-so-glad that I got the bulk of this post drafted earlier this week, thinking I was such a smartie and being ahead of schedule.
As it turns out, with the blog post just needing finishing touches (and a bit of discussion on microstates), you’ll have the blog post very soon.
The Cribsheet? A few more days …
And not without a lesson learned, and one that’s valuable to all of us – you as well as me.
It’s the little things that trip us up. Such as not being clear on what the summation index j is referring to.
One of you (R.S., commenting on The Statistical Mechanics Underpinning of Machine Learning) recently said: “I feel in the workplace today, one is expected to have a short runway – always – to be competitive, to deliver value to the internal & external clients. …” The question is: how do we go from a long runway to a short one?
The answer – let me shift metaphors here, back to the notion of crossing the Rockies – is this: we can’t cut out the distance that we have to travel. If we need to learn some statistical mechanics, or Bayesian probability theory, or Kullback-Leibler divergences, or anything else – well, then, we simply have to learn it.
However, we CAN cut out unnecessary, time-consuming, and anxiety-producing side detours. Such as thinking that the summation is over the number of units or energy states, when instead, it’s over microstates.
In practical terms, that means studying with people who are doing their best to lay things out for you.
Fortunately, there are an increasing number of good sources. (Last week, I identified several; check out Labor Day Reading and Academic Year Kick-Off, and be sure to read the Comments, where a couple of you offered some very good resources in addition to what I’d gathered.)
Essentially, there are more good guides through the mountain passes right now.
In a year or two, it will be easier yet. There will be comprehensive, start-to-finish books with coordinated resources, such as slidedecks and teaching videos, complete with code and examples and exercises. (There are already some of these; there will simply be more, and they’ll cover a greater range.)
This is a lot like saying: in a few years, we’ll have a train that will take us through the mountains.
Which we will.
Right now, though, you want to get there in the early stages of the gold rush. That’s why you’re taking on this isolated, uncomfortable, and very difficult journey. The pot of gold awaits.
To your victory, my friend.
Best – AJM
References
Good Basic Intros
- Analytics Vidhya (2016), Bayesian Statistics explained to Beginners in Simple English (June 20, 2016). blogpost. (AJM’s Note: If you don’t know your Bayesian probabilities all that well, this is a fairly decent intro.)
- Maren, A.J. (Dec., 2013) Statistical Thermodynamics: Basic Theory and Equations. THM TR2013-001(ajm).
My Recent – Easy-to-Read – Articles
Some of my own recent papers – you might particularly like the first one, on GPUs, AI, and deep learning:
- Maren, A.J. (2017), How GPUs, AI and deep learning make smart products smarter, Product Design and Development (September/October, 2017), Part 1 – online article.
- Maren, A.J. (2017), Designing like it’s 2025: next-gen CAD technologies give designers “awesome superpowers,” Product Design and Development (July/August, 2017), 14-17, online article.
- Goldberg, L., Crouse, M., and Maren, A.J. (2016), Artificial intelligence invades consumer electronics, Product Design and Development (September, 2016), 10-12, online article.
Previous Related Posts
- The Statistical Mechanics Underpinnings of Machine Learning
- Seven Statistical Mechanics / Bayesian Equations That You Need to Know