Notational Frenzy
When the Subtle Art of Mathematical Notation Defeats You (and How to Fight Back)
A couple of years ago, I was teaching Time Series and Forecasting for the first time. I didn’t know the subject – at all – but that didn’t bother me. Hey, it was mathematics, right? Anything that’s mathematical, I can eat for lunch, and then want some dessert-equations afterwards.
First week, introducing the subject. That went fine.
Second week, Simple Exponential Smoothing (SES). That’s simple. It went fine.
Third week, ARMA – standing for Autoregressive Moving Average. And I got stuck.
I got stuck really, REALLY bad. I couldn’t figure something out … it drove me crazy … I was consumed … (I was also embarrassed, because I was the teacher, and I’m supposed to know this stuff) … and I got sick. Just a weary, draining case of mid-winter flu.
So there I was, feeling really out-of-sorts, and continuing to read away, going after source after source.
If you’re studying machine learning (which is the ONLY reason that you’re reading this blog, right?), you’ve been in that situation. Many times.
Well, I was exhausted with the whole thing. I lay down for a nap. And in this sort of drowsy, hypnogogic state, I suddenly had the insight. And within a couple of seconds after that, I knew how to put together a slide to “tell the story” to my students – who were waiting on me to tell them the story. Of WHERE that particular, persnickety little equation came from, HOW it was derived, WHAT it was going to do for them … I had the whole thing figured out.
Here’s the slide:
The key insight? It was that the text authors had made a notation change, and hadn’t bothered to tell us about it.
Here’s the slide where I explained it:
All this means that the authors had stopped using the “y-hat” notation for the estimate (y with a little caret, or “hat,” on top of it), and had started using just “y” (as a shortcut for the y-estimate at time t, conditioned on the estimate at time t-1). Dropping that little “hat” makes a world of difference.
If that little “hat” means that we’re talking about an estimate, and y without the hat means that we’re talking about the actual observation, then dropping the hat invites huge confusion – unless you tell your audience what you’re doing, which in this case the authors had failed to do.
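To make the hat business concrete, here is my own minimal illustration (my notation, not the textbook’s), using simple exponential smoothing. With the hats in place,

\hat{y}_{t+1|t} = \alpha \, y_t + (1 - \alpha) \, \hat{y}_{t|t-1},

it’s obvious at a glance which symbols are forecasts (the hatted ones) and which is the actual observation (the bare y_t). Write the same recursion with the hats dropped,

y_{t+1|t} = \alpha \, y_t + (1 - \alpha) \, y_{t|t-1},

and now only the conditioning bar tells you that y_{t+1|t} and y_{t|t-1} are still estimates, while y_t is a real data point. Nothing on the page shouts it.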
Okay, I finally figured it out. Made sure that my students knew that this was THE MOST IMPORTANT lecture and slidedeck of the entire quarter. Dragged them by the scruffs of their necks over to that equation and buried their noses in it. All so that they wouldn’t have the same degree of confused-sick that I had gone through.
Now, here’s the real kicker: Most students, in most versions of this course, would not have figured this out. It’s just too subtle a little detail. They’re immensely intelligent, hard-working to an extreme, but even the best intentions have their limits when confronted with the time available for a task.
If I hadn’t buried their noses into that equation, and made a real point and to-do over how the equation was derived, etc., they would have just gone on, plugged in some R code, and run with it. Knowing that there was a weak point in their foundation, but not having the time – or the resources – to deal with it.
That’s just reality, for most folks.
There was an extra lesson in this whole thing for me, though.
I hadn’t had a smack-in-the-face with notational challenges that were this bad since maybe … graduate school, when I was trying to cross-correlate several papers on the same topic, and each person had something just different enough to be mind-boggling.
I learned that notation can be the most important thing. It can make-or-break you. It can also be the most delicate, time-consuming, maddening, and potentially devastating aspect of trying to learn something – especially when different authors have different notations. (Or when certain authors shift notations, without telling you.)
And Yet Again …
Same kind of challenge, just a few months ago. I had been trying to read and understand Karl Friston’s papers.
Now, I love the man. Really, I do.
But the way in which he uses notation drives me bat-sh** crazy.
To the point at which I took all of the prior quarter break, plus a few weeks into the next quarter, to just do the “altered state” thing and work through his equations.
I finally got the insights. I wrote a paper explaining those insights. Some 30 or 40-odd pages, if memory serves, all doing a delicate derivation-and-translation of Karl’s work, cross-referenced against the more common derivation by Matthew Beal, whom Karl had cited. (There are a LOT of sources for this; I was just using Karl’s source, to get the closest comparison.)
So there I was, paper done, I’d even shipped it to Karl and another colleague for their review, and I’d settled down to enjoy a Sunday afternoon.
Except that … this faint uneasiness kept bothering me.
I went back. Reread Karl’s paper, mine, Beal’s. And there it was. As delicate and subtle – and as overwhelmingly important – as could possibly be. It had to do with (of all things) the index for summation, x, that Karl had in his paper.
Quick retraction email.
The next several days, once again, altered-state total-focus on interpreting that subtle-little-life-changing meaning of that equation.
Revised paper out. (Karl likes it, by the way. He’s got some comments – still need to work my way through them, apparently there’s some OTHER subtle little thing that I don’t yet understand … life is full of them.)
Another lesson learned.
It can be as subtle as interpreting the index of summation (or the domain of integration) in an equation. Something that we’d normally just sort of blur out.
And Again …
Since I’ve been telling this story for the past few weeks, you won’t be surprised if I mention my most recent faux pas: I’d misinterpreted the index of summation in the partition function. Now, this is as embarrassing as it can get. I can only plead extreme rustiness on my part.
But as we all know (at least, those of us pursuing the more abstract forms of machine learning), statistical mechanics (most significantly the partition function) is foundational to energy-based machine learning.
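If it’s been a while since you’ve seen it: the partition function Z is a sum of Boltzmann factors taken over every microstate of the system, and it’s precisely that “over every microstate” part that I’d misread. Here’s a minimal sketch, using a toy two-spin system that I’m making up purely for illustration (this is not from the slidedeck):

```python
import itertools
import math

# Toy energy-based model: two binary "spins," each taking values in {-1, +1}.
def energy(state, coupling=1.0):
    s1, s2 = state
    return -coupling * s1 * s2   # aligned spins have lower energy

beta = 1.0   # inverse temperature, 1/T

# The partition function Z sums exp(-beta * E) over ALL microstates --
# every joint configuration of the spins, not the values of a single variable.
microstates = list(itertools.product([-1, +1], repeat=2))
Z = sum(math.exp(-beta * energy(s)) for s in microstates)

# Boltzmann probability of each microstate
for s in microstates:
    print(s, round(math.exp(-beta * energy(s)) / Z, 4))
```

The whole point of the enumeration is that the index of summation runs over joint configurations – all of them – not over the values of any one variable.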
So, once again, I’m doing a variation on the theme of building the slidedeck and hauling people on over to it.
In this case, though, it’s not just a single slidedeck.
It’s a slidedeck plus an extensive series of follow-on tutorial emails, augmented by Content Pages – there’s as much depth and richness in those Content Pages as if I were teaching a course.
Which I am, really. It’s that old adage:
We teach best what we most need to learn. – Richard Bach
I need to learn this stuff, therefore, I’m teaching you.
If I Were in Your Shoes Right Now …
I’ve set up my life so that I can, sometimes, when I really need to, immerse myself in code and equations. I can do that “altered-state” thing, kind of like an extended Zen meditation on an equation.
Most of you can’t.
It’s not a fault; it’s simple practical reality.
Most of you have lives that are considerably more complex than mine. You’re not exactly “equation-optimized.”
So if I were you, trying to learn machine learning, on my own (or even in a course or two or three), I’d find myself in a bit of a pickle.
Not trying to be cute or facetious with you, but the books that are out there right now seem to be written (if they’re dealing with energy-based models) by physicists for physicists, and if you didn’t grow up being a physicist, you’re hosed.
That’s one problem.
Another – just as big, just as challenging – is that there is a LOT of really good teaching material out there: tutorials, white papers, blogs, YouTube vids, Coursera courses … you name it.
But everyone has just a slightly different notation.
O, that way madness lies; let me shun that; No more of that.
Here’s a short, to-the-point, madness-inducing example.
Both David Blei et al. and Matthew Beal have great tutorials out on variational Bayes.
Compare Beal’s equations 2.10 and 2.11 (pg. 47) with the corresponding equations in Blei et al.
The two equations are essentially the same thing. However, the x in Beal’s work is just different enough from the x in Blei et al.’s work to make reading the two papers, side-by-side, a descent into madness. The only solution is to translate one, entirely, into the notational framework of the other.
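To show what I mean – and I’m paraphrasing the standard forms here, not quoting either source verbatim – Beal writes y for the observed data and x for the hidden variables, so his lower bound on the log evidence reads, roughly,

\ln p(y) \ge \int q_x(x) \, \ln \frac{p(x, y)}{q_x(x)} \, dx ,

while Blei et al. write x for the observations and z for the latent variables, so their version of the same bound is the ELBO,

\mathrm{ELBO}(q) = \mathbb{E}_q[\ln p(z, x)] - \mathbb{E}_q[\ln q(z)] \le \ln p(x) .

Same inequality, same argument. But x is a hidden variable in one paper and the observed data in the other, and your eyes will not forgive you for forgetting which is which.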
So, okay. You’ve got to do something. Reading heavily physics-based stuff, if you’re not a physicist, will not work.
Likewise, trying to read/watch/study numerous sources on the same topic, simultaneously, will be far more exhausting and draining than it will be productive. It’s just that the mind goes into all sorts of little circular eddies when it tries to cross-correlate notations.
The solution that I propose is: Find ONE person who is addressing what you want, and get close behind that person, conga-line style. Let that person make the trail.
By analogy: Going up the steep slope of physics, if you’re not prepared with the right background, can be devastating. To time, energy, and self-confidence.
Equally devastating: trying to follow too many trails through the mountain passes, at one time.
Potentially survivable: Staying close to an experienced trail guide, and going through what they say; methodically, diligently, conscientiously.
Once you’ve gotten to the other side – once you’ve reached the Gold Rush in California – then you know enough to come back and try different routes through the mountains.
It’s just getting there the first time that’s a challenge.
Here’s a viable path: Statistical Mechanics, Neural Networks, and Machine Learning – The Book.
Live free or die, my friend –
AJ Maren
Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War
P.S. – Access both my Précis and the bonus Microstates Slidedeck, AND the follow-on tutorial email sequence: Seven Essential Machine Learning Equations: A Cribsheet (Really, the Précis)
You can also get it at: Statistical Mechanics, Neural Networks, and Machine Learning – The Book.
References
- David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe (2016). Variational Inference: A Review for Statisticians. arXiv:1601.00670 [stat.CO]. doi:10.1080/01621459.2017.1285773.
- Matthew Beal (2003). Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London.
Previous Related Posts
- Seven Essential Machine Learning Equations: A Cribsheet (Really, the Précis)
- How to Read Karl Friston (in the Original Greek)
- Seven Statistical Mechanics and Bayesian Equations That You Need to Know
- Approximate Bayesian Inference
- The Single Most Important Equation for Brain-Computer Information Interfaces (Kullback-Leibler)