Getting Started in Deep Learning
It’s been a lovely Christmas day, more social than any I can remember in a very long time. (A wonderful morning visit with my next-door neighbor, a great mid-day party, and a sweet, restorative nap afterwards.) And now, to the thoughts that have been buzzing through and around my head for the past 48 hours: how to get started with deep learning.
Of course, there are all sorts of entry points: historical, functional, mathematical…
But what came to me over these past couple of days, and really just crystallized as I woke up from my nap, was that the crux of the biscuit was, and always has been, the classic credit assignment problem.
So if we need to start someplace (and we do), this is it.
A brief recap.
The great hindrance to building functional neural networks, before their emergence in 1986 (with the publication of the two-volume set Parallel Distributed Processing, or PDP), was the lack of learning algorithms that could drive the weights in a network to appropriate values. More specifically, as a network is trained, the credit for having a given output node take on the right (trained-for, desired) value depends on the weights connecting the hidden nodes to that output node having the right values. In turn, the weights connecting the input nodes to the hidden nodes also have to take on the right values.
In the figure to the right, we’re tracing the credit assignment of the single active output node, O2. O2 receives activating inputs from hidden layer nodes H1, H3, and H4. Thus, the first step in identifying the credit assignment task for this particular configuration lies in figuring out how much of O2’s activation is due to H3, as opposed to H1 or H4.
However, the credit assignment goes another layer deeper. Hidden node H3 receives activating inputs from input nodes I1, I3, and I5. So if we really want to figure out the credit assignment, this means that we need to figure out how much of H3’s activation here is due to each of these specific input nodes.
In particular, if we want to figure out how much I5 influences O2, we’d have to back-calculate O2’s dependence on H3, and then H3’s dependence on I5.
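Written out with the chain rule, the dependence of O2 on I5 along this single path is just the product of two simpler dependencies (the full gradient would sum such products over every path connecting I5 to O2):

$$\frac{\partial O_2}{\partial I_5} \;=\; \frac{\partial O_2}{\partial H_3}\cdot\frac{\partial H_3}{\partial I_5}$$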
This really is not that hard, computationally. In fact, this is EXACTLY what the backpropagation learning algorithm does; it figures out each of these dependencies (using the chain rule). Then, to improve the network’s learning, it subtly adjusts each of the weights involved, so that the resulting outputs move toward the desired values for a given set of inputs.
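To make that concrete, here is a minimal sketch (my own toy example, not the exact network in the figure): a one-hidden-layer network with sigmoid activations, where the chain rule gives the dependence of the output on a single input-to-hidden weight, and a finite-difference check confirms the result. The sizes and weight values are arbitrary.

```python
import numpy as np

# A toy network (not the post's exact figure): 3 inputs, 2 hidden nodes,
# 1 output, sigmoid activations, randomly chosen weights.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2, 0.8])      # input node values
W1 = rng.normal(size=(2, 3))        # input -> hidden weights
W2 = rng.normal(size=(1, 2))        # hidden -> output weights

# Forward pass: hidden values are a function of the inputs,
# and the output is a function of the hidden values.
h = sigmoid(W1 @ x)
y = sigmoid(W2 @ h)

# Credit assignment via the chain rule:
# dy/dW1[i, j] = (dy/dh[i]) * (dh[i]/dW1[i, j])
dy_dh = (y * (1.0 - y)) * W2                       # output's dependence on each hidden node
dh_dW1 = (h * (1.0 - h))[:, None] * x[None, :]     # each hidden node's dependence on its weights
dy_dW1 = dy_dh.T * dh_dW1                          # credit assigned to each input->hidden weight

# Sanity check: nudge one weight and compare with a finite-difference estimate.
eps = 1e-6
W1_pert = W1.copy()
W1_pert[1, 2] += eps
y_pert = sigmoid(W2 @ sigmoid(W1_pert @ x))
print(dy_dW1[1, 2], ((y_pert - y) / eps).item())   # the two values should nearly match
```

A real training loop would repeat this for every weight (and average a loss over many examples), but the core credit-assignment calculation is exactly this chain-rule product.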
The breakthrough in solving the initial credit assignment problem (via the backpropagation algorithm) is due to Paul Werbos (1974 Ph.D. thesis). Backpropagation was later independently developed and popularized by Rumelhart, Hinton, and Williams in the 1986 PDP volumes. The success of the backpropagation learning rule applied to the Multilayer Perceptron (MLP), together with that of the Boltzmann machine learning rule, gave rise to the enormous popularity of neural networks over the following decade.
However, there was a problem in applying this approach to credit assignment to networks that were much deeper than a couple of hidden layers. Let’s frame this mathematically, and also pictorially.
Mathematically, we can say that the output of O2 is a function of the inputs that it receives from the H nodes. We can write:

$$y = f(x)$$

This just says that y, the value of an output node, is a function of x, the inputs.
More precisely, though, we’d want to say that y is a function of a function. The hidden node values are a function of the inputs, which we’ll write as $f^{(1)}$ (read the superscript as “function 1”; this is not $f^{-1}$, “f inverse”), and the output node values are a function of the hidden node values, $f^{(2)}$. In other words, the outputs really are a function of a function.

We say that $f^{(2)}$ is composed with $f^{(1)}$, and write it as follows:

$$y = f^{(2)}\!\left(f^{(1)}(x)\right)$$
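As a concrete (if toy) illustration of the “function of a function” idea, here is a minimal Python sketch. The weights, sizes, and sigmoid activation are all made up purely for illustration; the point is only that each layer is an ordinary function, and the network output is their composition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights only; their specific values don't matter here.
W1 = np.array([[0.2, -0.4, 0.1],
               [0.7,  0.3, -0.5]])   # input -> hidden
W2 = np.array([[0.6, -0.9]])         # hidden -> output

def f1(x):
    """Hidden-node values: a function of the inputs."""
    return sigmoid(W1 @ x)

def f2(h):
    """Output-node values: a function of the hidden-node values."""
    return sigmoid(W2 @ h)

x = np.array([1.0, 0.0, 1.0])
y = f2(f1(x))   # a function of a function: f2 composed with f1
print(y)
```

Swapping in different weights or activations changes the numbers, not the structure: the output is always $f^{(2)}(f^{(1)}(x))$.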
The backpropagation learning rule works very well with this level of complexity. In fact, it can even handle another layer.
It’s when we start to create networks with multiple hidden layers that things become more difficult.
What happens, for example, when we try to do credit assignment in a network that has three hidden layers, as shown in the accompanying figure?
Specifically, we want to trace the credit assignment through each and every path. A single illustrative path is shown in the following figure.
The influence of input node I5 now passes through several different functions, and is combined with the influence of the other active input nodes (I1 and I3) several times, in several ways. In fact, since we now have three hidden layers along with an output layer, there are a total of four layers of functions, each working off the preceding layer.
Mathematically, we would write this as follows:

$$y = f^{(4)}\!\left(f^{(3)}\!\left(f^{(2)}\!\left(f^{(1)}(x)\right)\right)\right)$$
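Tracing any one input’s credit through this stack means applying the chain rule layer by layer, so the gradient is a product of four factors, one per layer of functions (written loosely, treating each $f^{(k)}$ as the value it produces):

$$\frac{\partial y}{\partial x} \;=\; \frac{\partial f^{(4)}}{\partial f^{(3)}}\cdot\frac{\partial f^{(3)}}{\partial f^{(2)}}\cdot\frac{\partial f^{(2)}}{\partial f^{(1)}}\cdot\frac{\partial f^{(1)}}{\partial x}$$

Every additional hidden layer adds another factor to this product, which is exactly why credit assignment gets delicate as networks get deep.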
So here’s the problem. Mathematically, this is the “crux of the biscuit” when it comes to credit assignment. When an input goes through multiple functions, and is combined and recombined several times, it becomes more difficult to trace how much any one step contributes to the importance of that input (and its path) in determining the final outcome.
Mathematically, people note that the gradient becomes shallower and shallower as it is traced back through the layers; this is the well-known “vanishing gradient” problem. We’ll broach that topic in a different post. For now, it should be intuitively clear that the deeper the network, the more diffuse the credit assignment becomes, because any single initial input goes through so many nonlinear steps of combination and transformation.
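As a rough numerical illustration (a toy construction of mine, not the post’s example), the sketch below pushes a random input through a stack of sigmoid layers and backpropagates the gradient of the output back to the input. Because the sigmoid’s derivative never exceeds 0.25, the gradient norm shrinks rapidly as the stack gets deeper.

```python
import numpy as np

# Toy demonstration of shrinking gradients in a deep sigmoid stack.
# All sizes and weight scales here are arbitrary choices for illustration.
rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient_norm(depth, width=8):
    """Norm of d(sum of outputs)/d(input) for a stack of `depth` sigmoid layers."""
    x = rng.normal(size=width)
    Ws = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
          for _ in range(depth)]

    # Forward pass, keeping each layer's activations for the backward pass.
    activations = [x]
    for W in Ws:
        activations.append(sigmoid(W @ activations[-1]))

    # Backward pass: chain rule applied one layer at a time.
    grad = np.ones(width)                       # d(sum of outputs)/d(final activations)
    for W, a in zip(reversed(Ws), reversed(activations[1:])):
        grad = W.T @ (grad * a * (1.0 - a))     # sigmoid'(z) = a * (1 - a) <= 0.25
    return np.linalg.norm(grad)

for depth in (1, 3, 5, 10, 20):
    print(depth, input_gradient_norm(depth))    # the norm drops sharply with depth
```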
The backpropagation algorithm still works, but it becomes more cumbersome.
This, then, is largely what kept neural networks stymied during the 1990s and 2000s. They worked very well with backpropagation when there were only one, two, or sometimes three hidden layers. With more than that, it was very difficult to obtain good results.
Another solution was needed.
This solution became the core of the resurgence of neural networks, now relabeled deep learning.
Deep Learning References for the Credit Assignment Problem
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 6, “Deep Feedforward Networks” (Cambridge, MA: MIT Press), pp. 168-169. AJM’s Comment: The various equations here are intended to help people going through the very first page(s) of Chapter 6 in the Deep Learning book.
- A very useful discussion on Quora: Is a Backpropagation Neural Network with Many Hidden Layer Technically Considered Deep Learning?