Selecting a Neural Network Transfer Function: Classic vs. Current
Neural Network Transfer Functions: Sigmoid, Tanh, and ReLU
Making it or breaking it with neural networks: how to make smart choices.
Why We Weren’t Getting Convergence
This last week, while working with a very simple and straightforward XOR neural network, a lot of my students ran into convergence problems.
The most likely reason? My choice of transfer function.
I had given them a very simple network. (Lots of them are still coming up to speed in Python, so simple = good.)
The network diagram is on the left; it shows the flow of credit assignment from the final summed-squared-error (SSE) to the input nodes, and particularly to the various connection weights. (The bias terms for both the hidden and output nodes are included in the code, but are not shown in this figure.)
My students were getting convergence sometimes, but not all the time.
Since they were (mostly) newbies to neural networks, they were doing things like running their network for many, MANY iterations – hoping that it would finally converge.
This sometimes-converging and sometimes-not was frustrating for them. Not to mention, it was taking too much of their time.
It’s ok for all of us to experiment, but we don’t have infinite time for go-nowhere experimentation, so I stepped back to think about root cause.
The most likely trip-up? It wasn’t our scaling parameters – not the alpha inside the transfer function, and not the eta that controlled our learning rate.
Most likely, the problem was that I was using the same sigmoid (logistic) transfer function in BOTH the hidden and output layers.
For a simple classification network, a sigmoid transfer function for the output nodes makes sense: 1 or 0, yes or no. Using that same sigmoid in the hidden layer, though, tends to make things tricky.
Here’s the equation that’s impacted; it is the change to a specific hidden-to-output connection weight, v(h,o).
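Written out (a sketch in my notation, reconstructed rather than copied from the figure, and assuming an SSE loss and a sigmoid output node with steepness parameter alpha), the standard backpropagation update for that weight looks like this:

```latex
% Sketch of the hidden-to-output weight update (assumed notation):
%   eta = learning rate, alpha = transfer-function steepness parameter,
%   t_o = target value, F_o = output-node activation, H_h = hidden-node output
\[
\Delta v_{h,o} = \eta \, \alpha \, (t_o - F_o)\, F_o \,(1 - F_o)\, H_h
\]
```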
If the summed input to a hidden node is strongly negative, then the sigmoid output of that node is close to zero, which means that it’s hard to backpropagate a change value through that hidden node. (That output is the H-sub-h in the equation above.)
The most likely useful fix is to use two different transfer functions:
- Output nodes: Sigmoid (logistic) transfer function, and
- Hidden nodes: Tanh (hyperbolic tangent) transfer function.
It’s time for me to add just a bit more to that code. It may very well be that experimenting with two versions – one that uses the sigmoid (logistic) function only, and another that uses the sigmoid for the output layer and the hyperbolic tangent (tanh) for the hidden layer – will give us a very nice side-by-side comparison.
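To make that comparison concrete, here is a minimal sketch of the two transfer-function sets (the function names and the ALPHA constant are my own; the class code may organize this differently). The same training loop can then be pointed at either pair:

```python
import numpy as np

ALPHA = 1.0  # assumed steepness parameter inside the transfer functions

def sigmoid(x):
    """Logistic transfer function; output bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-ALPHA * x))

def sigmoid_deriv(f):
    """Sigmoid derivative, expressed in terms of the function's own output f."""
    return ALPHA * f * (1.0 - f)

def tanh(x):
    """Hyperbolic tangent transfer function; output bounded in (-1, 1)."""
    return np.tanh(ALPHA * x)

def tanh_deriv(f):
    """Tanh derivative, expressed in terms of the function's own output f."""
    return ALPHA * (1.0 - f * f)

# Version 1: sigmoid everywhere, as in the network described above.
sigmoid_only = {"hidden": (sigmoid, sigmoid_deriv),
                "output": (sigmoid, sigmoid_deriv)}

# Version 2: tanh in the hidden layer, sigmoid at the output.
tanh_hidden = {"hidden": (tanh, tanh_deriv),
               "output": (sigmoid, sigmoid_deriv)}
```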
Here’s a brief recap of transfer functions, including the best links that I could find while prepping this week.
Classic Transfer Functions: Sigmoid and Tanh
When functional, problem-solving neural networks emerged in the late 1980s, two kinds of transfer functions were most often used: the logistic (sigmoid) function and the hyperbolic tangent (tanh) function. Both of these functions are continuous (smooth), monotonically increasing, and bounded. The sigmoid function is bounded between 0 and 1, and the hyperbolic tangent (tanh) function is bounded between -1 and 1.
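For reference, here are the two functions and their derivatives, written with a unit steepness parameter (some formulations, including the code above, put an alpha inside the exponent):

```latex
% Logistic (sigmoid) function, bounded in (0, 1), and its derivative:
\[
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
\]
% Hyperbolic tangent, bounded in (-1, 1), and its derivative:
\[
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x)
\]
```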
When I first started working with neural networks, I used the sigmoid function. Even now, it’s a very good choice for an output unit transfer function, because it gives an output value between 0 and 1. This is useful if we’re doing a classification neural network.
The hyperbolic tangent (tanh) function has often been much more effective within the neural network itself; i.e., within the hidden nodes. The reason is that when we use the sigmoid function, a strongly negative summed input into a node causes the sigmoid transfer function to produce an output close to 0. That means that it is then hard to change the weights associated with that hidden node; it essentially becomes a dead node.
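A quick numeric check makes the point; the net input of -2.0 is just an illustrative choice:

```python
import numpy as np

net_input = -2.0  # an illustrative, moderately negative summed input to a hidden node

h_sigmoid = 1.0 / (1.0 + np.exp(-net_input))  # ~0.12: close to zero
h_tanh = np.tanh(net_input)                   # ~-0.96: magnitude close to one

# The hidden-to-output weight change is proportional to the hidden node's
# output H_h, so a near-zero sigmoid output throttles the update, while the
# tanh output keeps the backpropagated signal alive.
print(f"sigmoid hidden output: {h_sigmoid:.3f}")
print(f"tanh hidden output:    {h_tanh:.3f}")
```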
Sidenote: This last week, working with the XOR neural network, my students were getting frustrated with how often the network failed to converge. They had bias nodes into both the hidden and output nodes, but the network that I’d given them used the sigmoid transfer function throughout. I bet that if I rewrote that code so that there was a hyperbolic tangent transfer function at the hidden node, and still kept the sigmoid transfer function at the output node, we’d get more convergence. That’s a little project for this week.
Current Transfer Functions: ReLU
More recently, the ReLU (Rectified Linear Unit) function has become popular. It is neither smooth nor bounded, but works well in applications that have very large numbers of units (e.g., convolutional neural networks, or CNNs) as well as for the Restricted Boltzmann Machine (RBM).
The basic ReLU function is continuous, but it is not smoothly differentiable; its derivative jumps at zero. However, we can code around that small issue. Also, there is a lovely approximation to the ReLU that is continuous and smoothly differentiable.
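Here is a minimal sketch of both, using the common coding convention of taking the ReLU derivative at exactly zero to be zero; the smooth approximation sketched here is the softplus, ln(1 + e^x), which is the logarithmic approximation I have in mind in the recap below:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: max(0, x). Continuous, but its derivative jumps at x = 0."""
    return np.maximum(0.0, x)

def relu_deriv(x):
    """Conventional choice: treat the derivative at exactly x = 0 as 0."""
    return np.where(x > 0, 1.0, 0.0)

def softplus(x):
    """Smooth, logarithmic approximation to ReLU: ln(1 + e^x)."""
    return np.log1p(np.exp(x))

def softplus_deriv(x):
    """The softplus derivative is just the logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-x))
```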
Both functions are shown in the following figure.
Summary and What to Watch/Read
Quick Recap:
- Hyperbolic tangent (tanh) function: For hidden nodes in most neural architectures that are trained using backpropagation, and for output nodes if you don’t mind the lower value being -1 instead of 0.
- Sigmoid (logistic) function: For output nodes in a network where you want the outputs to be between 0 and 1.
- ReLU: For hidden nodes in a convolutional neural network (CNN) where you are generating outputs of the convolutional layers, and for the hidden (“latent”) nodes in an RBM.
- Logarithmic ReLU approximation: For when you want ReLU-like properties but with a smooth derivative.
The readings below are selected from some good web sources. I highly recommend the online videos. The discussions in various technical forums are very good.
I’ve identified two high-quality papers that have been frequently cited: one by Michael Jordan on the logistic function (a 1995 classic), and a more recent paper by Nair and Hinton on the ReLU. Both are good if you need something a bit more heavyweight.
Enjoy!
Live free or die, my friend –
AJ Maren
Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War
Good Online Video Tutorials
- Andrew Ng, Coursera lecture 32: Activation Functions. A very nice lecture that covers all the basics: the sigmoid (logistic) function, tanh, and the ReLU. Only 10 minutes, very pleasant.
- Andrej Karpathy’s lecture on Backpropagation – also very good, and worth 1 hr 10 minutes of your time.
Useful Online Technical Discussions
- ReLU and Softmax Activation Functions (Github: Kulbear)
- How Does Rectilinear Activation Function Solve the Vanishing Gradient Problem (Stackexchange)
Backpropagation-specific Online Technical Discussions
- Yes, You Should Understand Backpropagation, by Andrej Karpathy
- Deep Neural Network – Backpropagation with ReLU (Stackexchange)
- Approximation function for ReLu to use in backpropagation
Good Citable References
- Jordan, M. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks. Technical Report, Massachusetts Institute of Technology. pdf
- Nair, V., & Hinton, G.E. (2010). Rectified linear units improve restricted Boltzmann machines, Proc. ICML’10; Proc. 27th International Conference on Machine Learning (Haifa, Israel, June 21 – 24, 2010), 807-814. pdf
Previous Related Posts
- Backpropagation: Not Dead, Not Yet – previous week’s post; why backpropagation is important – not just for basic and deep networks, but also for understanding new learning algorithms.
- Deep Learning: The First Layer – an in-depth walk-through / talk-through the sigmoid transfer function; how it works, how it influences the weight changes in the backpropagation algorithm; similar arguments would apply to the tanh function.
- Getting Started in Deep Learning – a very nice introductory discussion of (semi-)deep architectures and the credit assignment problem; backpropagation-oriented.