Deep study of a not very deep neural network. Part 3a: Optimizers overview

Rinat Maksutov
10 min read · May 4, 2018


In the previous part we saw how different activation functions perform with the RMSProp optimizer. Now we will test various optimizers with the same set of activation functions to see which combination works best.

The list of optimizers we will be comparing consists of:

  • Stochastic Gradient Descent;
  • RMSProp;
  • Adam;
  • AdaGrad;
  • AdaDelta;
  • AdaMax;
  • Nadam.

These optimizers are also implemented in Keras and can be used out of the box. Each optimizer has a few parameters, with default values set according to the original papers. We will use these defaults, except for RMSProp, where, as in the previous part, the learning rate will be set to 0.001 instead of 0.01. When I was preparing the experiment I found that the value of 0.001 for RMSProp makes the training results more comparable to those of the other optimizers.
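For reference, here is roughly how this set of optimizers can be instantiated in Keras 2.x (a minimal sketch; apart from the RMSProp learning rate, all values are the library defaults, and the dictionary name is mine):

```python
from keras import optimizers

# All defaults, except RMSprop, where the learning rate is set to 0.001
optimizer_set = {
    'SGD':      optimizers.SGD(lr=0.01, momentum=0.0, nesterov=False),
    'RMSProp':  optimizers.RMSprop(lr=0.001),
    'Adam':     optimizers.Adam(),
    'AdaGrad':  optimizers.Adagrad(),
    'AdaDelta': optimizers.Adadelta(),
    'AdaMax':   optimizers.Adamax(),
    'Nadam':    optimizers.Nadam(),
}
```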

Optimizers

As this is not an introductory tutorial, I assume that the reader already knows what an optimizer is and how it works. There are lots of good sources on optimizers for those who would like to get a basic understanding, like this and this. Just to recap: an optimizer is an algorithm whose purpose is to minimize the objective function. In plain words, it calculates how to change the weights of your neural network so that the error becomes lower on each iteration.

As with activation functions, many optimizers have been developed over the last decades, and it is not easy for a beginner to pick the right one. Hundreds of research papers present, discuss and compare various optimizers, and the results of these comparisons are often contradictory. You may see that in one paper an experiment shows that a certain optimizer is better than others, while in another paper, in a different experiment, that same optimizer performs worse. It is common in Deep Learning that there is no ‘silver bullet’: the choice of an optimizer may be determined by the nature of the data, the architecture of the neural network, the formulation of the problem, and many other factors. Besides that, almost every optimizer has a number of parameters, which may have a significant effect on the quality of training. It is often hard to find an explanation of these parameters, and most other tutorials almost always use the default values, except for the learning rate.

In this series we are comparing optimizers applied to an image classification problem with a fully-connected neural network. But before that, I would like to demonstrate the differences between the optimizers, and, more importantly, show how changing certain parameters of an optimizer affects the training process.

For this we will run another experiment, this time a much simpler one.

Consider a linear function y = a*x + b. We will train a neural network consisting of just one layer with one neuron and a linear activation, and its aim will be to find the values a and b. Yes, it is plain old linear regression, and there is no point in doing it with neural networks. But this example will allow us to visualize the training process and see how each optimizer arrives at the target values.

As you may remember, each neuron has a weight and a bias. So the output of this neuron with linear activation will be weight * input + bias, which has the same form as our target function. Therefore the optimal weight and bias found for this neuron will represent the a and b values of our function.

Both the neuron's weight and its bias will be initialized to 0. The optimizers will be minimizing the Mean Squared Error, which for this starting point is equal to 0.065929635, and the number of epochs will be set to 100.

The notebook with the code for this experiment is available on my GitHub, so you may experiment with other target functions.
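For orientation, the setup can be sketched roughly as follows (a minimal Keras 2.x version; the data generation and variable names are illustrative, not necessarily the exact code from the notebook):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

# Target function y = a*x + b that the single neuron should recover
a, b = 0.1, 0.3
x = np.random.uniform(size=(1000, 1))
y = a * x + b

# One layer, one neuron, linear activation; weight and bias both start at 0
model = Sequential()
model.add(Dense(1, input_dim=1, activation='linear',
                kernel_initializer='zeros', bias_initializer='zeros'))

# Swap in any of the optimizers above; MSE is the objective being minimized
model.compile(loss='mean_squared_error',
              optimizer=optimizers.SGD(lr=0.01, momentum=0.95))
model.fit(x, y, epochs=100, verbose=0)

weight, bias = model.get_weights()
print(weight.ravel()[0], bias[0])  # should approach a and b
```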

I have randomly chosen a and b to be equal to 0.1 and 0.3 respectively. Now let's have a look at whether the optimizers managed to find these values, how changing the optimizer parameters affects the training process, and how each optimizer approached the solution.

Stochastic Gradient Descent

In Keras, SGD has, besides the learning rate, two more parameters: momentum and nesterov. Momentum determines how much the previous gradients affect the current update. It accelerates SGD by pushing it along the relevant direction and softens the oscillations in irrelevant directions. The nesterov parameter tells whether to use the Nesterov variant of SGD, which helps prevent the optimizer from overshooting the minimum when the momentum is too high.
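To make the distinction concrete, here is a rough sketch of a single parameter update for both variants (plain Python, following the usual formulation, which is close to what Keras does internally):

```python
def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9, nesterov=False):
    """One SGD update for a single weight w, given its current gradient."""
    velocity = momentum * velocity - lr * grad   # decaying accumulation of past gradients
    if nesterov:
        # Nesterov: look ahead along the accumulated velocity before applying the gradient
        w = w + momentum * velocity - lr * grad
    else:
        w = w + velocity
    return w, velocity
```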

This is how the learning process progresses with various values for momentum with nesterov = False:

Fig.1 Visualisation of the optimizer paths. SGD with different momentum values

You can see that for very low momentum the oscillations are nearly absent, and the weights change very slowly. The models with momentum >= 0.9 were unable to get close to the minimum (which we know in this case). When the momentum is too high the oscillations are large, and with momentum 0.99 the optimizer first overshot the minimum and only then started converging. The best momentum value was 0.95, resulting in relatively stable training and the lowest error.

Now compare that to the case when the nesterov parameter is set to True:

Fig.2 Visualisation of the optimizer paths. SGD with different momentum values and Nesterov = True

The training becomes much more stable, and even though the optimizer with 0.99 momentum again overshot the minimum at first, it still reached the lowest MSE after 100 epochs. In both cases the optimizers with the default momentum value of 0.0 were not able to get close to the minimum by the end of the training. Therefore we can say that the right momentum value accelerates training, and nesterov makes it more stable. However, it is not always true that enabling nesterov or setting a higher momentum produces better results. Look at this graph:

Fig.3 Visualisation of the optimizer paths. SGD with too high momentum values

It looks like someone gave a pencil to a 2-year-old child. Momentum values that are too high, as well as too low, result in very poor performance. Also, for momentum = 0.95 the optimizer without nesterov slightly outperforms the one with the same momentum and nesterov enabled:

From this experiment it is evident that neither momentum nor nesterov alone will guarantee a lower error. Instead, you should enable nesterov and try several momentum values to find the one that gives the best results.

AdaGrad

AdaGrad is a variation of SGD in which the learning rate is adjusted separately for each parameter on each step, according to the history of gradients for that parameter. For parameters whose gradients have been small, AdaGrad decreases the learning rate less, so they keep moving towards the optimum faster. For parameters with large gradients the learning rate is decreased more, resulting in smaller steps, so that the optimizer does not jump over the optimum.
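One update step can be sketched roughly like this (plain Python, following the standard formulation: the step is scaled by the inverse square root of the accumulated squared gradients):

```python
import numpy as np

def adagrad_step(w, grad, grad_sq_sum, lr=0.01, eps=1e-7):
    """One AdaGrad update for a single weight: the accumulated sum of squared
    gradients only grows, so the effective learning rate only shrinks."""
    grad_sq_sum = grad_sq_sum + grad ** 2
    w = w - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum
```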

AdaGrad does not have any tunable parameters except for the initial learning rate and the learning rate decay, so here I will only show the training process with several decay rates:

Fig.4 Visualisation of the optimizer paths. AdaGrad

The training is very stable, and as we can see, the best decay rate was 0.0. In fact, setting a decay rate for AdaGrad doesn't make much sense, because the learning rate is already being adapted by the algorithm itself.

RMSProp (Root Mean Square Propagation)

This optimizer combines the ideas of momentum-based SGD (an exponential moving average of past gradients, here applied to their squares) and AdaGrad (a per-parameter adaptive learning rate).
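A rough sketch of one RMSProp step, following the standard formulation (the decaying average of squared gradients is controlled by the parameter called rho in Keras):

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq_grad, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update: rho controls the exponential moving average of
    squared gradients that scales the step for each parameter."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(avg_sq_grad) + eps)
    return w, avg_sq_grad
```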

It has the same learning rate and decay parameters, but also adds the rho parameter, which plays a role similar to momentum in SGD. Let's test different rho values with decay 0.0 and learning rate 0.01:

Fig.5 Visualisation of the optimizer paths. RMSprop with different RHO values

The training is very stable for all rho values, and almost all of the models were able to get very close to the minimum, except for rho = 0.0. The notable ones are those with rho > 0.99: for 0.9999 the first step went in a very wrong direction, but contrary to what we saw with SGD momentum above 0.99, the final result was excellent:

AdaDelta

Another extension of the AdaGrad optimizer, which instead of accumulating all past squared gradients keeps only a decaying window over the most recent ones. The parameters of this optimizer are similar to those of RMSProp. This is how it performs with different rho values:

Fig.6 Visualisation of the optimizer paths. AdaDelta with different RHO values

Note that for rho values below 0.9999 the 'higher is better' rule does not hold. Also, the training was very slow despite a very large initial learning rate, so keep in mind that with this optimizer you may need more iterations than with the others.
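For completeness, one AdaDelta step sketched in the usual formulation (in addition to the squared-gradient average it tracks a decaying average of the squared updates themselves; Keras also multiplies by a learning rate, which defaults to 1.0):

```python
import numpy as np

def adadelta_step(w, grad, avg_sq_grad, avg_sq_update, lr=1.0, rho=0.95, eps=1e-7):
    """One AdaDelta update: both squared gradients and squared parameter updates
    are tracked with exponential moving averages controlled by rho."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    update = np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    w = w - lr * update
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    return w, avg_sq_grad, avg_sq_update
```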

Adam

Adam stands for adaptive moment estimation. It is similar to AdaDelta and RMSProp, but in addition to what those two do, it also keeps an exponentially decaying average of the past gradients themselves. The beta_1 and beta_2 parameters are the decay rates of these averages, equal by default to 0.9 and 0.999 respectively. The image below shows that these defaults are not the only good values for our specific case; other good combinations are {0.95, 0.999}, {0.9, 0.9999} and {0.95, 0.9999}:

Fig.7 Visualisation of the optimizer paths. Adam with different beta_1 and beta_2 values

The closer beta_1 is to 0.9, the less the optimizer takes the previous gradients into account, and the weaker the 'jumps' to the sides. Higher beta_2 increases the optimizer's velocity towards the minimum. However, when both values are too high, the optimizer overshoots the minimum because of the velocity gained on the previous steps. Interestingly, all configurations except for the last two found the values a = 0.1 and b = 0.3 exactly.
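For reference, one Adam step in the usual textbook form (bias-corrected first and second moments; library implementations may fold the correction into the learning rate instead):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update: m is the decaying average of gradients (first moment),
    v the decaying average of squared gradients (second moment), t the step count."""
    t += 1
    m = beta_1 * m + (1 - beta_1) * grad          # first moment
    v = beta_2 * v + (1 - beta_2) * grad ** 2     # second moment
    m_hat = m / (1 - beta_1 ** t)                 # bias correction
    v_hat = v / (1 - beta_2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v, t
```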

AMSGrad

Adam also has a variant called AMSGrad (in Keras it is enabled by setting amsgrad = True on the Adam optimizer). It uses the maximum of the past squared gradients, which allows rarely occurring minibatches with large, informative gradients to have a greater influence on the overall direction, instead of being diminished by the exponential averaging of plain Adam.
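In Keras this is just a flag on Adam; conceptually, the only change relative to the Adam sketch above is that the second moment used in the step is replaced by a running maximum (the comment below is a conceptual sketch, not Keras internals):

```python
from keras.optimizers import Adam

# Enable the AMSGrad variant with a single flag
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, amsgrad=True)

# Conceptual difference vs. plain Adam (sketch):
#   v_max = max(v_max, v)                       # keep the largest second moment seen so far
#   w = w - lr * m_hat / (sqrt(v_max) + eps)    # use v_max instead of the corrected v
```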

This is how the training process graph looks for AMSGrad:

Fig.8 Visualisation of the optimizer paths. Adam with AMSgrad and different beta_1 and beta_2 values

Quite similar to Adam: again the 0.99 value for beta_1 resulted in overshooting the minimum, while the rest managed to find the desired values. The visual difference between regular Adam and AMSGrad is that the latter's path is more stable.

AdaMax

This optimizer takes the idea of Adam one step further: for the second-moment term it takes the maximum of beta_2 times the accumulated past value and the magnitude of the current gradient. This results in smaller and smoother movement towards the minimum:

Fig.9 Visualisation of the optimizer paths. Adamax with different beta_1 and beta_2 values

Note that the default value for beta_2 (0.999) was not the best in our case: lower beta_2 values (0.95–0.99) demonstrate better performance.
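One AdaMax step sketched in the usual formulation (this follows the infinity-norm variant described in the Adam paper, where 0.002 is the suggested learning rate):

```python
import numpy as np

def adamax_step(w, grad, m, u, t, lr=0.002, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One AdaMax update: Adam's second moment is replaced by an infinity-norm
    style quantity u = max(beta_2 * u, |grad|), which needs no bias correction."""
    t += 1
    m = beta_1 * m + (1 - beta_1) * grad
    u = np.maximum(beta_2 * u, np.abs(grad))
    w = w - (lr / (1 - beta_1 ** t)) * m / (u + eps)
    return w, m, u, t
```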

Nadam

The last optimizer we are exploring is again a development of the previously discussed ones. Nadam (Nesterov-accelerated Adaptive Moment Estimation), as you may guess from its name, combines Adam and Nesterov-accelerated SGD, so it is both stable and gets very close to the minimum:

Fig.10 Visualisation of the optimizer paths. Nadam with different beta_1 and beta_2 values

The best two configurations here used beta_1 equal to 0.95 and 0.99 with beta_2 = 0.99. While the authors suggest different values, this example shows that it is always worth trying various combinations to see what works best for each specific case.
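In Keras, Nadam is a separate optimizer class with the same beta parameters as Adam; a usage sketch with one of the combinations that worked best in this experiment (not the library defaults) could look like this:

```python
from keras.optimizers import Nadam

# One of the best combinations found in this experiment (defaults: beta_1=0.9, beta_2=0.999)
optimizer = Nadam(beta_1=0.95, beta_2=0.99)
```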

Conclusion

It is interesting that only Adam (and its AMSGrad variation) and RMSProp optimizers were able to perfectly find the a and b values. That doesn’t mean that more sophisticated optimizers are worse. It always depends on the task, the configuration of the network, the data and many other factors.

And of course, there are more optimizers out there used for training neural networks. There is a whole area of mathematics devoted specifically to the search for more advanced and sophisticated optimization methods and algorithms. I hope this brief overview gave you an intuition of what your options are and how they differ in terms of the training process. You may find the source code and the final rankings of the optimizers in my notebook on GitHub.

In the next part we will come back to our simple neural network and test these optimizers with various parameters, to see which one works best.

I'm always happy to meet new people and share ideas, so if you liked the article, consider adding me on LinkedIn.

Deep study of a not very deep neural network series:
