Deep study of a not very deep neural network. Part 3b: Choosing an optimizer

Rinat Maksutov
May 8, 2018

In this part we continue discussing optimizers. In the previous article I showed, using a very simple example, how optimizers work and how the choice of parameters affects the way they move towards the minimum. Now it is time to test them on the MNIST dataset and see which one is the best.

Ways to evaluate an optimizer’s performance

The obvious way to name the best optimizer is to rank them by the validation accuracy at the end of training. However, this is tricky. As you have seen in the previous part, some optimizers may reach the minimum very quickly, whereas others move very slowly. If you train for 100 epochs, you cannot be sure that the value you see at the 100th epoch would not have improved at the 101st.

Moreover, the initialization of the weights of our neural network (which we will discuss in future parts) involves randomness, so each time you train the network, the final results may differ significantly for the same set of parameters. To make the results comparable, it seems reasonable to average the results of each optimizer across some number of experiments.

Another source of randomness for many types of optimizers is computing and applying gradients. Remember SGD, which takes a random mini-batch to compute the error and decide which direction to move in at the next step. This leads to situations where, in the middle of training, the optimizer gets lucky and receives a very representative mini-batch, which allows it to greatly improve the accuracy on the entire validation set. But at the next step it receives a poorly balanced mini-batch, the accuracy drops and never returns to that maximum. There are ways to deal with such situations, which will be covered later in the series, but for now let’s assume they do not exist. In this case it is worth exploring how far the maximum accuracy observed over the entire training process is from the accuracy obtained on the last epoch. If the difference is large, we may say that the training was unstable, and the maximum accuracy we observe for a particular configuration is just a one-time event which we cannot rely on. On the other hand, if the observed maximum accuracy is close to the value on the last epoch, we can be fairly sure that the results of this particular configuration are reproducible.

To sum up, we will be testing our optimizers from two perspectives: what is the maximum accuracy averaged across five experiments, and how far this maximum is from the average accuracy on the last epoch.
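
To make these two metrics concrete, here is a minimal sketch of how they could be computed from per-epoch validation accuracies. This is not the code from my notebook; the `histories` array below is just made-up data standing in for five training runs.

```python
import numpy as np

# Made-up stand-in: validation accuracy per epoch for five runs
# of one configuration (rows = runs, columns = epochs).
histories = np.random.uniform(0.95, 0.985, size=(5, 100))

max_per_run = histories.max(axis=1)   # best accuracy seen during each run
last_per_run = histories[:, -1]       # accuracy at the last epoch

avg_max = max_per_run.mean()                   # metric 1: averaged maximum accuracy
avg_gap = (max_per_run - last_per_run).mean()  # metric 2: how far the max is from the last epoch

print("Averaged max accuracy: %.4f" % avg_max)
print("Average max-to-last-epoch gap: %.4f" % avg_gap)
```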

Experiment results — Quick facts

For your convenience, here’s the link to the entire rankings table. (I’ve separated the configurations which did not converge at all, or whose accuracy was below 90%, into a separate sheet.)

In part 2, when testing different activation functions with the RMSProp optimizer, the maximum observed accuracy was 98.2% for SELU activation, and the maximum averaged accuracy was 98.02% for ELU activation, both on normalized data. This time, by changing the optimizer and fine-tuning its parameters, we have been able to significantly improve these results. The averaged maximum is now 98.17% for ELU activation with a customized Adamax optimizer on normalized data, and the overall maximum is 98.37% for SELU activation with customized Adam on standardized data.

Now let’s examine the top-20 configurations, sorted by averaged max:

The first thing you may spot is that only two activations are present in this top-20: ELU and SELU. In fact, they fully occupy the top-100, with the only exception being SoftPlus, which occurs 3 times in the entire top-100.

Next, there is no single optimizer which can be called “the best”. The difference between the first and the 20th place is just 0.08%. Also, the majority of the top-performing optimizers have non-default parameters. There are a few cases where optimizers trained with default parameters performed slightly better than the customized ones, but in general we see that fine-tuning an optimizer’s parameters leads to higher accuracy.

Also, there’s no clear preference in terms of data transformation type: both normalized and standardized types are equally present.

What is notable here is that the difference between the maximum achieved value and the value at the last epoch can vary a lot, as can the difference between the averaged maximum and the single best maximum across the five experiments. At the top there are both very stable configurations (e.g. #1, 3) and configurations for which these differences are relatively large (#12, 16).

Here’s what the general picture looks like:

Fig.1 Average maximum accuracy for various optimizers

We see that Adam-based optimizers perform better, with Adamax showing the best results. I understand that averaging them is not very scientific; it would be better to compare specific configurations of these optimizers. Nevertheless, this picture is a good illustration of the general trends in the results of my experiment. Just to confirm the trend, let’s look at the same averaged values, but only for ELU activation:

Fig.2 Average maximum accuracy for various optimizers with ELU activation

Adam and Adamax are again the best, followed by Adadelta and RMSProp.

Now let’s have a closer look at each optimizer and discuss what worked well for each of them.

SGD

Using the right momentum value significantly improves the accuracy: in our case a value of 0.9 demonstrated the best results on both normalized and standardized data. Nesterov momentum did not affect the accuracy very much, but in other tasks the difference may be larger. So the recommendation here is to always use Nesterov momentum: it won’t make things worse.
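
For reference, here is roughly what this configuration looks like with the Keras optimizer API (the parameter names throughout this article follow Keras); the learning rate below is simply the Keras default, not a tuned value.

```python
from keras.optimizers import SGD

# SGD with the momentum value that worked best in these experiments,
# plus Nesterov momentum; lr is left at the Keras default.
sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)

# Typical usage (model definition omitted):
# model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
```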

This diagram shows the maximum accuracy averaged across all activation functions. This way you can get an idea of the general performance of the various configurations of the optimizer without being overwhelmed by the number of columns.

Fig.3 Average maximum accuracy for SGD optimizer with various parameters

In terms of data transformation types and activations, SGD performs better on standardized data with ELU and SELU activations.

Adagrad

This optimizer does not have any tunable parameters, so here we are able to dig one level deeper and show the performance for each activation:

Fig.4 Average maximum accuracy for AdaGrad optimizer

What is evident here is the optimizer’s strong preference for standardized data. And as before, ELU activation was the best, with ReLU seemingly coming second.
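
For clarity, here is a quick sketch of the difference between the two data transformations; the pixel array below is made-up data standing in for the flattened MNIST images.

```python
import numpy as np

# Made-up stand-in for flattened MNIST images with pixel values in [0, 255].
x_train = np.random.randint(0, 256, size=(1000, 784)).astype('float32')

# One common way to normalize: squash pixel values into [0, 1].
x_norm = x_train / 255.0

# Standardization: zero mean and unit variance per pixel,
# the transformation Adagrad clearly preferred here.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0) + 1e-8   # avoid division by zero for constant pixels
x_std = (x_train - mean) / std
```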

RMSProp

Here the default configuration was not the best:

Fig.5 Average maximum accuracy for RMSProp optimizer with various parameters

Higher RHO values lead to higher accuracy for both standardized and normalized data. However, when RHO is too high (0.999), the performance drops significantly. So the recommendation is to test values in the range 0.9–0.99 when using this optimizer.
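
A simple way to follow this recommendation is to sweep the RHO values when constructing the Keras RMSprop optimizer; the sketch below only builds the optimizers, with the training loop itself omitted.

```python
from keras.optimizers import RMSprop

# Candidate rho values: the recommended 0.9-0.99 range,
# plus the too-high 0.999 that hurt accuracy in these experiments.
for rho in (0.9, 0.95, 0.99, 0.999):
    optimizer = RMSprop(lr=0.001, rho=rho)
    # model.compile(optimizer=optimizer, loss='categorical_crossentropy',
    #               metrics=['accuracy'])
    # ...train and record the validation accuracy for this rho...
```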

Adadelta

Adadelta behaves very similarly to RMSProp, and RHO values in the range 0.9–0.99 work well here too:

Fig.6 Average maximum accuracy for AdaDelta optimizer with various parameters
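
Note that the Keras default for Adadelta already sits in this range, so a sensible starting point is something like the following sketch.

```python
from keras.optimizers import Adadelta

# rho=0.95 is the Keras default and falls inside the 0.9-0.99 range
# that worked well here; lr=1.0 is also the Keras default for Adadelta.
adadelta = Adadelta(lr=1.0, rho=0.95)
```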

Adam

This optimizer demonstrated very close results for almost all configurations. The differences between them only become evident when we zoom the x-axis in to a very short range:

Fig.7 Average maximum accuracy for Adam optimizer with various parameters

It is notable that the default configuration has shown the best results, and the optimizer performs better on standardized data. Also, setting AMSGrad to true generally improves the results.
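
In Keras terms, the two configurations worth trying first are therefore the default one and the AMSGrad variant; the parameter values below are simply the Keras defaults.

```python
from keras.optimizers import Adam

# Default Adam, which showed the best results here...
adam_default = Adam(lr=0.001, beta_1=0.9, beta_2=0.999)

# ...and the AMSGrad variant, which also tended to improve the results.
adam_amsgrad = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, amsgrad=True)
```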

Adamax

Fig.8 Average maximum accuracy for Adamax optimizer with various parameters

Here the default configuration was again the best, and again the optimizer prefers standardized data. A good choice for beta_1 is 0.9–0.95, and for beta_2 it is 0.999.
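
For reference, the Keras defaults for Adamax already match these recommendations.

```python
from keras.optimizers import Adamax

# Keras defaults: lr=0.002, beta_1=0.9, beta_2=0.999; beta_1 values
# up to about 0.95 were also a good choice in these experiments.
adamax = Adamax(lr=0.002, beta_1=0.9, beta_2=0.999)
```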

Nadam

Fig.9 Average maximum accuracy for Nadam optimizer with various parameters

Here we don’t see standardized data being preferred over normalized data, and custom configurations work better than the default one. The optimal beta_1 values are the same as for Adamax, but for beta_2 the value of 0.99 is nearly as good as 0.999.
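
A custom Nadam configuration of the kind that did well here would look roughly like this (lr left at the Keras default).

```python
from keras.optimizers import Nadam

# beta_1 as for Adamax; beta_2=0.99 was nearly as good as the default 0.999.
nadam = Nadam(lr=0.002, beta_1=0.9, beta_2=0.99)
```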

Which optimizer is the best?

In short, there is no single definite answer to this question. Our experiment demonstrated that it depends on how you adjust the parameters of the optimizer and on how you configure your neural network. In our case, the Adam and Adamax optimizers with default configurations were better than the others, but this may not hold for other tasks and other data. As always, you should check several options and play with the optimizers’ parameters to see which ones work better for your case.

It is also worth noting that an important thing to consider when choosing an optimizer may be its stability. The notebook on my GitHub visualizes the training process of each configuration, and from it you can see, for example, that SGD, despite reaching slightly lower accuracy, leads to more consistent results than the other optimizers. I have found a very good discussion of this issue:

https://www.quora.com/Why-do-the-state-of-the-art-deep-learning-models-like-ResNet-and-DenseNet-use-SGD-with-momentum-over-Adam-for-training

Probably, if we had given SGD some more training time and fine-tuned our network a bit more, it would eventually have reached much higher accuracy values.

It is time to wrap up our discussion of the optimizers and move on to the next part, where we will talk about the learning rate hyperparameter, and how changing it may push the model’s accuracy even higher. See you in the next part!

I’m always happy to meet new people and share ideas, so if you liked the article, consider adding me on LinkedIn.

Deep study of a not very deep neural network series:

