Deep study of a not very deep neural network. Part 4: How to find the right learning rate

Rinat Maksutov
8 min read · May 29, 2018

In the previous parts of the series we discussed the training process from the perspective of activation functions and optimizers. There were several cases where the accuracy stopped improving and the optimizer kept jumping around the minimum without being able to get closer to it. I mentioned there that this can probably be fixed by adjusting another hyperparameter of our neural network — the learning rate.

As with many other parameters, you will often read that there is no strict rule for choosing the optimal learning rate right from the start. You are told to experiment with various values and identify the one that works best for your specific case. You are also told that it is often beneficial to change the learning rate as training progresses. In this part we will discuss how to choose the initial learning rate, and compare some of the techniques for adjusting it over the course of training.

Why learning rate matters

The learning rate, as you may already know, controls the size of the step an optimizer takes towards the minimum of the loss function. Remember the experiment we performed when visualizing the training process. Let’s do it once again, but now only for various learning rate values:

Fig.1 Visualization of the training process for SGD optimizer with various learning rates

Here we see that in most cases the optimizers were not able to get close to the minimum in the given time. Among other reasons, the step sizes were too small for them to move fast enough, so they stopped quite far from the minimum. The optimizer with a relatively high learning rate, on the other hand, was able to nearly reach the minimum, and given enough training time it would eventually get to the desired point.

The difference in step sizes is clearly seen in the first several epochs: for the red line representing the 0.01 rate the intervals are shorter, whereas for the other rates they are significantly larger. This is why the latter progressed much further over the same number of epochs.

Let’s test various learning rates with another optimizer — RMSProp:

Fig.2 Visualization of the training process for RMSprop optimizer with various learning rates

Here all the optimizers were able to get close to the area surrounding the minimum, but the ones with higher learning rates (0.2, 0.1 and 0.05) got stuck a bit further from it. Their initial steps were too large, so they slightly ‘missed’ the right direction and were not able to get back on track. This is because the gradients for the bias at some point became too small to let them adjust the direction of movement, while the gradients for the kernel were still relatively large (compare the scale of the bias and kernel axes).

Now consider another example:

Fig.3 Validation accuracy of RMSProp optimizer with Leaky ReLU activation with alpha = 0.01

This graph illustrates the process of training a network with the RMSProp optimizer and Leaky ReLU activation from part 2 of this series. The maximum accuracy was reached around the 20th–30th epoch, and then it floated up and down until the end of training. This means that the optimizer was walking around the minimum but wasn’t able to get closer to it, because the learning rate (and hence the step size) was too high. If we had reduced the learning rate at some point during training, the optimizer would have been able to take smaller steps and could eventually have found a better optimum, resulting in higher accuracy.

These examples illustrate the idea that choosing the right initial learning rate plays an important role in how close the neural network can get to the minimum. It is essential to try several rates (at least for a limited number of epochs) and identify the one that allows fast enough progress while not making the optimizer get stuck around the minimum. While selecting the right initial learning rate is a matter of experimentation, controlling it during training can easily be done using certain techniques.
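For example, a quick search over candidate rates could look roughly like this. This is a minimal sketch assuming the Keras 2.x API; build_model() and the x/y arrays are hypothetical placeholders for your own model and data, and depending on the Keras version the history key may be 'val_acc' or 'val_accuracy'.

```python
# A minimal sketch of a quick learning-rate search (Keras 2.x API assumed).
from keras.optimizers import SGD

candidate_rates = [0.2, 0.1, 0.05, 0.01]

for lr in candidate_rates:
    model = build_model()  # hypothetical helper returning a fresh, uncompiled model
    model.compile(optimizer=SGD(lr=lr),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    # Train only for a few epochs: enough to compare how fast each rate progresses.
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=10, batch_size=128, verbose=0)
    print('lr = %.3f, best val accuracy = %.4f'
          % (lr, max(history.history['val_acc'])))
```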

Learning rate decay

The most primitive technique is to reduce the learning rate on each iteration of training by some very small value (the decay rate). This way, as training progresses, the steps become smaller and smaller, potentially allowing the optimizer not to miss the minimum. However, this technique has one large disadvantage: it is very likely that the learning rate will be decreased too early, so that at some point the optimizer will not be able to take steps large enough to make any significant progress. So again, this becomes a matter of experimenting with various decay rates and epoch numbers, which takes quite a lot of time.

In Keras, learning rate decay can be controlled using the decay parameter of the chosen optimizer.
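For example (a minimal sketch assuming the Keras 2.x optimizer API, where decay is a constructor argument; the decay value is purely illustrative, and model is assumed to be an already-built Keras model):

```python
# With decay > 0 the effective learning rate shrinks roughly as
# lr / (1 + decay * iterations) as training progresses.
from keras.optimizers import SGD

optimizer = SGD(lr=0.01, decay=1e-4)  # illustrative decay value

model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```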

Reducing the learning rate on plateau

A smarter approach is to adjust the learning rate only when the optimizer cannot improve the results over some number of epochs. This situation tells us that the optimizer has reached a plateau, and in order to improve the results and move further down the error surface towards a new minimum, it needs the step size to be reduced. The idea here is simple:

  1. Watch the target training metric (the loss, for example);
  2. If it does not improve over a certain period, e.g. 5 or 10 epochs, reduce the learning rate by some factor, i.e. new_learning_rate = old_learning_rate * reduce_factor;
  3. Wait for some cooldown period, letting the optimizer explore the error surface with the new learning rate;
  4. After the cooldown period ends, repeat step 1.

The benefits of this approach are clear: we only reduce the learning rate when it is really needed, and give the optimizer enough time to find a better path towards the minimum. Keras has a very convenient callback called ReduceLROnPlateau, which implements exactly this kind of learning rate adjustment.
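A minimal usage sketch (assuming the Keras 2.x callbacks API; the monitor, factor, patience and cooldown values map directly to the four steps above and are illustrative, not tuned; model and the data arrays are assumed to be defined elsewhere):

```python
from keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor='val_loss',  # step 1: the metric to watch
                              factor=0.5,          # step 2: new_lr = old_lr * factor
                              patience=10,         # epochs without improvement before reducing
                              cooldown=5,          # step 3: epochs to wait after a reduction
                              min_lr=1e-6)         # never go below this learning rate

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=200, batch_size=128,
          callbacks=[reduce_lr])
```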

Now let’s come back to our main experiment, and test our neural network with several learning rates and various learning rate adjustment options.

Fixed LR vs Decay vs ReduceLROnPlateau

To save some time when testing the various learning rate adjustment options, I have limited the number of optimizers and activations used in our experiment, leaving SELU and Sigmoid as activation options and Adamax and SGD as optimizers. I believe they adequately represent the variation in activation functions and optimization approaches, and allow me to complete the experiment in reasonable time. Also, this time I’ve increased the number of training epochs to 200, to allow the models with lower learning rates to demonstrate their limits.
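For reference, the three learning rate management options being compared can be configured roughly like this (a sketch assuming the Keras 2.x API; the decay and callback settings shown are illustrative, not the exact values used in the experiment):

```python
from keras.optimizers import Adamax, SGD
from keras.callbacks import ReduceLROnPlateau

# Option 1: fixed learning rate (Keras defaults: 0.002 for Adamax, 0.01 for SGD)
fixed_adamax = Adamax(lr=0.002)
fixed_sgd = SGD(lr=0.01)

# Option 2: per-iteration learning rate decay
decayed_adamax = Adamax(lr=0.002, decay=1e-4)

# Option 3: fixed starting rate plus the ReduceLROnPlateau callback,
# passed later via model.fit(..., callbacks=[plateau_cb])
plateau_cb = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10)
```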

After training the models and averaging their results, here’s what we get for the SELU + Adamax combination:

The baseline values here are the ones with a fixed LR of 0.002. We see that only a few configurations managed to perform better than this default, and all of them only on the standardized data. They increased the average accuracy by 0.05–0.15%. For normalized data the default fixed learning rate was still the best.

In our case Adamax benefited from a lower fixed learning rate (0.001 as opposed to the default 0.002), but the biggest increase was achieved by using ReduceLR with a higher reduction factor. Note that the higher the factor, the smaller the learning rate decrease: when the reduction factor is 0.5 the new learning rate is 0.5 * 0.002 = 0.001, and when the reduction factor is 0.2 the new learning rate is 0.2 * 0.002 = 0.0004. Such a large decrease in the learning rate may slow down training too much.

The learning rate decay method also demonstrated some accuracy increase on the standardized data, but for normalized data it was among the worst.

Fig.4 Average validation accuracy of Adamax optimizer with SELU activation with various learning rate adjustment options

Interestingly, the situation for the Sigmoid + Adamax combination is the opposite:

The higher accuracy values are now for the normalized data, and the default (fixed LR = 0.002) is the highest. Learning rate decay is now among the worst performers.

Fig.5 Average validation accuracy of Adamax optimizer with Sigmoid activation with various learning rate adjustment options

As for the SGD optimizer with SELU, the rankings are different again:

The default learning rate for SGD is 0.01, and in this case we see that a higher fixed LR of 0.1 resulted in the best accuracy, both for normalized and standardized data.

Fig.7 Average validation accuracy of SGD optimizer with SELU activation with various learning rate adjustment options

The results for the SGD + Sigmoid combination are almost identical at the top of the table, so I won’t include them here. The graph demonstrates this clearly:

Fig.8 Average validation accuracy of SGD optimizer with Sigmoid activation with various learning rate adjustment options

Now comes the most interesting part. If we look at the learning rate management options that resulted in the highest accuracy for each combination of activation, optimizer and data, this is what we get:

For every combination except SELU + Adamax on standardized data, the fixed default learning rate was the best. This means that my hypothesis in part 3, that the results could be improved by tweaking the learning rate, was wrong. But not entirely, as we can see from the very first row of this table: in a specific case using ReduceLR can improve your model, so never ignore this opportunity, but keep the reduction factor high, so that the learning rate is not reduced too much and too early.

As a reminder, it must be noted that these results are specific to the particular case we are considering. A fixed learning rate will not always produce the best results, otherwise the means for adjusting it would never have been invented. When designing a neural network, always try different approaches to learning rate management and see what works best for your problem.

As always, the code is available on GitHub, and in the next part we will explore other parts of nearly every neural network architecture: Dropout and Noise, which make trained models much more stable and help to reduce overfitting. Stay tuned!

I’m always happy to meet new people and share ideas, so if you liked the article, consider adding me on LinkedIn.

Deep study of a not very deep neural network series:

