Deep study of a not very deep neural network. Part 5: Dropout and Noise

Rinat Maksutov
Jul 11, 2018 · 6 min read

In the previous experiments there were a few cases where the validation accuracy gradually declined after reaching some maximum, which is a clear sign of overfitting. It happens when the network adjusts its weights so that it shows very good results on the training set (the error keeps getting lower), while on the test set the error starts to grow. Essentially, the network ‘memorizes’ the training data, but doesn’t work well on data it hasn’t seen and hasn’t been trained on.

In this part we will discuss two approaches to deal with this issue: Dropout and adding noise to the layer inputs.

Dropout

Dropout was introduced several years ago by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov in their paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”.

If you already have some experience with neural networks, you may know that the main idea of dropout is to randomly turn off some of the units in a neural network on each training iteration, together with all their input and output connections, so that they are not affected by the gradient updates:

Source: Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov (2014)

This technique prevents overfitting and makes the network more stable. How does it do that?

First, it prevents the network’s units from ‘learning’ the training data. When all units are always on, they eventually become adapted to the training data so that the error is the lowest possible. On the other hand, when some of the units are off on each iteration, in the forward pass the error is computed only on the remaining units, and only their weights are changed in the backward pass. This prevents the units from co-adapting to the training data and allows for better generalization.

Second, by dropping a certain percentage of units on each iteration, you are essentially getting a new network architecture every time. By repeating this many times, you are effectively training many different networks, each with a different architecture but shared parameters. And when it is time to make predictions (dropout is only active at training time), the resulting network roughly represents a combination of all these networks, which, as the authors say, “nearly always improves the performance of machine learning methods”.
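To make the mechanics concrete, here is a minimal NumPy sketch of what happens to a single layer’s activations, assuming the ‘inverted dropout’ formulation that Keras and most modern frameworks use (the original paper instead rescales the weights at test time, but the effect is equivalent). The shapes and the rate below are arbitrary placeholders, not values from this series’ experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.5                                    # fraction of units to drop (placeholder value)
activations = rng.normal(size=(32, 128))      # hidden-layer outputs: batch of 32, 128 units

# Training time: a fresh random mask turns off a different subset of units
# on every iteration; the survivors are rescaled so the expected output
# stays the same as without dropout.
mask = rng.random(activations.shape) >= rate
train_output = activations * mask / (1.0 - rate)

# Test time: dropout is inactive and every unit participates, which roughly
# corresponds to averaging all the 'thinned' networks seen during training.
test_output = activations
```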

In essence, the authors’ summary of the Dropout idea is:

Complex co-adaptations can be trained to work well on a training set, but on novel test data they are far more likely to fail than multiple simpler co-adaptations that achieve the same thing.

A variation of Dropout is AlphaDropout. The authors of the “Self-Normalizing Neural Networks” paper discovered that instead of setting activations to 0 (which happens in regular Dropout and works well with the ReLU activation), setting them to the negative saturation value of the SELU activation function produces better results when used with this activation. This way AlphaDropout keeps the mean and variance of the inputs of a layer with SELU activation at their original values, ensuring the self-normalizing property even after dropout.
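In Keras, both variants are available as layers that you simply insert between the existing layers of a model. Below is a minimal sketch; the layer sizes, rates and input shape are made-up placeholders rather than the configuration used in this series:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, AlphaDropout

# Regular Dropout, typically paired with ReLU activations.
relu_model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dropout(0.2),                 # drop 20% of the previous layer's units on each step
    Dense(512, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax'),
])

# AlphaDropout, meant to be used with SELU activations (and the matching
# lecun_normal initializer) so the self-normalizing property is preserved.
selu_model = Sequential([
    Dense(512, activation='selu', kernel_initializer='lecun_normal', input_shape=(784,)),
    AlphaDropout(0.1),
    Dense(512, activation='selu', kernel_initializer='lecun_normal'),
    AlphaDropout(0.1),
    Dense(10, activation='softmax'),
])
```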

Noise

Adding noise is another way to prevent a neural network from ‘learning’ the training data. Noise here is a matrix of small random values (different on each training iteration) that are added to the outputs of a layer. This way the subsequent layers cannot co-adapt too much to the outputs of the previous layers.

Keras has two implementations of the noise layers:

  • GaussianNoise: a general noise layer, which adds zero-centered Gaussian noise with a specified standard deviation;
  • GaussianDropout: a combination of Dropout and multiplicative 1-centered Gaussian noise; the rate specifies the fraction of units to be dropped, and the standard deviation of the noise is calculated as sqrt(dropout_rate / (1 - dropout_rate)). A usage sketch of both layers follows below.
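Both noise layers are used exactly like Dropout: you place them after the layer whose outputs you want to perturb, and they are only active at training time. A minimal sketch with placeholder sizes and rates:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, GaussianNoise, GaussianDropout

rate = 0.2
# Standard deviation of the multiplicative noise used by GaussianDropout:
noise_std = np.sqrt(rate / (1.0 - rate))      # = 0.5 for rate = 0.2

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    GaussianNoise(0.1),          # additive zero-centered noise with stddev 0.1
    Dense(512, activation='relu'),
    GaussianDropout(rate),       # multiplicative 1-centered noise with stddev noise_std
    Dense(10, activation='softmax'),
])
```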

What to keep in mind when using Dropout

The paper mentioned at the beginning of this article lists several points that have to be considered if you decide to use Dropout. I’ll cite them here with some minor edits of my own:

  • Network Size. Dropping units reduces the capacity of a neural network. If n is the number of hidden units in a layer and p is the dropout rate, then after dropout only (1 - p)n units will remain on average. Therefore, if an n-sized layer is optimal for a standard neural net on a given task, a good dropout net should have at least n/(1 - p) units (see the sketch after this list).
  • Learning Rate and Momentum. Dropout introduces a significant amount of noise in the gradients compared to standard stochastic gradient descent, so many gradients tend to cancel each other out. To make up for this, a dropout net should typically use 10-100 times the learning rate that was optimal for a standard neural net, or a high momentum. While momentum values of 0.9 are common for standard nets, with dropout values around 0.95 to 0.99 work quite a lot better. Using a high learning rate and/or momentum significantly speeds up learning.
  • Dropout Rate. Typical values of the dropout rate are in the range 0.2 to 0.5. For input layers, the choice depends on the kind of input: for real-valued inputs (image patches or speech frames), a typical value is 0.2. For hidden layers, the choice of the dropout rate is coupled with the choice of the number of hidden units n. A higher rate requires a larger n, which slows down training and can lead to underfitting; a smaller rate may not produce enough dropout to prevent overfitting.
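As a rough illustration of the first two points, here is how a hypothetical baseline configuration might be adjusted when dropout is added; the baseline numbers below are made up for the example, not taken from the experiments in this series:

```python
# Hypothetical baseline that worked well without dropout.
baseline_units = 256
baseline_lr = 0.01

p = 0.5                                            # dropout rate

# Network size: keep roughly the same effective capacity after dropping units.
dropout_units = int(baseline_units / (1.0 - p))    # 256 / 0.5 = 512

# Learning rate and momentum: compensate for the noisier, partly cancelling gradients.
dropout_lr = baseline_lr * 10                      # the paper suggests 10-100x the baseline
dropout_momentum = 0.95                            # instead of the usual 0.9

print(dropout_units, dropout_lr, dropout_momentum)  # 512 0.1 0.95
```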

Testing Dropout and Noise

The results of this experiment have been quite surprising. If we look at the best accuracy for each combination of the activation, optimizer and data type, this is what we see:

Neither regular Dropout nor AlphaDropout appears in this table. The top results for all combinations (except for SGD + Sigmoid) are achieved by adding Gaussian noise, and the best ones come from combining Gaussian noise and Dropout. The improvement compared to the default configuration (Dropout with a 0.2 rate) varies from 0.1% to nearly 0.3% accuracy, which is quite a lot.

Another finding is that in the majority of cases the lower rate (0.1) for the dropout variations provides higher validation accuracy:

The value of 0.1 does not fall into the range recommended in the paper mentioned at the beginning of this article, and this is another case where the default values, or those recommended in research papers, produce sub-optimal results. It is a good moment to once again repeat the idea that runs through this series: never trust the default values, because what worked well for one case may not work for another.

So let’s summarize what we found in this experiment:

  • GaussianDropout and GaussianNoise may be better choices than regular Dropout;
  • Lower dropout rates (<0.2) may lead to better accuracy, while still preventing overfitting.

Next time we will switch to a completely different topic and investigate how the initial weights of our network’s layers affect the results of training. See you there!

I’m always happy to meet new people and share ideas, so if you liked the article, consider adding me on LinkedIn.

Deep study of a not very deep neural network series:
