# Introduction

Dropout<ref>https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf</ref> is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the network during training, so that training effectively samples from an exponential number of different “thinned” networks. At test time, the effect of averaging the predictions of all these thinned networks is approximated by a single unthinned network.

By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability $p$, independently of other units ($p$ can be chosen using a validation set, or simply set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).

# Model

Consider a neural network with $\ L$ hidden layers. Let $\bold{z}^{(l)}$ denote the vector of inputs into layer $l$ and $\bold{y}^{(l)}$ the vector of outputs from layer $l$. $\ \bold{W}^{(l)}$ and $\ \bold{b}^{(l)}$ are the weights and biases at layer $l$. With dropout, the feed-forward operation becomes:

$\ r^{(l)}_j \sim Bernoulli(p)$
$\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}$ , here * denotes an element-wise product.
$\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i$
$\ y^{(l+1)}_i=f(z^{(l+1)}_i)$ , where $f$ is the activation function.

For any layer $l$, $\bold r^{(l)}$ is a vector of independent Bernoulli random variables, each of which is 1 with probability $p$. $\tilde {\bold y}^{(l)}$ is the thinned output of layer $l$ after some of its units are dropped, and it serves as the input to layer $l+1$. The rest of the model is the same as a regular feed-forward neural network.
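The four equations above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `dropout_forward`, the `tanh` activation, and the layer sizes are assumptions made for the example.

```python
import numpy as np

def dropout_forward(y, W, b, p, rng, f=np.tanh):
    """One dropout feed-forward step from layer l to layer l+1.

    y : outputs of layer l, shape (n_in,)
    W : weights of layer l+1, shape (n_out, n_in)
    b : biases of layer l+1, shape (n_out,)
    p : probability of retaining each unit of layer l
    f : activation function (tanh is an arbitrary choice here)
    """
    r = rng.binomial(1, p, size=y.shape)  # r_j ~ Bernoulli(p)
    y_tilde = r * y                       # element-wise product drops units
    z = W @ y_tilde + b                   # inputs into layer l+1
    return f(z), r

# Hypothetical sizes and values, chosen only for illustration.
rng = np.random.default_rng(0)
y = np.array([0.5, -1.0, 2.0, 0.1])
W = rng.normal(size=(3, 4))
b = np.zeros(3)
y_next, r = dropout_forward(y, W, b, p=0.5, rng=rng)
```

The mask `r` is returned alongside the activations because the backward pass needs the same mask that was used in the forward pass.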

## Backpropagation in Dropout Case (Training)

A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that, for each training case in a mini-batch, a thinned network is sampled and forward and backward propagation are done only on that thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case that does not use a parameter contributes a gradient of zero for that parameter.
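The zero-gradient property can be seen in a small hand-written backward pass. The sketch below assumes a one-hidden-layer ReLU network with a linear output and squared-error loss; the sizes, variable names, and the choice to force hidden unit 0 to be dropped in every case are all hypothetical and serve only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, batch = 3, 4, 5
X = rng.normal(size=(batch, n_in))    # mini-batch of inputs
t = rng.normal(size=(batch, 1))       # targets
W1 = rng.normal(size=(n_in, n_hid))   # input-to-hidden weights
W2 = rng.normal(size=(n_hid, 1))      # hidden-to-output weights

p = 0.5
R = rng.binomial(1, p, size=(batch, n_hid))  # one mask per training case
R[:, 0] = 0  # illustration only: drop hidden unit 0 in every training case

# Forward pass through the sampled thinned networks.
H = np.maximum(X @ W1, 0.0)   # ReLU hidden activations
H_tilde = R * H               # dropped units output zero
out = H_tilde @ W2            # linear output

# Backward pass for the loss 0.5 * mean((out - t)^2), averaged over the batch.
d_out = (out - t) / batch
grad_W2 = H_tilde.T @ d_out
d_H = (d_out @ W2.T) * R * (H > 0)   # the mask also gates the gradient
grad_W1 = X.T @ d_H
```

Because unit 0 is never used in this mini-batch, `grad_W1[:, 0]` and `grad_W2[0]` are exactly zero, while the other units receive gradients only from the cases in which they were retained.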

Dropout can also be applied to finetune nets that have been pretrained using stacks of RBMs, autoencoders or Deep Boltzmann Machines. The pretraining itself is carried out as usual. After pretraining, the weights are scaled up by a factor of $1/p$, so that the expected output of each unit under dropout matches its output during pretraining, and then dropout finetuning is applied. A smaller learning rate should be used during finetuning to retain the information in the pretrained weights.

## Max-norm Regularization

Using dropout along with max-norm regularization, large decaying learning rates and high momentum provides a significant boost over just using dropout. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant $c$. Mathematically, if $\bold w$ represents the vector of weights incident on any hidden unit, then we impose the constraint $||\bold w ||_2 \leq c$. A justification for this constraint is that it makes it possible to use a very large learning rate without the weights blowing up.
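One common way to enforce such a constraint is to project each weight vector back onto the ball of radius $c$ after every update. The sketch below assumes that column $j$ of the weight matrix holds the incoming weights of hidden unit $j$; the function name `max_norm_project` and the example values are hypothetical.

```python
import numpy as np

def max_norm_project(W, c):
    """Rescale each column of W so that its L2 norm is at most c.

    Assumed layout: column j holds the incoming weights of hidden unit j.
    Columns whose norm is already <= c are left unchanged.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # avoid divide-by-zero
    return W * scale

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])          # column norms: 5.0 and 0.5
W_proj = max_norm_project(W, c=2.0)  # column norms become 2.0 and 0.5
```

In a training loop, this projection would typically be applied right after each gradient step.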

## Unsupervised Pretraining

Neural networks can be pretrained using stacks of RBMs<ref name=GeH>Hinton, Geoffrey, et al. "Reducing the dimensionality of data with neural networks." Science (2006).</ref>, autoencoders<ref name=ViP>Vincent, Pascal, et al. "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion." In Proceedings of the 27th International Conference on Machine Learning (2010).</ref> or Deep Boltzmann Machines<ref name=SaR>Salakhutdinov, Ruslan, et al. [http://www.utstat.toronto.edu/~rsalakhu/papers/dbm.pdf "Deep Boltzmann Machines."] In Proceedings of the International Conference on Artificial Intelligence and Statistics (2009).</ref>. Pretraining is an effective way of making use of unlabeled data. Pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. Dropout can be applied to finetune nets that have been pretrained using these techniques. The pretraining procedure stays the same.

## Test Time

A neural net with $n$ units has $2^n$ possible thinned networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability $p$ during training, the outgoing weights of that unit are multiplied by $p$ at test time, so that the unit's expected contribution during training matches its contribution at test time. The figure below shows the intuition.
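For a single linear layer the weight-scaling rule can be checked exactly by enumerating all $2^n$ masks. The sketch below uses hypothetical values with $n = 3$ units. Note that the equality holds exactly only for the linear pre-activation; once nonlinearities are involved, weight scaling is an approximation to the true average.

```python
import itertools
import numpy as np

p = 0.8
y = np.array([1.0, -2.0, 0.5])        # outputs of n = 3 units (hypothetical)
W = np.array([[0.2, -0.1, 0.4],
              [0.7, 0.3, -0.5]])      # outgoing weights to 2 units (hypothetical)

# Average the next layer's pre-activation over all 2^n thinned networks,
# each weighted by the probability of its mask.
avg = np.zeros(2)
for mask in itertools.product([0, 1], repeat=3):
    r = np.array(mask)
    prob = np.prod(np.where(r == 1, p, 1 - p))
    avg += prob * (W @ (r * y))

# Single network without dropout, with outgoing weights multiplied by p.
scaled = (p * W) @ y
```

Here `avg` and `scaled` agree up to floating-point error, which is the sense in which the single scaled network stands in for the average of the thinned networks.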

## Multiplicative Gaussian Noise

Dropout multiplies activations by Bernoulli distributed random variables, which take the value 1 with probability $p$ and 0 otherwise. This idea can be generalized by multiplying the activations by a random variable drawn from $\mathcal{N}(1, 1)$, which works just as well as, or perhaps better than, Bernoulli noise. That is, each hidden activation $h_i$ is perturbed to $h_i+h_ir$ where $r \sim \mathcal{N}(0,1)$, which is equal to $h_ir'$ where $r' \sim \mathcal{N}(1, 1)$. We can generalize this further to $r' \sim \mathcal{N}(1, \sigma^2)$, where $\sigma^2$ is a hyperparameter to tune.
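A minimal sketch of this noise model is given below. The function name `gaussian_dropout` and the example activations are assumptions for illustration. Since $r'$ has mean 1, the expected value of the perturbed activation equals the original activation.

```python
import numpy as np

def gaussian_dropout(h, sigma, rng):
    """Multiply each activation by r' ~ N(1, sigma^2), sampled independently."""
    r_prime = rng.normal(loc=1.0, scale=sigma, size=h.shape)
    return h * r_prime

rng = np.random.default_rng(0)
h = np.array([0.3, 1.5, -0.7])                 # hypothetical hidden activations
h_noisy = gaussian_dropout(h, sigma=1.0, rng=rng)
```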

## Applying dropout to linear regression

Let $X \in \mathbb{R}^{N\times D}$ be a data matrix of $N$ data points and let $\mathbf{y}\in \mathbb{R}^N$ be a vector of targets. Linear regression tries to find a $\mathbf{w}\in \mathbb{R}^D$ that minimizes $\parallel \mathbf{y}-X\mathbf{w}\parallel^2$.

When the input $X$ is dropped out such that any input dimension is retained with probability $p$, the input can be expressed as $R*X$ where $R\in \{0,1\}^{N\times D}$ is a random matrix with $R_{ij}\sim Bernoulli(p)$ and $*$ denotes element-wise product. Marginalizing the noise, the objective function becomes

$\min_{\mathbf{w}} \mathbb{E}_{R\sim Bernoulli(p)}[\parallel \mathbf{y}-(R*X)\mathbf{w}\parallel^2 ]$

This reduces to

$\min_{\mathbf{w}} \parallel \mathbf{y}-pX\mathbf{w}\parallel^2+p(1-p)\parallel \Gamma\mathbf{w}\parallel^2$

where $\Gamma=(diag(X^TX))^{\frac{1}{2}}$. Therefore, dropout with linear regression is equivalent to ridge regression with a particular form for $\Gamma$. This form of $\Gamma$ essentially scales the weight cost for weight $w_i$ by the standard deviation of the $i$th dimension of the data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.
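The reduction above can be checked numerically on a tiny problem by enumerating every possible mask $R$ and computing the expectation exactly. The sizes, the value of $p$, and the random data below are hypothetical and are chosen small enough that all $2^{ND}$ masks can be enumerated.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, D, p = 2, 2, 0.7
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

# Left-hand side: exact expectation of ||y - (R * X) w||^2 over all masks R.
lhs = 0.0
for bits in itertools.product([0, 1], repeat=N * D):
    R = np.array(bits).reshape(N, D)
    prob = np.prod(np.where(R == 1, p, 1 - p))
    lhs += prob * np.sum((y - (R * X) @ w) ** 2)

# Right-hand side: ||y - p X w||^2 + p (1 - p) ||Gamma w||^2.
Gamma = np.diag(np.sqrt(np.diag(X.T @ X)))
rhs = np.sum((y - (p * X) @ w) ** 2) + p * (1 - p) * np.sum((Gamma @ w) ** 2)
```

The two quantities agree up to floating-point error. The penalty term arises from the variance of each masked entry: $R_{ij}X_{ij}w_j$ has variance $p(1-p)X_{ij}^2w_j^2$, and summing over $i$ and $j$ gives $p(1-p)\parallel \Gamma\mathbf{w}\parallel^2$.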

## Bayesian Neural Networks and Dropout

For a data set $\,\{(x_i,y_i)\}^n_{i=1}$, the Bayesian approach to estimating $\,y_{n+1}$ given $\,x_{n+1}$ is to pick a prior distribution $\,P(\theta)$ over the parameters and to assign probabilities to $\,y_{n+1}$ using the posterior distribution of $\,\theta$, which is determined by the prior distribution and the data set.

The general formula is:

$\,P(y_{n+1}|y_1,\dots,y_n,x_1,\dots,x_n,x_{n+1})=\int P(y_{n+1}|x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta$

To obtain a prediction, it is common to take the expected value of this distribution to get the formula:

$\,\hat y_{n+1}=\int\int y_{n+1}P(y_{n+1}|x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)\,d\theta\,dy_{n+1}$

This formula can be applied to a neural network by thinking of $\,\theta$ as all of the parameters in the neural network, and of $\,P(y_{n+1}|x_{n+1},\theta)$ as the output of the neural network given some set of weights and the input. Since the output of a neural network is deterministic given its weights and input, this probability is 1 for a single output and 0 for all other possible outputs, so the formula can be rewritten as:

$\,\hat y_{n+1}=\int f(x_{n+1},\theta)P(\theta|y_1,\dots,y_n,x_1,\dots,x_n)d\theta$

where $\,f(x_{n+1},\theta)$ is the output of the neural network given some weights and input. On closer inspection, this expected value is essentially the average of infinitely many possible neural network outputs, each weighted by the posterior probability of its weights given the data set.

The dropout model does something very similar, in that it averages the outputs of a wide variety of models with different weights. However, unlike Bayesian neural networks, where each model's output is weighted by its proper posterior probability, the dropout model assigns equal probability to each model. This reduces the accuracy of dropout neural networks compared to Bayesian neural networks, but it has very strong advantages in training speed and ability to scale.
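The difference between the two averaging schemes can be illustrated with a toy discrete stand-in for the integral. Everything in the sketch below is hypothetical: three candidate parameter values, made-up posterior weights, and a one-parameter model $f(x,\theta)=\theta x$.

```python
import numpy as np

thetas = np.array([0.5, 1.0, 2.0])      # hypothetical candidate parameters
posterior = np.array([0.1, 0.7, 0.2])   # hypothetical P(theta | data), sums to 1
x_new = 3.0

outputs = thetas * x_new                 # toy model: f(x, theta) = theta * x

bayes_pred = np.sum(posterior * outputs) # posterior-weighted average (Bayesian)
equal_pred = np.mean(outputs)            # equally weighted average (dropout-like)
```

With these numbers the posterior-weighted prediction is 3.45 and the equally weighted one is 3.5. The two differ whenever the posterior is not uniform over the candidate models.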

Despite this approximate probability weighting, the researchers compared the two models and found that, while dropout is less accurate than Bayesian neural networks, it is still better than standard neural network models. This can be seen in their chart below, where higher is better:

# Effects of Dropout

## Effect on Features

In a standard neural network, units may change in a way that fixes up the mistakes of other units, which may lead to complex co-adaptations and overfitting because these co-adaptations do not generalize to unseen data. Dropout breaks the co-adaptations between hidden units by making the presence of other units unreliable. Figure 7a (without dropout) shows that the hidden units have co-adapted in order to produce good reconstructions; each hidden unit on its own does not detect a meaningful feature. In Figure 7b (with dropout), the hidden units seem to detect edges, strokes and spots in different parts of the image. File:feature.png

## Effect on Sparsity

Sparsity helps prevent overfitting. In a good sparse model, there should be only a few highly activated units for any data case, and the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations with dropout (Figure 8b) than without it (Figure 8a). File:sparsity.png

## Effect of Dropout Rate

The paper investigates the effect of the tunable hyperparameter $p$, the probability of retaining a unit. The comparison is done in two situations:

1. The number of hidden units is held constant. (fixed $n$)
2. The expected number of hidden units that will be retained after dropout is held constant. (fixed $pn$)


The optimal $p$ in case 1 lies between 0.4 and 0.8, while in case 2 it is around 0.6. The usual default value in practice is 0.5, which is close to optimal. File:pvalue.png
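In the second situation, the layer width has to grow as $p$ shrinks so that the product $pn$ stays the same. The short sketch below makes this arithmetic concrete. The budget of 256 retained units and the grid of $p$ values are hypothetical choices for illustration.

```python
# Hypothetical budget: expected number of retained hidden units, p * n.
expected_retained = 256

# For each retention probability p, the layer width n that keeps p * n fixed.
widths = {p: int(round(expected_retained / p)) for p in (0.2, 0.4, 0.5, 0.8)}
# widths == {0.2: 1280, 0.4: 640, 0.5: 512, 0.8: 320}
```

Small values of $p$ therefore correspond to much wider layers in this setting, whereas in the first situation the width is the same for every $p$.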

## Effect of Data Set Size

This section explores the effect of changing the data set size when dropout is used with feed-forward networks. Figure 10 shows that dropout does not give any improvement on small data sets (100 and 500 examples). As the size of the data set increases, the gain from dropout increases up to a point and then declines.

# Comparison

The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularization methods. Dropout + Max-norm outperforms all other chosen methods. The results are shown below:

# Result

The authors applied dropout to the MNIST data set and compared it with other methods. The MNIST data set consists of 28 × 28 pixel handwritten digit images, and the task is to classify the images into 10 digit classes. In the result table, a Deep Boltzmann Machine with dropout finetuning performs best, with an error rate of only 0.79%.

In order to test the robustness of dropout, they did classification experiments with networks of many different architectures keeping all hyperparameters fixed. The figure below shows the test error rates obtained for these different architectures as training progresses. Dropout gives a huge improvement across all architectures.

The authors also applied dropout to many neural networks and tested them on different data sets, such as Street View House Numbers (SVHN), CIFAR, ImageNet and TIMIT. In these experiments, adding dropout reduced the error rate and improved the performance of the networks.

# Conclusion

Dropout is a technique for preventing overfitting in deep neural networks that have a large number of parameters. It can also be extended to Restricted Boltzmann Machines and applied to other architectures, e.g. convolutional networks. One drawback of dropout is that it increases training time, which creates a trade-off between overfitting and training time.

<references />