From statwiki
Revision as of 22:16, 2 November 2015 by Lruan (talk | contribs) (Model)


Dropout is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks by using a single unthinned network with smaller weights.


By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p, independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).
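To make the idea of averaging over thinned networks concrete, the following minimal sketch enumerates every thinned version of a single small linear layer and compares the exact average with one unthinned layer whose weights are multiplied by p. It assumes the test-time approximation is this weight-scaling rule; the function names `average_over_thinned` and `scaled_weights` are hypothetical, chosen only for illustration. For a purely linear layer the two results agree exactly; once a nonlinearity is applied, weight scaling is only an approximation.

```python
import itertools
import numpy as np

def average_over_thinned(y, W, b, p):
    """Exact expectation of W @ (r * y) + b over all 2^n Bernoulli(p) masks r."""
    n = y.shape[0]
    total = np.zeros(W.shape[0])
    for mask in itertools.product([0, 1], repeat=n):
        r = np.array(mask)
        kept = int(r.sum())
        prob = p ** kept * (1 - p) ** (n - kept)  # probability of this thinned network
        total += prob * (W @ (r * y) + b)
    return total

def scaled_weights(y, W, b, p):
    """Single unthinned layer with weights scaled by the retention probability p."""
    return (p * W) @ y + b

rng = np.random.default_rng(0)
y = rng.normal(size=4)        # 4 units, hence 2^4 = 16 thinned networks
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)

avg = average_over_thinned(y, W, b, p=0.5)
scaled = scaled_weights(y, W, b, p=0.5)
print(np.allclose(avg, scaled))
```

With n units there are 2^n possible thinned networks, so explicit enumeration is only feasible for toy sizes like this one; that is why a cheap approximation is needed at test time.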


Consider a neural network with [math]\ L [/math] hidden layers. Let [math]\bold{z}^{(l)} [/math] denote the vector of inputs into layer [math] l [/math], and [math]\bold{y}^{(l)} [/math] the vector of outputs from layer [math] l [/math]. [math]\ \bold{W}^{(l)} [/math] and [math]\ \bold{b}^{(l)} [/math] are the weights and biases at layer [math]l [/math]. With dropout, the feed-forward operation becomes:

[math]\ r^{(l)}_j \sim Bernoulli(p) [/math]
[math]\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}[/math] , here * denotes an element-wise product.
[math]\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i [/math]
[math]\ y^{(l+1)}_i=f(z^{(l+1)}_i) [/math]

For any layer [math]l [/math], [math]\bold r^{(l)} [/math] is a vector of independent Bernoulli random variables, each of which has probability [math]p [/math] of being 1. [math]\tilde {\bold y}^{(l)} [/math] is the thinned output of layer [math]l [/math], obtained by dropping some of its units, and it is used as the input to layer [math]l+1 [/math]. The rest of the model remains the same as in a regular feed-forward neural network.
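The four equations above can be sketched for a single layer in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `dropout_forward`, the choice of `tanh` for the activation f, and the array shapes are all assumptions made for the example.

```python
import numpy as np

def dropout_forward(y, W, b, p, rng, f=np.tanh):
    """One dropout feed-forward step from layer l to layer l+1.

    y : outputs of layer l, shape (n_in,)
    W : weights of layer l+1, shape (n_out, n_in)
    b : biases of layer l+1, shape (n_out,)
    p : probability of retaining each unit
    f : activation function (tanh is an assumption for this sketch)
    """
    r = rng.binomial(1, p, size=y.shape)  # r_j ~ Bernoulli(p)
    y_tilde = r * y                       # element-wise product drops units
    z = W @ y_tilde + b                   # z_i = w_i . y_tilde + b_i
    return f(z), r

rng = np.random.default_rng(0)
y = rng.normal(size=5)
W = rng.normal(size=(3, 5))
b = np.zeros(3)
y_next, r = dropout_forward(y, W, b, p=0.5, rng=rng)
```

Setting `p=1.0` retains every unit, so the function then reduces to the regular feed-forward step.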

Backpropagation in the dropout case

A dropout neural network can be trained using stochastic gradient descent in a manner similar to a standard neural network. The only difference is that, for each training case in a mini-batch, we sample a thinned network by dropping out units, and forward propagation and backpropagation are done only on that thinned network. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.
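The zero-gradient behaviour can be checked on a toy example. The sketch below assumes a single linear layer with identity activation and a squared-error loss, neither of which is specified in the text; the function name `dropout_layer_grads` is likewise hypothetical. Each training case gets its own mask, and the per-case weight gradients are averaged over the mini-batch.

```python
import numpy as np

def dropout_layer_grads(X, T, W, b, R):
    """Mini-batch gradients for one linear layer fed by dropped-out inputs.

    X : layer-l outputs, shape (B, n_in)
    T : targets, shape (B, n_out)
    R : Bernoulli masks, one row per training case, shape (B, n_in)
    Assumed loss per case: 0.5 * ||z - t||^2 with identity activation.
    """
    Y_tilde = R * X                    # one thinned network per training case
    Z = Y_tilde @ W.T + b              # forward pass on the thinned inputs
    dZ = Z - T                         # dLoss/dZ for the squared error
    # per_case_dW[n, i, j] = dZ[n, i] * Y_tilde[n, j]
    per_case_dW = dZ[:, :, None] * Y_tilde[:, None, :]
    dW = per_case_dW.mean(axis=0)      # average over the mini-batch
    db = dZ.mean(axis=0)
    return dW, db, per_case_dW

rng = np.random.default_rng(0)
B, n_in, n_out = 4, 5, 3
X = rng.normal(size=(B, n_in))
T = rng.normal(size=(B, n_out))
W = rng.normal(size=(n_out, n_in))
b = np.zeros(n_out)
R = rng.binomial(1, 0.5, size=(B, n_in))
dW, db, per_case_dW = dropout_layer_grads(X, T, W, b, R)

# Weights attached to a unit dropped in case n receive zero gradient from case n.
dropped = per_case_dW.transpose(0, 2, 1)[R == 0]
print(np.all(dropped == 0))
```

In this sketch the bias gradient is not masked, because the biases of layer l+1 are still used even when some units of layer l are dropped.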