# Difference between revisions of "dropout"


## Revision as of 20:32, 12 November 2015

# Introduction

Dropout is a technique for preventing overfitting in deep neural networks, which contain a large number of parameters. The key idea is to randomly drop units from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, we approximate the effect of averaging the predictions of all these thinned networks.

**Demonstration**

By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure 1. Each unit is retained with probability p, independently of the other units (p is usually set to 0.5, which seems to be close to optimal for a wide range of networks and tasks).

# Model

Consider a neural network with [math]\ L [/math] hidden layers. Let [math]\bold{z}^{(l)} [/math] denote the vector of inputs into layer [math] l [/math] and [math]\bold{y}^{(l)} [/math] the vector of outputs from layer [math] l [/math]. [math]\ \bold{W}^{(l)} [/math] and [math]\ \bold{b}^{(l)} [/math] are the weights and biases at layer [math]l [/math]. With dropout, the feed-forward operation becomes:

- [math]\ r^{(l)}_j \sim Bernoulli(p) [/math]

- [math]\ \tilde{\bold{y}}^{(l)}=\bold{r}^{(l)} * \bold y^{(l)}[/math] , here * denotes an element-wise product.

- [math]\ z^{(l+1)}_i = \bold w^{(l+1)}_i\tilde {\bold y}^{(l)}+b^{(l+1)}_i [/math]

- [math]\ y^{(l+1)}_i=f(z^{(l+1)}_i) [/math]

For any layer [math]l [/math], [math]\bold r^{(l)} [/math] is a vector of independent Bernoulli random variables, each of which has probability [math]p [/math] of being 1. [math]\tilde {\bold y}^{(l)} [/math] is the thinned output of layer [math]l [/math], obtained after dropping some hidden units, and it is used as the input to layer [math]l+1 [/math]. The rest of the model remains the same as a regular feed-forward neural network.
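
The four feed-forward equations above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `dropout_forward`, the choice of `tanh` as the activation [math]f [/math], and the example numbers are all assumptions made for the sketch.

```python
import math
import random


def dropout_forward(y, W, b, p, rng, f=math.tanh):
    """One dropout feed-forward step from layer l to layer l+1.

    y : outputs of layer l (list of floats)
    W : weights of layer l+1, one row per unit of layer l+1
    b : biases of layer l+1
    p : probability of retaining a unit
    """
    # r_j ~ Bernoulli(p): 1 keeps unit j, 0 drops it
    r = [1.0 if rng.random() < p else 0.0 for _ in y]
    # y_tilde = r * y (element-wise product)
    y_tilde = [r_j * y_j for r_j, y_j in zip(r, y)]
    # z_i = w_i . y_tilde + b_i
    z = [sum(w_ij * yt_j for w_ij, yt_j in zip(w_i, y_tilde)) + b_i
         for w_i, b_i in zip(W, b)]
    # y_i = f(z_i)
    y_next = [f(z_i) for z_i in z]
    return r, y_tilde, y_next


# Hypothetical example: 3 units in layer l, 2 units in layer l+1
rng = random.Random(0)
y = [0.5, -1.0, 2.0]
W = [[0.1, 0.2, 0.3], [-0.3, 0.0, 0.4]]
b = [0.0, 0.1]
r, y_tilde, y_next = dropout_forward(y, W, b, p=0.5, rng=rng)
```

Setting `p=1.0` keeps every unit and recovers the standard feed-forward step, which is a quick way to check the sketch.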

**Backpropagation in Dropout Case (Training)**

Dropout neural networks can be trained using stochastic gradient descent in a manner similar to standard neural networks. The only difference is that, for each training case, we back-propagate only through the thinned network sampled for that case. The gradients for each parameter are averaged over the training cases in each mini-batch. Any training case which does not use a parameter contributes a gradient of zero for that parameter.
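
The zero-gradient rule can be seen in a small sketch. The setup below is an assumption chosen for brevity: a single linear output unit with dropout applied to its inputs and a squared-error loss. The function name `minibatch_gradient` and the example values are hypothetical.

```python
import random


def minibatch_gradient(w, batch, p, rng):
    """Average squared-error gradient for one linear unit with dropout on its inputs.

    batch is a list of (y, t) pairs: input vector y and scalar target t.
    Returns the averaged gradient and the masks that were sampled.
    """
    grad = [0.0] * len(w)
    masks = []
    for y, t in batch:
        # Sample a fresh thinned network for this training case
        r = [1.0 if rng.random() < p else 0.0 for _ in y]
        masks.append(r)
        z = sum(w_j * r_j * y_j for w_j, r_j, y_j in zip(w, r, y))
        err = z - t  # derivative of 0.5 * (z - t)**2 with respect to z
        for j in range(len(w)):
            # d loss / d w_j = err * r_j * y_j, which is 0 when unit j was dropped
            grad[j] += err * r[j] * y[j]
    # Average over all cases in the mini-batch, including the zero contributions
    return [g / len(batch) for g in grad], masks


rng = random.Random(0)
batch = [([1.0, 2.0], 1.0), ([0.5, -1.0], 0.0)]
grad, masks = minibatch_gradient([0.3, -0.2], batch, p=0.5, rng=rng)
```

Note that the average divides by the full mini-batch size, so a weight whose unit was dropped in most cases receives a correspondingly smaller update.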

**Max-norm Regularization**

Using dropout along with max-norm regularization provides a significant boost over using dropout alone. Max-norm regularization constrains the norm of the incoming weight vector at each hidden unit to be bounded above by a fixed constant [math]c [/math]. Mathematically, if [math]\bold w [/math] represents the vector of weights incident on any hidden unit, then we impose the constraint [math]||\bold w ||_2 \leq c [/math]. A justification for this constraint is that it makes it possible to use a very large learning rate without the weights blowing up.
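
The text above does not say how the constraint is enforced. One common approach, assumed here, is to rescale [math]\bold w [/math] back onto the ball of radius [math]c [/math] whenever its norm exceeds [math]c [/math] after a gradient step. The helper name `max_norm_project` is hypothetical.

```python
import math


def max_norm_project(w, c):
    """Return w rescaled so that its L2 norm is at most c.

    Vectors that already satisfy ||w||_2 <= c are returned unchanged.
    """
    norm = math.sqrt(sum(w_j * w_j for w_j in w))
    if norm <= c:
        return list(w)
    scale = c / norm
    return [w_j * scale for w_j in w]


# A vector of norm 5 is rescaled to norm 1; its direction is preserved
w_clipped = max_norm_project([3.0, 4.0], c=1.0)
# A vector already inside the ball is left alone
w_same = max_norm_project([0.3, 0.4], c=1.0)
```

In a training loop this projection would be applied to each hidden unit's incoming weight vector after every update.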

**Test Time**

If a neural net has [math]n [/math] units, there are [math]2^n [/math] possible thinned neural networks. It is not feasible to explicitly run exponentially many thinned models and average them to obtain a prediction. Thus, at test time, the idea is to use a single neural net without dropout. The weights of this network are scaled-down versions of the trained weights: if a unit is retained with probability [math]p [/math] during training, the outgoing weights of that unit are multiplied by [math]p [/math] at test time. The figure below shows the intuition.
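
For a small number of units, the average over all [math]2^n [/math] thinned networks can be computed by brute force and compared with the weight-scaling rule. The sketch below does this for the pre-activation of a single linear unit, where the two quantities agree exactly; once a non-linearity is applied, weight scaling is only an approximation of the average. The function names and example values are assumptions.

```python
import itertools


def expected_preactivation(w, y, p):
    """Average w . (r * y) over all 2^n masks r, weighted by their probability."""
    n = len(y)
    total = 0.0
    for r in itertools.product([0, 1], repeat=n):
        kept = sum(r)
        # Probability of this particular mask under independent Bernoulli(p) draws
        prob = p ** kept * (1 - p) ** (n - kept)
        total += prob * sum(w_j * r_j * y_j for w_j, r_j, y_j in zip(w, r, y))
    return total


def scaled_preactivation(w, y, p):
    """Single pass without dropout, with the outgoing weights multiplied by p."""
    return sum(p * w_j * y_j for w_j, y_j in zip(w, y))


w = [0.2, -0.5, 1.0]
y = [1.0, 2.0, -0.5]
averaged = expected_preactivation(w, y, p=0.5)
scaled = scaled_preactivation(w, y, p=0.5)
```

The brute-force loop visits 8 masks here; the cost doubles with each extra unit, which is why the single scaled network is used in practice.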

# Effects of Dropout

**Effect on Features**

Dropout breaks the co-adaptations between hidden units. Figure 7a (trained without dropout) shows that the hidden units have co-adapted in order to produce good reconstructions: each hidden unit on its own does not appear to detect a meaningful feature. In Figure 7b (trained with dropout), the hidden units seem to detect edges, strokes and spots in different parts of the image.
**picture**

**Effect on Sparsity**

Sparsity helps prevent overfitting. In a good sparse model, only a few units should be highly activated for any data case, and the average activation of any unit across data cases should be low. Comparing the histograms of activations, we can see that fewer hidden units have high activations in Figure 8b (with dropout) than in Figure 8a (without dropout).
**picture**

**Effect of Dropout Rate**

The paper examines how the tunable hyperparameter [math]p [/math] affects performance. The comparison is done in two situations:

1. The number of hidden units is held constant (fixed [math]n [/math]).
2. The expected number of hidden units that will be retained after dropout is held constant (fixed [math]pn [/math]).

In case 1, the optimal [math]p [/math] lies between 0.4 and 0.8, while in case 2 it is 0.6. The usual default value in practice is 0.5, which is close to optimal.
**picture**
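
To make the second setting concrete: holding [math]pn [/math] fixed means that the layer has to grow as [math]p [/math] shrinks, since [math]n = pn / p [/math]. The sketch below works through the arithmetic for a hypothetical target of 256 retained units; that number and the function name are illustrative assumptions, not values taken from the paper.

```python
def layer_size_for_fixed_pn(pn, p):
    """Number of hidden units n needed so that the expected retained count is pn."""
    return round(pn / p)


# Hypothetical target: 256 units retained on average
sizes = {p: layer_size_for_fixed_pn(256, p) for p in (0.2, 0.5, 0.8)}
# sizes == {0.2: 1280, 0.5: 512, 0.8: 320}
```

So in the fixed [math]pn [/math] setting, a small [math]p [/math] is paired with a much wider layer, whereas in the fixed [math]n [/math] setting a small [math]p [/math] simply leaves fewer active units.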

**Effect of Data Set Size**

This section explores the effect of changing the data set size when dropout is used with feed-forward networks. According to Figure 10, dropout does not give any improvement on small data sets (of size 100 and 500). As the size of the data set increases, the gain from dropout increases up to a point and then declines.

**picture**