Improving neural networks by preventing co-adaptation of feature detectors
Kyle Jung, Dae Hyun Kim, Seokho Lim, Stan Lee
Introduction to Dropout + Dataset
The TIMIT corpus is a standard dataset for evaluating automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English, each reading 10 phonetically rich sentences. The objective is to convert a given speech signal into a transcription sequence of phones. Hidden Markov Models (HMMs) are typically used to deal with temporal variability, together with an acoustic model that determines how well the coefficients extracted from the input fit each HMM state. Recent results show that feedforward neural networks that map acoustic input to a probability distribution over HMM states perform better than traditional Gaussian mixture models on speech recognition datasets including TIMIT.
A neural network was trained as the acoustic model and evaluated by its classification error rate on the TIMIT test set. The network has four fully connected hidden layers with 4000 neurons per layer. The output layer has 185 softmax output neurons, which are merged into the 39 distinct classes. The input is a window of 21 adjacent frames with an advance of 10 ms per frame. A decoder that knows the transition probabilities between HMM states runs the Viterbi algorithm on the per-frame class probabilities output by the neural network to predict the single best sequence of HMM states. The results show that applying 50% dropout to the hidden units improves classification performance across a variety of network architectures compared with the same networks without dropout: the classification error was 19.7% with dropout and 22.7% without.
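The decoding step can be illustrated with a minimal Viterbi sketch. It is not the authors' decoder: the function name `viterbi`, its argument layout, and the two-state toy numbers are all assumptions made for illustration, and the per-frame scores are taken directly as log class probabilities.

```python
import math

def viterbi(frame_log_probs, log_trans, log_init):
    """Return the most likely state sequence (hypothetical helper).

    frame_log_probs[t][s]: log score of state s at frame t (e.g. from the net).
    log_trans[i][j]: log probability of moving from state i to state j.
    log_init[s]: log probability of starting in state s.
    """
    n_states = len(log_init)
    # best[s] = best log score of any path ending in state s at the current frame
    best = [log_init[s] + frame_log_probs[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for obs in frame_log_probs[1:]:
        prev = best
        pointers, best = [], []
        for j in range(n_states):
            i_best = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(i_best)
            best.append(prev[i_best] + log_trans[i_best][j] + obs[j])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy example (made-up numbers): two "sticky" states, one noisy middle frame.
log = math.log
frames = [[log(0.9), log(0.1)], [log(0.4), log(0.6)], [log(0.9), log(0.1)]]
trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
init = [log(0.5), log(0.5)]
path = viterbi(frames, trans, init)
```

In this toy case, picking the most probable class frame by frame would give the sequence 0, 1, 0, while the Viterbi path stays in state 0 throughout because switching states is unlikely under the transition probabilities.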
The neural network was pretrained as a Deep Belief Network. Since the inputs are real-valued, a Gaussian RBM was used to pretrain the first layer. Visible biases were initialized to zero and weights were sampled from a normal distribution N(0, 0.01). The variance of each visible neuron was set to 1.0 and was not learned. Learning was done by minimizing Contrastive Divergence (CD). Momentum was used to speed up learning; it was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. A learning rate of 0.001 on the average gradient was used, which was then multiplied by (1 − momentum), and the L2 weight decay was set to 0.001. With these hyperparameters, the Gaussian RBM was trained for 100 epochs. Binary RBMs with a learning rate of 0.01 were used to train all subsequent layers. The visible bias of each neuron was initialized to log(p/(1 − p)), where p is the mean activation of that neuron in the data set. Each of these layers was trained for 50 epochs, and all remaining hyperparameters were the same as those for the Gaussian RBM.
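The momentum schedule, the scaled learning rate, and the bias initialization above can be written out as a small sketch. The function names are hypothetical, and it is an assumption that epochs are counted from 0 and that momentum stays at 0.9 once the ramp is finished.

```python
import math

def momentum_at(epoch, start=0.5, end=0.9, ramp_epochs=20):
    """Momentum rising linearly from `start` to `end` over `ramp_epochs`, then flat."""
    frac = min(epoch / ramp_epochs, 1.0)
    return start + frac * (end - start)

def effective_learning_rate(epoch, base_lr=0.001):
    """Base learning rate multiplied by (1 - momentum), as for the Gaussian RBM."""
    return base_lr * (1.0 - momentum_at(epoch))

def visible_bias_init(p):
    """log(p / (1 - p)): the bias at which a logistic unit has mean activation p."""
    return math.log(p / (1.0 - p))
```

The bias formula is the inverse of the logistic function, so a unit with no other input starts with an activation probability equal to its mean activation p in the data.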
The pretrained RBMs were used to initialize the weights of the neural network. The network was then finetuned with dropout-backpropagation. Momentum was initially set to 0.5 and increased linearly to 0.9 over 10 epochs. A constant learning rate of 1.0 was applied to the average gradient on a minibatch. All other hyperparameters were the same as for MNIST dropout finetuning. The model required approximately 200 epochs to converge. For comparison, the same network was also finetuned with standard backpropagation using a learning rate of 0.1 and otherwise the same hyperparameters.
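As a rough illustration of what dropout does to a hidden layer during finetuning, the sketch below zeroes each hidden unit with probability 0.5 during training. The test-time rule shown (keeping all units and scaling their outgoing weights by the keep probability) is not described in this summary; it is included as the usual companion to training-time dropout and should be read as an assumption. Function names are hypothetical.

```python
import random

def dropout_train(activations, rng, drop_prob=0.5):
    """Training pass: zero each hidden unit independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]

def mean_network_weights(weights, drop_prob=0.5):
    """Test pass (assumed): keep every unit, scale outgoing weights by (1 - drop_prob)."""
    keep = 1.0 - drop_prob
    return [[w * keep for w in row] for row in weights]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
hidden = [1.0] * 10000
dropped = dropout_train(hidden, rng)
kept_fraction = sum(dropped) / len(dropped)   # close to 0.5 on average
```

Because a different random subset of units is removed for each training case, a unit cannot rely on specific other units being present, which is the co-adaptation the title refers to.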
Dropout was compared with standard backpropagation on several network architectures and input representations, and it consistently achieved lower error and cross-entropy. It significantly reduced overfitting, which made the method robust to the choice of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Networks trained with dropout were also not very sensitive to the choice of learning rate and momentum.
Models for CIFAR-10:
CIFAR-10 is a popular object recognition dataset of 32 × 32 color images found by searching the web. It contains 10 classes, and the images were labelled with the noun used to search for them. Each class has 6000 images (5000 training and 1000 test), each showing a single dominant object that matches the label.
They implemented two different models for CIFAR-10, one with dropout and one without. Because dropout imposes a strong regularization on the network, the model with dropout can use more parameters. It therefore adds a fourth weight layer, which takes its input from the last pooling layer. This layer is locally connected but not convolutional and contains 16 banks of filters of size 3 × 3; it is the layer to which 50% dropout is applied. The softmax layer then takes its input from this fourth weight layer.
The model without dropout is a CNN with three convolutional layers, each followed by a pooling layer. The pooling layer that follows the first convolutional layer performs max-pooling, and the remaining two pooling layers perform average-pooling. Response normalization layers with N = 9, α = 0.001, and β = 0.75 follow the first two pooling layers.
The upper-most pooling layer is connected to a ten-unit softmax layer, which outputs a probability distribution over class labels. All convolutional layers have 64 filter banks and use a filter size of 5 × 5.
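For readers unfamiliar with the softmax layer, a minimal sketch of how ten raw scores become a probability distribution over the ten class labels is shown below. The function name and the max-subtraction trick are illustrative choices, not details taken from the paper.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Ten equal scores give a uniform distribution over the ten classes.
uniform = softmax([0.0] * 10)
```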
With a neural network of 3 convolutional hidden layers interleaved with 3 max-pooling layers, the classification error was 16.6%, which beats the best published error rate of 18.5% obtained without using transformed data. Adding one locally connected layer after these 6 layers, with dropout at the last hidden layer, reduced the error rate to 15.6%.
ImageNet is a dataset of millions of high-resolution labelled images in thousands of categories. The images were collected from the web and labelled by human labellers using Amazon's Mechanical Turk crowd-sourcing tool. Perfect accuracy on this dataset is very difficult to achieve, even for humans, because the images often contain multiple instances of ImageNet objects and there is a large number of object classes. ImageNet and CIFAR-10 are very similar, but ImageNet is about 20 times larger (1,300,000 vs 60,000 images). The dataset used has about 1.3 million training images, 50,000 validation images, and 150,000 testing images. They used images resized to 256 × 256 pixels for their experiments.
Figure: example of an image that is ambiguous to label.
When the paper was written, the best published error rate on this dataset was 45.7%, reported in "High-dimensional signature compression for large-scale image classification" (J. Sanchez, F. Perronnin, CVPR 2011). The authors achieved a comparable error rate of 48.6% using a single neural network with five convolutional hidden layers interleaved with max-pooling layers, followed by two globally connected layers and a final 1000-way softmax layer. Using 50% dropout in the 6th hidden layer reduced the error rate to 42.4%.
For the speech recognition dataset (TIMIT) and the object recognition datasets (CIFAR-10 and ImageNet), designing the architecture of the net required a large number of decisions. These decisions were made with a separate validation set, which was used to evaluate the performance of a large number of different architectures. The architecture that performed best with dropout on the validation set was then applied to the real test set.
Models for ImageNet:
The model for ImageNet with dropout is described here (the one without dropout was similar, but it had a serious problem with overfitting). They used a convolutional neural network trained on 224 × 224 patches randomly extracted from the 256 × 256 images. As a form of data augmentation, this reduces the network's capacity to overfit the training data and helps generalization. At test time, the prediction of the net was averaged over ten 224 × 224 patches of the 256 × 256 input image: the center patch, the four corner patches, and their horizontal reflections.
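The ten-patch test procedure can be sketched as follows. This is only an illustration under simplifying assumptions: images are plain 2-D lists (a single channel), the helper names are hypothetical, and the network that would produce the per-patch predictions is left out.

```python
def ten_crop_boxes(image_size=256, patch_size=224):
    """Return (top, left, flipped) for the ten test patches."""
    edge = image_size - patch_size   # 32 for 256 -> 224
    center = edge // 2               # 16
    origins = [(center, center), (0, 0), (0, edge), (edge, 0), (edge, edge)]
    return [(top, left, flipped)
            for flipped in (False, True)
            for top, left in origins]

def crop(image, top, left, size, flipped):
    """Cut a size x size patch from a 2-D list; mirror it left-right if flipped."""
    rows = [row[left:left + size] for row in image[top:top + size]]
    return [row[::-1] for row in rows] if flipped else rows

def average_predictions(predictions):
    """Average per-class probabilities across patches."""
    n = len(predictions)
    return [sum(p[c] for p in predictions) / n for c in range(len(predictions[0]))]

boxes = ten_crop_boxes()  # center and four corners, each plain and mirrored
```

In use, each of the ten patches would be passed through the trained net, and `average_predictions` would combine the ten resulting probability vectors into one prediction for the image.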
To maximize performance on the validation set, it was necessary to use the very complicated network architecture described in this section. This shows that dropout helps even for very complex neural nets that have been developed by the joint efforts of many groups over many years to be really good at object recognition. Using non-convolutional higher layers with a lot of parameters leads to a big improvement with dropout, but makes things worse without dropout.
Layers 1, 2, 3, 4, 5: convolutional layers.
Layers 6, 7: globally connected layers.
Layers 1, 2, 5: followed by max-pooling layers.
Layer 1: has 64 filter banks with 11 × 11 filters, which it applies with a stride of 4 pixels.
Layer 2: has 256 filter banks with 5 × 5 filters. This layer takes two inputs:
- The first input: the pooled and response-normalized output of the first convolutional layer. The 256 banks in this layer are divided arbitrarily into groups of 64, and each group connects to a unique random 16 channels from the first convolutional layer.
- The second input: a subsampled version of the original image (56 × 56).
Layers 3, 4, 5: connected to one another without any intervening pooling or normalization layers, but the max-with-zero nonlinearity is applied at each layer after linear filtering.
Layer 3: has 512 filter banks divided into groups of 32, each group connecting to a unique random subset of 16 channels produced by the (pooled, normalized) outputs of the second convolutional layer.
Layers 4, 5: have 512 filter banks divided into groups of 32, each group connecting to a unique random subset of 32 channels produced by the layer below.
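The grouped connectivity of layers 2 to 5 can be illustrated with a short sketch. It assumes that "unique" means the channel subsets of different groups do not overlap; the summary does not say this explicitly, and the function name and arguments are hypothetical.

```python
import random

def random_group_connectivity(n_banks, group_size, n_input_channels, fan_in, rng):
    """Assign each group of filter banks its own random set of input channels.

    Assumption: groups get disjoint channel sets, which requires
    n_groups * fan_in <= n_input_channels.
    """
    n_groups = n_banks // group_size
    if n_groups * fan_in > n_input_channels:
        raise ValueError("not enough input channels for disjoint subsets")
    channels = list(range(n_input_channels))
    rng.shuffle(channels)
    return [sorted(channels[g * fan_in:(g + 1) * fan_in]) for g in range(n_groups)]

rng = random.Random(0)
# Layer 2: 256 banks in groups of 64, each group sees 16 of the 64 layer-1 channels.
layer2 = random_group_connectivity(256, 64, 64, 16, rng)
# Layer 3: 512 banks in groups of 32, each group sees 16 of the 256 layer-2 channels.
layer3 = random_group_connectivity(512, 32, 256, 16, rng)
# Layers 4 and 5: 512 banks in groups of 32, each group sees 32 of 512 channels.
layer4 = random_group_connectivity(512, 32, 512, 32, rng)
```

Under this reading, the numbers listed above work out so that the groups in each layer together cover every input channel exactly once (4 × 16 = 64, 16 × 16 = 256, 16 × 32 = 512).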