A Bayesian Perspective on Generalization and Stochastic Gradient Descent
Introduction
This work builds on Zhang et al. (2016), who showed that deep neural networks can easily memorize randomly labeled training data, despite generalizing well when trained on the real labels of the same inputs. The authors consider two questions: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? They show that the same phenomenon occurs in small linear models, and they explain these observations using the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. They also demonstrate that, when the learning rate is held fixed, there is an optimum batch size which maximizes the test set accuracy, and they propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, they identify the “noise scale” [math]\displaystyle{ g \approx \epsilon N/B }[/math], where [math]\displaystyle{ \epsilon }[/math] is the learning rate, [math]\displaystyle{ N }[/math] the training set size and [math]\displaystyle{ B }[/math] the batch size. Consequently, the optimum batch size is proportional to both the learning rate and the size of the training set, [math]\displaystyle{ B_{opt} \propto \epsilon N }[/math]. They verify these predictions empirically.
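To make the scaling rule concrete, here is a minimal sketch (not taken from the paper's code; the function names and reference values are illustrative) of computing the noise scale and rescaling a known-good batch size when the learning rate or training set size changes.

```python
# Minimal sketch (illustrative, not the paper's code) of the SGD noise scale
# g ~ eps * N / B and of the linear scaling rule B_opt proportional to eps * N
# that follows from keeping g fixed.

def noise_scale(eps: float, n_train: int, batch_size: int) -> float:
    """SGD noise scale g ~ eps * N / B from the stochastic differential
    equation interpretation of SGD (assumes B << N)."""
    return eps * n_train / batch_size

def rescaled_batch_size(b_ref: int, eps_ref: float, n_ref: int,
                        eps_new: float, n_new: int) -> int:
    """Keep the noise scale fixed when eps or N changes: since B_opt is
    proportional to eps * N, scale a known-good batch size linearly."""
    return max(1, round(b_ref * (eps_new / eps_ref) * (n_new / n_ref)))

if __name__ == "__main__":
    # Suppose B = 64 maximizes test accuracy at eps = 0.1 on N = 50,000 examples.
    print(noise_scale(0.1, 50_000, 64))  # the corresponding noise scale, ~78
    # Doubling the learning rate should roughly double the optimal batch size.
    print(rescaled_batch_size(64, 0.1, 50_000, eps_new=0.2, n_new=50_000))  # 128
```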
Motivation
This paper shows that Bayesian principles can explain many recent observations in the deep learning literature, while also uncovering practical new insights. Zhang et al. (2016) trained deep convolutional networks on ImageNet and CIFAR10, achieving excellent accuracy on both training and test sets. They then took the same input images, but randomized the labels, and found that while their networks were now unable to generalize to the test set, they still memorized the training labels. They claimed these results contradict learning theory, although this claim is disputed (Kawaguchi et al., 2017; Dziugaite & Roy, 2017). Nonetheless, their results raise the question: if our models can assign arbitrary labels to the training set, why do they work so well in practice? Meanwhile, Keskar et al. (2016) observed that if we hold the learning rate fixed and increase the batch size, the test accuracy usually falls. This striking result shows that improving our estimate of the full-batch gradient can harm performance. Goyal et al. (2017) observed a linear scaling rule between batch size and learning rate in a deep ResNet, while Hoffer et al. (2017) proposed a square root rule on theoretical grounds. Many authors have suggested that “broad minima” whose curvature is small may generalize better than “sharp minima” whose curvature is large (Chaudhari et al., 2016; Hochreiter & Schmidhuber, 1997). Indeed, Dziugaite & Roy (2017) argued that the results of Zhang et al. (2016) can be understood using “nonvacuous” PAC-Bayes generalization bounds which penalize sharp minima, while Keskar et al. (2016) showed that stochastic gradient descent (SGD) finds wider minima as the batch size is reduced. However, Dinh et al. (2017) challenged this interpretation by arguing that the curvature of a minimum can be arbitrarily increased by changing the model parameterization.
Contribution
The main contributions of this paper are to show that:
- The results of Zhang et al. (2016) are not unique to deep learning; the same phenomenon is observed in a small “over-parameterized” linear model. This phenomenon is straightforwardly understood by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima but is invariant to the model parameterization (see the sketch after this list).
- SGD integrates a stochastic differential equation whose “noise scale” is [math]\displaystyle{ g \approx \epsilon N/B }[/math], where [math]\displaystyle{ \epsilon }[/math] is the learning rate, [math]\displaystyle{ N }[/math] the training set size and [math]\displaystyle{ B }[/math] the batch size. Noise drives SGD away from sharp minima, and therefore there is an optimal batch size which maximizes the test set accuracy. This optimal batch size is proportional to the learning rate and the training set size.
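The sketch below illustrates the first point under stated assumptions: an L2-regularized logistic regression (Gaussian prior with precision lam), fit by plain gradient descent, and a Laplace approximation to the log evidence. It is not the authors' code; the point is only that the evidence combines the training fit with an “Occam factor” containing the log determinant of the Hessian at the minimum, which penalizes sharp minima, and that it typically comes out larger for real labels than for randomized labels.

```python
# Minimal sketch, assuming L2-regularized logistic regression with a Gaussian
# prior N(0, 1/lam) and a Laplace approximation to the evidence. Illustrative
# only; not the paper's experimental setup.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_map(X, y, lam, steps=5000, lr=0.005):
    """Gradient descent on the regularized cost C(w) = NLL(w) + lam/2 ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) + lam * w
        w -= lr * grad
    return w

def log_evidence(X, y, w, lam):
    """Laplace approximation:
    log P(D) ~ -NLL(w*) - lam/2 ||w*||^2 + P/2 log(lam) - 1/2 log det H,
    where H = X^T S X + lam*I is the Hessian of C at the minimum w*.
    The log-det term is the curvature ("Occam") penalty on sharp minima."""
    p = X.shape[1]
    pred = sigmoid(X @ w)
    nll = -np.sum(y * np.log(pred + 1e-12) + (1 - y) * np.log(1 - pred + 1e-12))
    S = pred * (1 - pred)
    H = X.T @ (X * S[:, None]) + lam * np.eye(p)
    _, logdet_H = np.linalg.slogdet(H)
    return -nll - 0.5 * lam * w @ w + 0.5 * p * np.log(lam) - 0.5 * logdet_H

rng = np.random.default_rng(0)
n, p, lam = 100, 200, 1.0                              # over-parameterized: p > n
X = rng.normal(size=(n, p))
y_real = (X @ rng.normal(size=p) > 0).astype(float)    # labels from a true linear rule
y_rand = rng.integers(0, 2, size=n).astype(float)      # randomized labels

for name, y in [("real labels", y_real), ("random labels", y_rand)]:
    w = fit_map(X, y, lam)
    print(f"{name}: log evidence ~ {log_evidence(X, y, w, lam):.1f}")
```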
Main Results
References
Smith, Samuel L., Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. "Don't decay the learning rate, increase the batch size." arXiv preprint arXiv:1711.00489 (2017).
Chaudhari, Pratik, et al. "Entropy-sgd: Biasing gradient descent into wide valleys." arXiv preprint arXiv:1611.01838 (2016).
Dziugaite, Gintare Karolina, and Daniel M. Roy. "Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data." arXiv preprint arXiv:1703.11008 (2017).
Germain, Pascal, et al. "PAC-Bayesian theory meets Bayesian inference." Advances in Neural Information Processing Systems. 2016.
Goyal, Priya, et al. "Accurate, large minibatch SGD: training imagenet in 1 hour." arXiv preprint arXiv:1706.02677 (2017).
Gull, Stephen F. "Bayesian inductive inference and maximum entropy." Maximum-entropy and Bayesian methods in science and engineering. Springer, Dordrecht, 1988. 53-74.
Hoffer, Elad, Itay Hubara, and Daniel Soudry. "Train longer, generalize better: closing the generalization gap in large batch training of neural networks." Advances in Neural Information Processing Systems. 2017.
Kass, Robert E., and Adrian E. Raftery. "Bayes factors." Journal of the American Statistical Association 90.430 (1995): 773-795.
Kawaguchi, Kenji, Leslie Pack Kaelbling, and Yoshua Bengio. "Generalization in deep learning." arXiv preprint arXiv:1710.05468 (2017).
Keskar, Nitish Shirish, et al. "On large-batch training for deep learning: Generalization gap and sharp minima." arXiv preprint arXiv:1609.04836 (2016).
MacKay, David JC. "A practical Bayesian framework for backpropagation networks." Neural computation 4.3 (1992): 448-472.
Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).