the loss surfaces of multilayer networks (Choromanska et al.)
Overview
The paper Loss Surfaces of Multilayer Networks by Choromanska et al. is situated in the context of determining critical points (i.e. minima, maxima, or saddle points) of loss surfaces of deep multilayer network models, such as feedforward perceptrons.
The authors present a model of multilayer rectified linear units (ReLUs), and show that it may be expressed as a polynomial function of the parameter matrices in the network, with a polynomial degree equal to the number of layers. The ReLu units produce a piecewise, continuous polynomial, with monomials that are nonzero or zero at the boundaries between pieces. With this model, they study the distribution of critical points of the loss polynomial, providing an analysis with results from random matrix theory applied to spherical spin glasses.
The 3 key findings of this work are the following:
- For large-size networks, most local minima are equivalent and yield similar performance on a test set.
- The probability of finding a bad local minimum (i.e. one with a large value in terms of the loss function) may be large for small-size networks, but decreases quickly with network size.
- Obtaining the global minimum of the loss function using a training dataset is not useful in practice.
Many theoretical results are reported, which will not be exhaustively covered here. However, a high-level overview of proof techniques will be given, followed by a summary of the experimental results.
Theoretical Analysis
Consider a simple fully-connected feed-forward deep network [math]\displaystyle{ \mathcal{N} }[/math] with a single output for a binary classification task. The authors use the convention that [math]\displaystyle{ (H-1) }[/math] denotes the number of hidden layers in the network (the input layer is the [math]\displaystyle{ 0^{\text{th}} }[/math] layer and the output layer is the [math]\displaystyle{ H^{\text{th}} }[/math] layer). The input [math]\displaystyle{ X }[/math] is a vector with [math]\displaystyle{ d }[/math] elements, assumed to be random. The variable [math]\displaystyle{ n_i }[/math] denotes the number of units in the [math]\displaystyle{ i^{\text{th}} }[/math] layer (due to the network restrictions, [math]\displaystyle{ n_0 = d }[/math] and [math]\displaystyle{ n_H = 1 }[/math]). Finally, [math]\displaystyle{ W_i }[/math] s the matrix of weights between [math]\displaystyle{ (i - 1)^{\text{th}} }[/math] and [math]\displaystyle{ i^{th} }[/math] layers of the network and [math]\displaystyle{ \sigma = \max(0,x) }[/math] is the activation function. For a random input [math]\displaystyle{ X }[/math], the random network output [math]\displaystyle{ Y }[/math] is [math]\displaystyle{ Y = q\sigma(W_H^{\top}\sigma(W_{H-1}^{\top}\dots\sigma(W_1^{\top}X)))\dots), }[/math] where [math]\displaystyle{ q }[/math] is a normalization factor.
The key assumption in the theoretical work is the following: for ReLu activation functions [math]\displaystyle{ \sigma(x) }[/math] for a random variable [math]\displaystyle{ x }[/math], the output can be seen as being equal to [math]\displaystyle{ \delta \cdot x }[/math], where [math]\displaystyle{ x }[/math] is a (not necessarily random) nonzero variable and [math]\displaystyle{ \delta }[/math] is a new random variable that is identically equal to either 0 or 1. With this in mind, the output of the network can be re-expressed as: [math]\displaystyle{ Y = q\sum_{i=1}^{n_0}X_{i}\sum_{j = 1}^\gamma A_{i,j}\prod_{k = 1}^{H}w_{i,j}^{(k)}, }[/math]
where [math]\displaystyle{ A_{i,j} }[/math] is a random variable equal to 0 or 1, denoting a path [math]\displaystyle{ (i,j) }[/math] to be active ([math]\displaystyle{ A_{i,j} = 1 }[/math]) or not ([math]\displaystyle{ A_{i,j} = 0 }[/math]). In this expression, the first summation over [math]\displaystyle{ i }[/math] is over the elements of the network input vector, and the second summation over [math]\displaystyle{ j }[/math] is over all paths from [math]\displaystyle{ X_i }[/math] to the output. The upper index on this second summation is [math]\displaystyle{ \gamma = n_1n_2\dots n_H }[/math] for all possible paths. The term [math]\displaystyle{ w_{i,j}^{(k)} }[/math] refers to the value of the parameter matrix in layer that corresponds to the hidden vector element that produced the path (i.e. the [math]\displaystyle{ k^{\text{th}} }[/math] segment of path indexed with [math]\displaystyle{ (i,j) }[/math]); hence why there are [math]\displaystyle{ H }[/math] [math]\displaystyle{ w_{i,j}^(k) }[/math] terms per path.
From this equation, it can be seen that the output of the ReLu network is polynomial in the weight matrix parameters, and the treatment of [math]\displaystyle{ A_{i,j} }[/math] as a random indicator variable allows connections to be made with spin glass models.
The remainder of the theoretical analysis proceeds as follows:
The input vector [math]\displaystyle{ X }[/math] and all [math]\displaystyle{ \{A_{i,j}\} }[/math] are assumed to be random variables, where [math]\displaystyle{ A_{i,j} }[/math] is a Bernoulli random variable and all input elements of [math]\displaystyle{ X }[/math] are independent.
One further critical assumption is the spherical constraint; all parameter weights [math]\displaystyle{ w_i }[/math] (elements of the parameter matrices) satisfy a spherical bound:
[math]\displaystyle{ 1/\Lambda \sum_i^\Lambda w_i^2 = C }[/math]
for some [math]\displaystyle{ C \gt 0 }[/math] where [math]\displaystyle{ \Lambda }[/math] is the number of parameters.
These assumptions allow the network output to be modeled as a spherical spin glass model, which is a physical model for magnetic dipoles in ferromagnetic materials (a dipole has a magnetization state that is a binary random variable)
Using this assumption, the work by Auffinger et al. (2010) in the field of random matrices and spin glasses is then used to relate the energy states of system Hamiltonians of spin glass models to the critical points of the neural network loss function.
The analysis shows that the critical points of the loss function correspond to different energy bands in the spin glass model; as in a physical system, higher energy states are less probable; while the number of states is infinite, the probability of the system appearing in that state vanishes.
The energy barrier [math]\displaystyle{ E_{\infty} }[/math] stems from this analysis, and is given by
[math]\displaystyle{ E_{\infty} = E_{\infty}(H) = 2\sqrt{\frac{H-1}{H}}. }[/math] Auffinger et al. show that critical values of the loss function must relate energies below [math]\displaystyle{ -\Lambda E_{\infty} }[/math] if their critical band index (i.e. energy index) is finite.
Experiments
The numerical experiments conducted were to verify the theoretical claims of the distribution of critical points around the energy bound [math]\displaystyle{ E_{\infty} }[/math], as well as to correlate the testing and training loss for different numbers of parameters [math]\displaystyle{ (\Lambda) }[/math] in the models.
MNIST Experiments
ReLu neural networks with a single layer and increasing [math]\displaystyle{ \Lambda \in \{25,50,100,250,500 \} }[/math] were training for multiclass classification on a scaled-down version of the MNIST digit dataset, where each image was downsampled to [math]\displaystyle{ 10 \times 10 }[/math] pixels. For each value of [math]\displaystyle{ \Lambda }[/math], 200 epochs of SGD with a decaying learning rate were used to optimize the parameters in the network. The optimization experiments were performed 1000 times with different initial values for the weight parameters drawn uniformly randomly from [math]\displaystyle{ [-1,1] }[/math].
Results
To evaluate the distribution of the energy states that each critical point (i.e. solution) of the loss function, the eigenvalues of the Hamiltonian matrix of the loss function was computed for the parameters after the optimization procdure completed. The distribution of the (normalized) index of the energy states is shown below in Fig 1. It can be seen that for all models with different numbers of parameters, the energy states occupied are the low energy bands.
The final values of the loss function in these experiments are also shown in the histograms in Fig 2. Interestingly, the variance in the loss decreases with increasing numbers of parameters, despite the fact that the spread in the energy state (Fig. 1) increases. This shows that despite the fact that local minima are more prevalent for models with many parameters, there is no appreciable difference in the loss function at these minima: the minima are essentially all equally good in terms of minimizing the objective cost.
Finally, a scatter plot of the training vs testing error for each model is shown in Fig. 3. It can be seen that the correlation between the two errors decreases as the number of parameters increases, suggesting that obtaining a global minimum would not necessarily produce better testing results (and hence still would have a sizeable generalization error).
Discussion
Power of Deep Neural Nets from the No Free Lunch Point View
A far out view for the explanation of why Deep Neural Networks has lower probability of bad local minimas is Woodward's <ref>Woodward, John R. "GA or GP? that is not the question." Evolutionary Computation, 2003. CEC'03. The 2003 Congress on. Vol. 2. IEEE, 2003.</ref> paper on why the No Free Lunch Theorem (NFLT) doesn't hold. First the NFLT is a theorem that basically states if one cannot incorporate domain specific knowledge into the search or optimization algorithm, one cannot guarantee the search/optimization algorithm can out perform (in terms of convergence speed) any other search/optimization algorithms, this implies there could be no universal search algorithm that is the best.
Woodward's argument is that whether you use Genetic Algorithm or Genetic Programming doesn't matter, what matters is the solution mapping. Consider the case the task of Symbolic Regression where we have 2 algorithms [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], let [math]\displaystyle{ P_{s} = \{+ba, a, b, +aa, +bb, +ab\} }[/math] and [math]\displaystyle{ Q_{s} = \{+ab, +ba, a, b, +aa, +bb\} }[/math] be the time ordered explored solutions of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math]. If the problem we face requires a solution [math]\displaystyle{ +ab }[/math] then both [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] discovers the solution on their first try. However for any other solution algorithm [math]\displaystyle{ P }[/math] will always outperform [math]\displaystyle{ Q }[/math].
If we equate convergence speed with quality, then for deep neural networks the larger or deeper the network size, the more likely the network connections will be able to generate a function that minimizes the loss the most over smaller networks, thus minimizing chances of bad local minimas.