the loss surfaces of multilayer networks (Choromanska et al.)

From statwiki
Revision as of 09:46, 30 August 2017 by Conversion script (talk | contribs) (Conversion script moved page The loss surfaces of multilayer networks (Choromanska et al.) to the loss surfaces of multilayer networks (Choromanska et al.): Converting page titles to lowercase)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Overview

The paper Loss Surfaces of Multilayer Networks by Choromanska et al. is situated in the context of determining critical points (i.e. minima, maxima, or saddle points) of loss surfaces of deep multilayer network models, such as feedforward perceptrons.

The authors present a model of multilayer rectified linear units (ReLUs), and show that it may be expressed as a polynomial function of the parameter matrices in the network, with a polynomial degree equal to the number of layers. The ReLu units produce a piecewise, continuous polynomial, with monomials that are nonzero or zero at the boundaries between pieces. With this model, they study the distribution of critical points of the loss polynomial, providing an analysis with results from random matrix theory applied to spherical spin glasses.

The 3 key findings of this work are the following:

  • For large-size networks, most local minima are equivalent and yield similar performance on a test set.
  • The probability of finding a bad local minimum (i.e. one with a large value in terms of the loss function) may be large for small-size networks, but decreases quickly with network size.
  • Obtaining the global minimum of the loss function using a training dataset is not useful in practice.

Many theoretical results are reported, which will not be exhaustively covered here. However, a high-level overview of proof techniques will be given, followed by a summary of the experimental results.

Prior Work

Earlier work has shown, for high-dimensional random Gaussian error functions, that critical points with error much higher than the global minimum are very likely to be saddle points (e.g. <ref>Bray, A. J., & Dean, D. S. (2007). Statistics of critical points of Gaussian fields on large-dimensional spaces. Physical review letters, 98(15), 150201.</ref>). Furthermore, all local minima are likely to be very close in functional value to the global minimum. Dauphin et al. <ref> Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems (pp. 2933-2941). </ref> empirically show that the cost functions of neural networks behave similarly to Gaussian error functions in high-dimensional spaces, but no theoretical justification is provided. This is one of the main contributions of this paper.

In <ref>Auffinger, A., & Arous, G. B. (2013). Complexity of random smooth functions on the high-dimensional sphere. The Annals of Probability, 41(6), 4214-4247.</ref>, an asymptotic evaluation of the complexity of the spherical spin-glass model from condensed matter physics is provided. The authors found that the critical values with low Hamiltonian values have a layered structure that behaves like a Gaussian process. This work shows, under the assumptions listed in the overview, that the objective function used by a neural network is analogous to the Hamiltonian of the spin-glass problem. This means that they exhibit similar behaviour. This is not the first attempt at connecting the spin-glass problem with neural networks but none had attempted to optimize the neural network objective using the theory developed for the spin-glass problem. Thus, this paper is also novel in that respect.

Theoretical Analysis

Consider a simple fully-connected feed-forward deep network [math]\displaystyle{ \mathcal{N} }[/math] with a single output for a binary classification task. The authors use the convention that [math]\displaystyle{ (H-1) }[/math] denotes the number of hidden layers in the network (the input layer is the [math]\displaystyle{ 0^{\text{th}} }[/math] layer and the output layer is the [math]\displaystyle{ H^{\text{th}} }[/math] layer). The input [math]\displaystyle{ X }[/math] is a vector with [math]\displaystyle{ d }[/math] elements, assumed to be random. The variable [math]\displaystyle{ n_i }[/math] denotes the number of units in the [math]\displaystyle{ i^{\text{th}} }[/math] layer (due to the network restrictions, [math]\displaystyle{ n_0 = d }[/math] and [math]\displaystyle{ n_H = 1 }[/math]). Finally, [math]\displaystyle{ W_i }[/math] s the matrix of weights between [math]\displaystyle{ (i - 1)^{\text{th}} }[/math] and [math]\displaystyle{ i^{th} }[/math] layers of the network and [math]\displaystyle{ \sigma = \max(0,x) }[/math] is the activation function. For a random input [math]\displaystyle{ X }[/math], the random network output [math]\displaystyle{ Y }[/math] is [math]\displaystyle{ Y = q\sigma(W_H^{\top}\sigma(W_{H-1}^{\top}\dots\sigma(W_1^{\top}X)))\dots), }[/math] where [math]\displaystyle{ q }[/math] is a normalization factor.

The key assumption in the theoretical work is the following: for ReLu activation functions [math]\displaystyle{ \sigma(x) }[/math] for a random variable [math]\displaystyle{ x }[/math], the output can be seen as being equal to [math]\displaystyle{ \delta \cdot x }[/math], where [math]\displaystyle{ x }[/math] is a (not necessarily random) nonzero variable and [math]\displaystyle{ \delta }[/math] is a new random variable that is identically equal to either 0 or 1. With this in mind, the output of the network can be re-expressed as: [math]\displaystyle{ Y = q\sum_{i=1}^{n_0}X_{i}\sum_{j = 1}^\gamma A_{i,j}\prod_{k = 1}^{H}w_{i,j}^{(k)}, }[/math]

where [math]\displaystyle{ A_{i,j} }[/math] is a random variable equal to 0 or 1, denoting a path [math]\displaystyle{ (i,j) }[/math] to be active ([math]\displaystyle{ A_{i,j} = 1 }[/math]) or not ([math]\displaystyle{ A_{i,j} = 0 }[/math]). In this expression, the first summation over [math]\displaystyle{ i }[/math] is over the elements of the network input vector, and the second summation over [math]\displaystyle{ j }[/math] is over all paths from [math]\displaystyle{ X_i }[/math] to the output. The upper index on this second summation is [math]\displaystyle{ \gamma = n_1n_2\dots n_H }[/math] for all possible paths. The term [math]\displaystyle{ w_{i,j}^{(k)} }[/math] refers to the value of the parameter matrix in the layer that corresponds to the hidden vector element that produced the path (i.e. the [math]\displaystyle{ k^{\text{th}} }[/math] segment of path indexed with [math]\displaystyle{ (i,j) }[/math]); hence why there are [math]\displaystyle{ H w_{i,j}^{(k)} }[/math] terms per path.

From this equation, it can be seen that the output of the ReLu network is polynomial in the weight matrix parameters, and the treatment of [math]\displaystyle{ A_{i,j} }[/math] as a random indicator variable allows connections to be made with spin glass models.

The remainder of the theoretical analysis proceeds as follows:

  • The input vector [math]\displaystyle{ X }[/math] and all [math]\displaystyle{ \{A_{i,j}\} }[/math] are assumed to be random variables, where [math]\displaystyle{ A_{i,j} }[/math] is a Bernoulli random variable and all input elements of [math]\displaystyle{ X }[/math] are independent.

  • One further critical assumption is the spherical constraint; all parameter weights [math]\displaystyle{ w_i }[/math] (elements of the parameter matrices) satisfy a spherical bound:

    [math]\displaystyle{ 1/\Lambda \sum_i^\Lambda w_i^2 = C }[/math]

    for some [math]\displaystyle{ C \gt 0 }[/math] where [math]\displaystyle{ \Lambda }[/math] is the number of parameters.

  • These assumptions allow the network output to be modeled as a spherical spin glass model, which is a physical model for magnetic dipoles in ferromagnetic materials (a dipole has a magnetization state that is a binary random variable)

  • Using this assumption, the work by Auffinger et al. (2010) in the field of random matrices and spin glasses is then used to relate the energy states of system Hamiltonians of spin glass models to the critical points of the neural network loss function.

  • The analysis shows that the critical points of the loss function correspond to different energy bands in the spin glass model; as in a physical system, higher energy states are less probable; while the number of states is infinite, the probability of the system appearing in that state vanishes.

  • The energy barrier [math]\displaystyle{ E_{\infty} }[/math] stems from this analysis, and is given by

    [math]\displaystyle{ E_{\infty} = E_{\infty}(H) = 2\sqrt{\frac{H-1}{H}}. }[/math] Auffinger et al. show that critical values of the loss function must relate energies below [math]\displaystyle{ -\Lambda E_{\infty} }[/math] if their critical band index (i.e. energy index) is finite.

Experiments

The numerical experiments conducted were to verify the theoretical claims of the distribution of critical points around the energy bound [math]\displaystyle{ E_{\infty} }[/math], as well as to correlate the testing and training loss for different numbers of parameters [math]\displaystyle{ (\Lambda) }[/math] in the models.

MNIST Experiments

ReLu neural networks with a single layer and increasing [math]\displaystyle{ \Lambda \in \{25,50,100,250,500 \} }[/math] were training for multiclass classification on a scaled-down version of the MNIST digit dataset, where each image was downsampled to [math]\displaystyle{ 10 \times 10 }[/math] pixels. For each value of [math]\displaystyle{ \Lambda }[/math], 200 epochs of SGD with a decaying learning rate were used to optimize the parameters in the network. The optimization experiments were performed 1000 times with different initial values for the weight parameters drawn uniformly randomly from [math]\displaystyle{ [-1,1] }[/math].

Results

To evaluate the distribution of the energy states that each critical point (i.e. solution) of the loss function, the eigenvalues of the Hamiltonian matrix of the loss function was computed for the parameters after the optimization procdure completed. The distribution of the (normalized) index of the energy states is shown below in Fig 1. It can be seen that for all models with different numbers of parameters, the energy states occupied are the low energy bands.

File:index dist.png
Fig 1. Distribution of normalized indices of energy states as computed from the system Hamiltonian at the final values of the parameters after the SGD optimization procedure completed.

The final values of the loss function in these experiments are also shown in the histograms in Fig 2. Interestingly, the variance in the loss decreases with increasing numbers of parameters, despite the fact that the spread in the energy state (Fig. 1) increases. This shows that despite the fact that local minima are more prevalent for models with many parameters, there is no appreciable difference in the loss function at these minima: the minima are essentially all equally good in terms of minimizing the objective cost.

File:loss distribution.png
Empirical distribution of the values of the loss function over the course of 1000 experiments with different numbers of parameters (Lambda). Each experimental run used a different random initialization of the parameter weights.

Finally, a scatter plot of the training vs testing error for each model is shown in Fig. 3. It can be seen that the correlation between the two errors decreases as the number of parameters increases, suggesting that obtaining a global minimum would not necessarily produce better testing results (and hence still would have a sizeable generalization error).

File:train test corr.png
Fig 3. Scatter plots showing the the correlation between training and testing error for the MNIST dataset experiments. For few parameters in the network, there is a very strong correlation between the two errors. However, for networks with many more parameters, the correlations decrease in strength, suggesting that obtaining the optimal loss (critical point) in the training phase does not improve the generalization error.


Discussion

Power of Deep Neural Nets from the No Free Lunch Point View

A far out view for the explanation of why Deep Neural Networks has lower probability of bad local minimas is Woodward's <ref>Woodward, John R. "GA or GP? that is not the question." Evolutionary Computation, 2003. CEC'03. The 2003 Congress on. Vol. 2. IEEE, 2003.</ref> paper on why the No Free Lunch Theorem (NFLT) doesn't hold. First the NFLT is a theorem that basically states if one cannot incorporate domain specific knowledge into the search or optimization algorithm, one cannot guarantee the search/optimization algorithm can out perform (in terms of convergence speed) any other search/optimization algorithms, this implies there could be no universal search algorithm that is the best.

Woodward's argument is that whether you use Genetic Algorithm or Genetic Programming doesn't matter, what matters is the solution mapping. Consider the case the task of Symbolic Regression where we have 2 algorithms [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], let [math]\displaystyle{ P_{s} = \{+ba, a, b, +aa, +bb, +ab\} }[/math] and [math]\displaystyle{ Q_{s} = \{+ab, +ba, a, b, +aa, +bb\} }[/math] be the time ordered explored solutions of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math]. If the problem we face requires a solution [math]\displaystyle{ +ab }[/math] then both [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] discovers the solution on their first try. However for any other solution algorithm [math]\displaystyle{ P }[/math] will always outperform [math]\displaystyle{ Q }[/math].

From the above we might be able to conclude that for deep neural networks the larger or deeper the network size, the more likely the network connections will be able to generate a function that minimizes the loss faster then smaller networks (since all networks were trained for 200 epochs, theoretically a single layer MLP should be able to approximate any function, but practically that could take forever), thus minimizing chances of bad local minimas (similar to how if you have a complex function you have a better chance of fitting data over a simpler function).

References

<references/>