the loss surfaces of multilayer networks (Choromanska et al.): Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 51: Line 51:


= Results =
= Results =
<center>
[[File:index_dist.png | frame | center |Distribution of normalized indices of energy states as computed from the system Hamiltonian at the final values of the parameters after the SGD optimization procedure completed.]]
</center>
<center>
[[File:loss_distribution.png | frame | center | Empirical distribution of the values of the loss function over the course of 1000 experiments with different numbers of parameters (Lambda). Each experimental run used a different random initialization of the parameter weights.]]
</center>
<center>
[[File:train_test_corr.png | frame | center |Scatter plots showing the the correlation between training and testing error for the MNIST dataset experiments. For few parameters in the network, there is a very strong correlation between the two errors. However, for networks with many more parameters, the correlations decrease in strength, suggesting that obtaining the optimal loss (critical point) in the training phase does not improve the generalization error. ]]
</center>

Revision as of 00:36, 19 November 2015

Overview

The paper Loss Surfaces of Multilayer Networks by Choromanska et al. is situated in the context of determining critical points (i.e. minima, maxima, or saddle points) of loss surfaces of deep multilayer network models, such as feedforward perceptrons.

The authors present a model of multilayer rectified linear units (ReLUs), and show that it may be expressed as a polynomial function of the parameter matrices in the network, with a polynomial degree equal to the number of layers. The ReLu units produce a piecewise, continuous polynomial, with monomials that are nonzero or zero at the boundaries between pieces. With this model, they study the distribution of critical points of the loss polynomial, providing an analysis with results from random matrix theory applied to spherical spin glasses.

The 3 key findings of this work are the following:

  • For large-size networks, most local minima are equivalent and yield similar performance on a test set.
  • The probability of finding a bad local minimum (i.e. one with a large value in terms of the loss function) may be large for small-size networks, but decreases quickly with network size.
  • Obtaining the global minimum of the loss function using a training dataset is not useful in practice.

Many theoretical results are reported, which will not be exhaustively covered here. However, a high-level overview of proof techniques will be given, followed by a summary of the experimental results.

Theoretical Analysis

Consider a simple fully-connected feed-forward deep network [math]\displaystyle{ \mathcal{N} }[/math] with a single output for a binary classification task. The authors use the convention that [math]\displaystyle{ (H-1) }[/math] denotes the number of hidden layers in the network (the input layer is the [math]\displaystyle{ 0^{\text{th}} }[/math] layer and the output layer is the [math]\displaystyle{ H^{\text{th}} }[/math] layer). The input [math]\displaystyle{ X }[/math] is a vector with [math]\displaystyle{ d }[/math] elements, assumed to be random. The variable [math]\displaystyle{ n_i }[/math] denotes the number of units in the [math]\displaystyle{ i^{\text{th}} }[/math] layer (due to the network restrictions, [math]\displaystyle{ n_0 = d }[/math] and [math]\displaystyle{ n_H = 1 }[/math]). Finally, [math]\displaystyle{ W_i }[/math] s the matrix of weights between [math]\displaystyle{ (i - 1)^{\text{th}} }[/math] and [math]\displaystyle{ i^{th} }[/math] layers of the network and [math]\displaystyle{ \sigma = \max(0,x) }[/math] is the activation function. For a random input [math]\displaystyle{ X }[/math], the random network output [math]\displaystyle{ Y }[/math] is [math]\displaystyle{ Y = q\sigma(W_H^{\top}\sigma(W_{H-1}^{\top}\dots\sigma(W_1^{\top}X)))\dots), }[/math] where [math]\displaystyle{ q }[/math] is a normalization factor.

The key assumption in the theoretical work is the following: for ReLu activation functions [math]\displaystyle{ \sigma(x) }[/math] for a random variable [math]\displaystyle{ x }[/math], the output can be seen as being equal to [math]\displaystyle{ \delta \cdot x }[/math], where [math]\displaystyle{ x }[/math] is a (not necessarily random) nonzero variable and [math]\displaystyle{ \delta }[/math] is a new random variable that is identically equal to either 0 or 1. With this in mind, the output of the network can be re-expressed as: [math]\displaystyle{ Y = q\sum_{i=1}^{n_0}X_{i}\sum_{j = 1}^\gamma A_{i,j}\prod_{k = 1}^{H}w_{i,j}^{(k)}, }[/math]

where [math]\displaystyle{ A_{i,j} }[/math] is a random variable equal to 0 or 1, denoting a path [math]\displaystyle{ (i,j) }[/math] to be active ([math]\displaystyle{ A_{i,j} = 1 }[/math]) or not ([math]\displaystyle{ A_{i,j} = 0 }[/math]). In this expression, the first summation over [math]\displaystyle{ i }[/math] is over the elements of the network input vector, and the second summation over [math]\displaystyle{ j }[/math] is over all paths from [math]\displaystyle{ X_i }[/math] to the output. The upper index on this second summation is [math]\displaystyle{ \gamma = n_1n_2\dots n_H }[/math] for all possible paths. The term [math]\displaystyle{ w_{i,j}^{(k)} }[/math] refers to the value of the parameter matrix in layer that corresponds to the hidden vector element that produced the path (i.e. the [math]\displaystyle{ k^{\text{th}} }[/math] segment of path indexed with [math]\displaystyle{ (i,j) }[/math]); hence why there are [math]\displaystyle{ H }[/math] [math]\displaystyle{ w_{i,j}^(k) }[/math] terms per path.

From this equation, it can be seen that the output of the ReLu network is polynomial in the weight matrix parameters, and the treatment of [math]\displaystyle{ A_{i,j} }[/math] as a random indicator variable allows connections to be made with spin glass models.

The remainder of the theoretical analysis proceeds as follows:

  • The input vector [math]\displaystyle{ X }[/math] and all [math]\displaystyle{ \{A_{i,j}\} }[/math] are assumed to be random variables, where [math]\displaystyle{ A_{i,j} }[/math] is a Bernoulli random variable and all input elements of [math]\displaystyle{ X }[/math] are independent.

  • One further critical assumption is the spherical constraint; all parameter weights [math]\displaystyle{ w_i }[/math] (elements of the parameter matrices) satisfy a spherical bound:

    [math]\displaystyle{ 1/\Lambda \sum_i^\Lambda w_i^2 = C }[/math]

    for some [math]\displaystyle{ C \gt 0 }[/math] where [math]\displaystyle{ \Lambda }[/math] is the number of parameters.

  • These assumptions allow the network output to be modeled as a spherical spin glass model, which is a physical model for magnetic dipoles in ferromagnetic materials (a dipole has a magnetization state that is a binary random variable)

  • Using this assumption, the work by Auffinger et al. (2010) in the field of random matrices and spin glasses is then used to relate the energy states of system Hamiltonians of spin glass models to the critical points of the neural network loss function.

  • The analysis shows that the critical points of the loss function correspond to different energy bands in the spin glass model; as in a physical system, higher energy states are less probable; while the number of states is infinite, the probability of the system appearing in that state vanishes.

  • The energy barrier [math]\displaystyle{ E_{\infty} }[/math] stems from this analysis, and is given by

    [math]\displaystyle{ E_{\infty} = E_{\infty}(H) = 2\sqrt{\frac{H-1}{H}}. }[/math] Auffinger et al. show that critical values of the loss function must relate energies below [math]\displaystyle{ -\Lambda E_{\infty} }[/math] if their critical band index (i.e. energy index) is finite.

Experiments

The numerical experiments conducted were to verify the theoretical claims of the distribution of critical points around the energy bound [math]\displaystyle{ E_{\infty} }[/math], as well as to correlate the testing and training loss for different numbers of parameters [math]\displaystyle{ (\Lambda) }[/math] in the models.

MNIST Experiments

ReLu neural networks with a single layer and increasing [math]\displaystyle{ \Lambda \in \{25,50,100,250,500 \} }[/math] were training for multiclass classification on a scaled-down version of the MNIST digit dataset, where each image was downsampled to [math]\displaystyle{ 10 \times 10 }[/math] pixels. For each value of [math]\displaystyle{ \Lambda }[/math], 200 epochs of SGD with a decaying learning rate were used to optimize the parameters in the network. The optimization experiments were performed 1000 times with different initial values for the weight parameters drawn uniformly randomly from [math]\displaystyle{ [-1,1] }[/math].

Results

File:index dist.png
Distribution of normalized indices of energy states as computed from the system Hamiltonian at the final values of the parameters after the SGD optimization procedure completed.
File:loss distribution.png
Empirical distribution of the values of the loss function over the course of 1000 experiments with different numbers of parameters (Lambda). Each experimental run used a different random initialization of the parameter weights.
File:train test corr.png
Scatter plots showing the the correlation between training and testing error for the MNIST dataset experiments. For few parameters in the network, there is a very strong correlation between the two errors. However, for networks with many more parameters, the correlations decrease in strength, suggesting that obtaining the optimal loss (critical point) in the training phase does not improve the generalization error.