the loss surfaces of multilayer networks (Choromanska et al.)

Overview

The paper Loss Surfaces of Multilayer Networks by Choromanska et al. is situated in the context of determining critical points (i.e. minima, maxima, or saddle points) of the loss surfaces of deep multilayer network models, such as feedforward multilayer perceptrons.

The authors present a model of multilayer rectified linear units (ReLUs), and show that its output may be expressed as a polynomial function of the parameter matrices of the network, with polynomial degree equal to the number of layers. The ReLU units produce a piecewise continuous polynomial whose monomials switch between zero and nonzero values at the boundaries between pieces. With this model, they study the distribution of critical points of the loss polynomial, drawing on results from random matrix theory applied to spherical spin glasses.

The three key findings of this work are the following:

  • For large-size networks, most local minima are equivalent and yield similar performance on a test set.
  • The probability of finding a bad local minimum (i.e. one with a large value of the loss function) may be large for small-size networks, but decreases quickly with network size.
  • Obtaining the global minimum of the loss function on the training dataset is not useful in practice, since it does not necessarily improve test performance.

Many theoretical results are reported, which will not be exhaustively covered here. However, a high-level overview of proof techniques will be given, followed by a summary of the experimental results.

Theoretical Analysis

Consider a simple fully-connected feed-forward deep network [math]\displaystyle{ \mathcal{N} }[/math] with a single output for a binary classification task. The authors use the convention that [math]\displaystyle{ (H-1) }[/math] denotes the number of hidden layers in the network (the input layer is the [math]\displaystyle{ 0^{\text{th}} }[/math] layer and the output layer is the [math]\displaystyle{ H^{\text{th}} }[/math] layer). The input [math]\displaystyle{ X }[/math] is a vector with [math]\displaystyle{ d }[/math] elements, assumed to be random. The variable [math]\displaystyle{ n_i }[/math] denotes the number of units in the [math]\displaystyle{ i^{\text{th}} }[/math] layer (by this convention, [math]\displaystyle{ n_0 = d }[/math] and [math]\displaystyle{ n_H = 1 }[/math]). Finally, [math]\displaystyle{ W_i }[/math] is the matrix of weights between the [math]\displaystyle{ (i - 1)^{\text{th}} }[/math] and [math]\displaystyle{ i^{\text{th}} }[/math] layers of the network, and [math]\displaystyle{ \sigma(x) = \max(0,x) }[/math] is the activation function. For a random input [math]\displaystyle{ X }[/math], the random network output [math]\displaystyle{ Y }[/math] is [math]\displaystyle{ Y = q\sigma(W_H^{\top}\sigma(W_{H-1}^{\top}\dots\sigma(W_1^{\top}X)\dots)), }[/math] where [math]\displaystyle{ q }[/math] is a normalization factor.
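
To make the notation concrete, here is a minimal NumPy sketch of this forward pass; the layer sizes and [math]\displaystyle{ q = 1 }[/math] are illustrative assumptions, not values from the paper.

<pre>
import numpy as np

# Minimal sketch of Y = q * sigma(W_H^T sigma(... sigma(W_1^T X) ...)).
# Layer sizes and q are illustrative assumptions, not values from the paper.
def relu(x):
    return np.maximum(0.0, x)

def forward(X, weights, q=1.0):
    """weights[k] is W_{k+1} with shape (n_k, n_{k+1}); X has shape (n_0,)."""
    h = X
    for W in weights:
        h = relu(W.T @ h)
    return q * h  # n_H = 1, so this is a length-1 array

rng = np.random.default_rng(0)
sizes = [4, 3, 2, 1]   # n_0 = d = 4, two hidden layers, n_H = 1 (so H = 3)
weights = [rng.standard_normal((sizes[k], sizes[k + 1])) for k in range(len(sizes) - 1)]
X = rng.standard_normal(sizes[0])
print(forward(X, weights))
</pre>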

The key assumption in the theoretical work is the following: for a ReLU activation [math]\displaystyle{ \sigma(x) }[/math] applied to an input [math]\displaystyle{ x }[/math], the output can be written as [math]\displaystyle{ \delta \cdot x }[/math], where [math]\displaystyle{ \delta }[/math] is a random variable identically equal to either 0 or 1, indicating whether the unit is active. With this in mind, the output of the network can be re-expressed as: [math]\displaystyle{ Y = q\sum_{i=1}^{n_0}X_{i}\sum_{j = 1}^\gamma A_{i,j}\prod_{k = 1}^{H}w_{i,j}^{(k)}, }[/math]

where [math]\displaystyle{ A_{i,j} }[/math] is a random variable equal to 0 or 1, indicating whether path [math]\displaystyle{ (i,j) }[/math] is active ([math]\displaystyle{ A_{i,j} = 1 }[/math]) or not ([math]\displaystyle{ A_{i,j} = 0 }[/math]). In this expression, the first summation over [math]\displaystyle{ i }[/math] is over the elements of the network input vector, and the second summation over [math]\displaystyle{ j }[/math] is over all paths from [math]\displaystyle{ X_i }[/math] to the output; the upper limit of this second summation is [math]\displaystyle{ \gamma = n_1n_2\dots n_H }[/math], the number of possible paths. The term [math]\displaystyle{ w_{i,j}^{(k)} }[/math] is the entry of the [math]\displaystyle{ k^{\text{th}} }[/math] layer's parameter matrix used along the [math]\displaystyle{ k^{\text{th}} }[/math] segment of the path indexed by [math]\displaystyle{ (i,j) }[/math]; hence there are [math]\displaystyle{ H }[/math] factors [math]\displaystyle{ w_{i,j}^{(k)} }[/math] in each path's product.

From this equation, it can be seen that the output of the ReLU network is polynomial in the weight matrix parameters, and treating [math]\displaystyle{ A_{i,j} }[/math] as a random indicator variable allows connections to be made with spin glass models.
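
As a sanity check on the path-sum expression, the following sketch (with arbitrary small layer sizes and random weights; none of these values come from the paper) computes the output of a tiny ReLU network both by the usual forward pass and by summing over all [math]\displaystyle{ \gamma }[/math] paths, treating [math]\displaystyle{ A_{i,j} }[/math] as the product of the activation indicators along the path; the two agree up to floating-point error.

<pre>
import numpy as np
from itertools import product

# Hedged sanity check of the path-sum expression above.
# Sizes, weights, and q are illustrative assumptions, not values from the paper.
rng = np.random.default_rng(1)
n0, n1, n2, nH = 3, 4, 2, 1          # H = 3 weight matrices, gamma = n1*n2*nH paths per input
W1 = rng.standard_normal((n0, n1))
W2 = rng.standard_normal((n1, n2))
W3 = rng.standard_normal((n2, nH))
X = rng.standard_normal(n0)
q = 1.0

# Usual forward pass, keeping the activation indicators (the delta / A variables).
z1 = W1.T @ X;  a1 = (z1 > 0).astype(float);  h1 = a1 * z1
z2 = W2.T @ h1; a2 = (z2 > 0).astype(float);  h2 = a2 * z2
z3 = W3.T @ h2; a3 = (z3 > 0).astype(float);  Y_forward = q * a3 * z3

# Path-sum: enumerate all paths (i, j1, j2, out); a path is active iff every
# ReLU unit along it is active, i.e. A_{i,j} = a1[j1] * a2[j2] * a3[out].
Y_paths = 0.0
for i in range(n0):
    for j1, j2, out in product(range(n1), range(n2), range(nH)):
        A = a1[j1] * a2[j2] * a3[out]
        Y_paths += X[i] * A * W1[i, j1] * W2[j1, j2] * W3[j2, out]
Y_paths *= q

print(np.allclose(Y_forward, Y_paths))   # True (up to floating point)
</pre>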

The remainder of the theoretical analysis proceeds as follows:

  • The input vector [math]\displaystyle{ X }[/math] and all [math]\displaystyle{ \{A_{i,j}\} }[/math] are assumed to be random variables, where [math]\displaystyle{ A_{i,j} }[/math] is a Bernoulli random variable and all input elements of [math]\displaystyle{ X }[/math] are independent.

  • One further critical assumption is the spherical constraint: all parameter weights [math]\displaystyle{ w_i }[/math] (elements of the parameter matrices) satisfy

    [math]\displaystyle{ \frac{1}{\Lambda}\sum_{i=1}^{\Lambda} w_i^2 = C }[/math]

    for some [math]\displaystyle{ C \gt 0 }[/math] where [math]\displaystyle{ \Lambda }[/math] is the number of parameters.

  • These assumptions allow the network output to be modeled as a spherical spin glass model, a physical model of interacting magnetic dipoles (spins) in a disordered magnetic material, where each dipole's magnetization state can be treated as a binary random variable.

  • Using these assumptions, the work by Auffinger et al. (2010) in the field of random matrices and spin glasses is then used to relate the energy states of the system Hamiltonians of spin glass models to the critical points of the neural network loss function.

  • The analysis shows that the critical points of the loss function correspond to different energy bands of the spin glass model; as in a physical system, higher-energy states are less probable: although the number of such states can be very large, the probability of the system occupying them vanishes.

  • The energy barrier [math]\displaystyle{ E_{\infty} }[/math] stems from this analysis, and is given by

    [math]\displaystyle{ E_{\infty} = E_{\infty}(H) = 2\sqrt{\frac{H-1}{H}}. }[/math] Auffinger et al. show that critical values of the loss function must correspond to energies below [math]\displaystyle{ -\Lambda E_{\infty} }[/math] if their critical band index (i.e. energy index) is finite.
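
As a quick numerical illustration (a minimal sketch; the depths, [math]\displaystyle{ \Lambda }[/math], and [math]\displaystyle{ C }[/math] below are arbitrary choices, not values from the paper), the snippet below evaluates [math]\displaystyle{ E_{\infty}(H) }[/math] for a few depths and shows one way to enforce the spherical constraint by rescaling a weight vector.

<pre>
import math
import numpy as np

# E_infinity(H) = 2 * sqrt((H - 1) / H); the listed depths are arbitrary examples.
for H in (2, 3, 5, 10, 100):
    print(f"H = {H:3d}  E_inf = {2.0 * math.sqrt((H - 1) / H):.4f}")
# E_inf approaches 2 as H grows.

# Enforcing the spherical constraint (1/Lambda) * sum(w_i^2) = C by rescaling
# an arbitrary weight vector; Lambda and C here are illustrative.
Lambda, C = 1000, 1.0
w = np.random.default_rng(0).standard_normal(Lambda)
w *= math.sqrt(C * Lambda / np.sum(w**2))
print(np.isclose(np.mean(w**2), C))   # True
</pre>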

Experiments

The numerical experiments were designed to verify the theoretical claims about the distribution of critical points relative to the energy bound [math]\displaystyle{ E_{\infty} }[/math], and to measure the correlation between training and test loss for models with different numbers of parameters [math]\displaystyle{ (\Lambda) }[/math].

MNIST Experiments

ReLU neural networks with a single hidden layer and increasing [math]\displaystyle{ \Lambda \in \{25,50,100,250,500 \} }[/math] were trained for multiclass classification on a scaled-down version of the MNIST digit dataset, where each image was downsampled to [math]\displaystyle{ 10 \times 10 }[/math] pixels. For each value of [math]\displaystyle{ \Lambda }[/math], 200 epochs of SGD with a decaying learning rate were used to optimize the parameters of the network. The optimization experiments were repeated 1000 times with different initial values for the weight parameters, drawn uniformly at random from [math]\displaystyle{ [-1,1] }[/math].
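
A hedged sketch of this experimental protocol is given below. It substitutes synthetic [math]\displaystyle{ 10 \times 10 }[/math] inputs for the downsampled MNIST data, and the learning-rate schedule and other hyperparameters are illustrative assumptions rather than the paper's exact settings.

<pre>
import numpy as np

# Sketch of the protocol described above, using synthetic 10x10 "images" in
# place of downsampled MNIST (data loading is omitted to keep the sketch
# self-contained). Schedule constants and sizes are illustrative assumptions.
rng = np.random.default_rng(0)
d, n_classes = 100, 10                       # 10x10 inputs, 10 digit classes
X = rng.standard_normal((1000, d))           # placeholder for downsampled MNIST
y = rng.integers(0, n_classes, size=1000)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def run_experiment(n_hidden, epochs=200, lr0=0.1, seed=0):
    """Train a single-hidden-layer ReLU net with plain SGD and a decaying learning rate."""
    r = np.random.default_rng(seed)
    W1 = r.uniform(-1, 1, size=(d, n_hidden))         # init uniform in [-1, 1]
    W2 = r.uniform(-1, 1, size=(n_hidden, n_classes))
    for epoch in range(epochs):
        lr = lr0 / (1.0 + 0.01 * epoch)               # assumed decay schedule
        for i in r.permutation(len(X)):
            x, t = X[i:i+1], y[i]
            h = np.maximum(0.0, x @ W1)               # ReLU hidden layer
            p = softmax(h @ W2)
            g = p.copy(); g[0, t] -= 1.0              # gradient of cross-entropy
            gW2 = h.T @ g
            gh = (g @ W2.T) * (h > 0)
            gW1 = x.T @ gh
            W1 -= lr * gW1; W2 -= lr * gW2
    h = np.maximum(0.0, X @ W1)
    return -np.log(softmax(h @ W2)[np.arange(len(X)), y]).mean()

# Repeat with different random initializations, as in the experiments above
# (a short run with few epochs, just to keep the demo quick).
losses = [run_experiment(n_hidden=25, epochs=5, seed=s) for s in range(3)]
print(losses)
</pre>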

Results

To evaluate the distribution of the energy states corresponding to each critical point (i.e. solution) of the loss function, the eigenvalues of the Hessian of the loss function (the Hamiltonian of the corresponding spin glass system) were computed at the parameter values obtained after the optimization procedure completed. The distribution of the (normalized) index of the energy states is shown below in Fig 1. It can be seen that for all models, regardless of the number of parameters, the occupied energy states lie in the low-energy bands.

File:index dist.png
Fig 1. Distribution of normalized indices of energy states as computed from the system Hamiltonian at the final values of the parameters after the SGD optimization procedure completed.
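
A minimal sketch of how such a normalized index can be computed is shown below: the index of a critical point is the fraction of negative eigenvalues of the Hessian there. The finite-difference Hessian and the toy saddle-point loss are assumptions for illustration only, not the procedure or model used in the experiments.

<pre>
import numpy as np

# Hedged sketch: the normalized index of a critical point is the fraction of
# negative Hessian eigenvalues there. The toy loss and finite-difference
# Hessian below are illustrative assumptions, not the paper's procedure.
def numerical_hessian(loss, w, eps=1e-4):
    n = len(w)
    H = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            w_pp = w.copy(); w_pp[a] += eps; w_pp[b] += eps
            w_pm = w.copy(); w_pm[a] += eps; w_pm[b] -= eps
            w_mp = w.copy(); w_mp[a] -= eps; w_mp[b] += eps
            w_mm = w.copy(); w_mm[a] -= eps; w_mm[b] -= eps
            H[a, b] = (loss(w_pp) - loss(w_pm) - loss(w_mp) + loss(w_mm)) / (4 * eps**2)
    return H

def normalized_index(loss, w_star):
    eigvals = np.linalg.eigvalsh(numerical_hessian(loss, w_star))
    return np.mean(eigvals < 0)   # fraction of negative-curvature directions

# Toy example: a saddle of f(w) = w0^2 - w1^2 + w2^2 at the origin.
toy_loss = lambda w: w[0]**2 - w[1]**2 + w[2]**2
print(normalized_index(toy_loss, np.zeros(3)))   # 1/3 of the eigenvalues are negative
</pre>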

The final values of the loss function in these experiments are also shown in the histograms in Fig 2. Interestingly, the variance of the loss decreases with an increasing number of parameters, despite the fact that the spread of the energy states (Fig. 1) increases. This shows that although local minima are more prevalent for models with many parameters, there is no appreciable difference in the loss function at these minima: the minima are essentially all equally good in terms of minimizing the objective cost.

File:loss distribution.png
Fig 2. Empirical distribution of the values of the loss function over the course of 1000 experiments with different numbers of parameters ([math]\displaystyle{ \Lambda }[/math]). Each experimental run used a different random initialization of the parameter weights.

Finally, a scatter plot of the training vs testing error for each model is shown in Fig. 3. It can be seen that the correlation between the two errors decreases as the number of parameters increases, suggesting that obtaining a global minimum would not necessarily produce better testing results (and hence still would have a sizeable generalization error).

File:train test corr.png
Fig 3. Scatter plots showing the correlation between training and test error for the MNIST dataset experiments. For networks with few parameters, there is a very strong correlation between the two errors. However, for networks with many more parameters, the correlation weakens, suggesting that obtaining the optimal loss (critical point) in the training phase does not improve the generalization error.
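
A short sketch of how such a train/test correlation can be measured (Pearson correlation across runs); the loss arrays below are placeholders, not the paper's data.

<pre>
import numpy as np

# Placeholder arrays standing in for per-run final train/test losses
# (one entry per random restart); not the paper's actual data.
rng = np.random.default_rng(2)
train_loss = rng.uniform(0.05, 0.10, size=1000)
test_loss = train_loss + rng.normal(scale=0.02, size=1000)

# Pearson correlation between train and test loss across the 1000 runs;
# the experiments above report this correlation weakening as Lambda grows.
r = np.corrcoef(train_loss, test_loss)[0, 1]
print(f"train/test correlation: {r:.3f}")
</pre>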


Discussion

Power of Deep Neural Nets from the No Free Lunch Point of View

A more speculative explanation of why deep neural networks have a lower probability of bad local minima comes from Woodward's <ref>Woodward, John R. "GA or GP? that is not the question." Evolutionary Computation, 2003. CEC'03. The 2003 Congress on. Vol. 2. IEEE, 2003.</ref> paper on why the No Free Lunch Theorem (NFLT) does not always hold. The NFLT states that if one cannot incorporate domain-specific knowledge into a search or optimization algorithm, then that algorithm cannot be guaranteed to outperform (in terms of convergence speed) any other search/optimization algorithm; this implies that there can be no universal search algorithm that is the best for all problems.

Woodward's argument is that whether you use a Genetic Algorithm or Genetic Programming does not matter; what matters is the mapping from candidate representations to solutions. Consider the task of Symbolic Regression with two algorithms [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], and let [math]\displaystyle{ P_{s} = \{+ba, a, b, +aa, +bb, +ab\} }[/math] and [math]\displaystyle{ Q_{s} = \{+ab, +ba, a, b, +aa, +bb\} }[/math] be the time-ordered sequences of solutions explored by [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math]. If the problem requires the solution [math]\displaystyle{ +ab }[/math], then both [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] discover it on their first try, since [math]\displaystyle{ +ba }[/math] and [math]\displaystyle{ +ab }[/math] represent the same function (addition is commutative). For any other target solution, however, [math]\displaystyle{ P }[/math] will always outperform [math]\displaystyle{ Q }[/math], because [math]\displaystyle{ Q }[/math] wastes its second evaluation on a function it has already tried.
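
This counting argument can be made concrete with a small sketch; the canonicalization step (treating +ab and +ba as the same function) is the only assumption, and it follows from the commutativity of addition.

<pre>
# Toy illustration of the P vs Q example above: +ab and +ba denote the same
# function a+b, so they are collapsed to one canonical form before searching.
P = ["+ba", "a", "b", "+aa", "+bb", "+ab"]
Q = ["+ab", "+ba", "a", "b", "+aa", "+bb"]

def canonical(expr):
    # Treat +xy and +yx as the same function (addition is commutative).
    return "+" + "".join(sorted(expr[1:])) if expr.startswith("+") else expr

def steps_to_find(order, target):
    seen = [canonical(e) for e in order]
    return seen.index(canonical(target)) + 1

for target in ["+ab", "a", "b", "+aa", "+bb"]:
    print(target, "P:", steps_to_find(P, target), "Q:", steps_to_find(Q, target))
# P matches Q on +ab (both find it on the first try) and beats Q on every other target.
</pre>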

From the above, we might conclude that for deep neural networks, the larger or deeper the network, the more likely its connections will be able to generate a function that reduces the loss faster than a smaller network could (since all networks were trained for 200 epochs; in theory a single-layer MLP can approximate any function, but in practice that could take an impractically long time), thus reducing the chance of bad local minima (similar to how a more complex function has a better chance of fitting the data than a simpler one).