stat946w18/Self Normalizing Neural Networks

From statwiki
Revision as of 23:27, 1 March 2018 by X249wang (talk | contribs)

Introduction and Motivation

While neural networks have been making a lot of headway in improving benchmark results and narrowing the gap with human-level performance, their success has been largely limited to visual and sequential processing tasks, driven by advances in convolutional and recurrent network structures. Most data science competitions outside of those domains are still being won by algorithms such as gradient boosting and random forests. Traditional (densely connected) feed-forward neural networks (FNNs) are rarely used competitively, and on the rare occasions when they do win, the winning models are very shallow networks with at most four layers.

The authors, Klambauer et al., believe that what prevents FNNs from becoming more useful is the inability to train deeper FNN structures, which would allow the network to learn more levels of abstract representations. To train a deeper network, oscillations in the distribution of activations need to be kept under control so that stable gradients can be obtained during training. Several techniques are available to normalize activations, including batch normalization, layer normalization and weight normalization. These methods work well with CNNs and RNNs, but less well with FNNs, because backpropagating through normalization parameters introduces additional variance to the gradients, and regularization techniques like dropout further perturb the normalization effect. CNNs and RNNs are less sensitive to such perturbations, presumably due to their weight-sharing architecture, but FNNs do not have this property, and thus suffer from high variance in training errors, which hinders learning. Furthermore, the aforementioned normalization techniques involve adding external layers to the model and can slow down computations.

Therefore, the authors were motivated to develop a new FNN implementation that achieves the intended effect of normalization techniques and that works well with stochastic gradient descent and dropout. Self-normalizing neural networks are based on scaled exponential linear units (SELUs), a new activation function introduced in this paper, under which the mean and variance of the activations are proved to converge to a fixed point, thus making it possible to train deeper networks.

Notations

As the paper (primarily in the supplementary materials) comes with lengthy proofs, important notations are listed first.

Consider two consecutive fully-connected layers. Let [math]\displaystyle{ x }[/math] denote the inputs to the second layer; then [math]\displaystyle{ z = Wx }[/math] represents the network inputs of the second layer, and [math]\displaystyle{ y = f(z) }[/math] represents the activations in the second layer.

Assume that all [math]\displaystyle{ x_i }[/math]'s, [math]\displaystyle{ 1 \leqslant i \leqslant n }[/math], have mean [math]\displaystyle{ \mu := \mathrm{E}(x_i) }[/math] and variance [math]\displaystyle{ \nu := \mathrm{Var}(x_i) }[/math], and that each [math]\displaystyle{ y }[/math] has mean [math]\displaystyle{ \widetilde{\mu} := \mathrm{E}(y) }[/math] and variance [math]\displaystyle{ \widetilde{\nu} := \mathrm{Var}(y) }[/math]. Let [math]\displaystyle{ g }[/math] be the function that maps [math]\displaystyle{ (\mu, \nu) }[/math] to [math]\displaystyle{ (\widetilde{\mu}, \widetilde{\nu}) }[/math].

For the weight vector [math]\displaystyle{ w }[/math], [math]\displaystyle{ n }[/math] times the mean of the weight vector is [math]\displaystyle{ \omega := \sum_{i = 1}^n w_i }[/math] and [math]\displaystyle{ n }[/math] times the second moment is [math]\displaystyle{ \tau := \sum_{i = 1}^{n} w_i^2 }[/math].

Key Concepts

Self-Normalizing Neural-Net (SNN)

A neural network is self-normalizing if it possesses a mapping [math]\displaystyle{ g: \Omega \rightarrow \Omega }[/math] for each activation [math]\displaystyle{ y }[/math] that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on [math]\displaystyle{ (\omega, \tau) }[/math] in [math]\displaystyle{ \Omega }[/math]. Furthermore, the mean and variance remain in the domain [math]\displaystyle{ \Omega }[/math], that is [math]\displaystyle{ g(\Omega) \subseteq \Omega }[/math], where [math]\displaystyle{ \Omega = \{ (\mu, \nu) | \mu \in [\mu_{min}, \mu_{max}], \nu \in [\nu_{min}, \nu_{max}] \} }[/math]. When iteratively applying the mapping [math]\displaystyle{ g }[/math], each point within [math]\displaystyle{ \Omega }[/math] converges to this fixed point.

In other words, in SNNs, if the inputs from an earlier layer ([math]\displaystyle{ x }[/math]) already have their mean and variance within the predefined intervals that make up [math]\displaystyle{ \Omega }[/math], then the mean and variance of the activations passed to the next layer ([math]\displaystyle{ y = f(z = Wx) }[/math]) should remain within those intervals. This holds across all pairs of connected layers as the normalizing effect is propagated through the network, hence the term self-normalizing. When the mapping is applied iteratively, it should draw the mean and variance values closer to a fixed point within [math]\displaystyle{ \Omega }[/math], the value of which depends on [math]\displaystyle{ \omega }[/math] and [math]\displaystyle{ \tau }[/math] (recall that they are computed from the weight vector).

The activation function that makes an SNN possible should meet the following four conditions:

1. It can take on both negative and positive values, so it can normalize the mean;

2. It has a saturation region, so it can dampen variances that are too large;

3. It has a slope larger than one, so it can increase variances that are too small; and

4. It is a continuous curve, which is necessary for the fixed point to exist (see the definition of Banach fixed point theorem to follow).

Commonly used activation functions such as rectified linear units (ReLU), sigmoid, tanh, leaky ReLUs and exponential linear units (ELUs) do not meet all four criteria. Therefore, a new activation function is needed.

Scaled Exponential Linear Units (SELUs)

One of the main ideas introduced in this paper is the SELU function. As the name suggests, it is closely related to ELU,

\[ \mathrm{elu}(x) = \begin{cases} x & x > 0 \\ \alpha e^x - \alpha & x \leqslant 0 \end{cases} \]

but further builds upon it by introducing a new scale parameter $\lambda$ and proving the exact values that $\alpha$ and $\lambda$ should take on to achieve self-normalization. SELU is defined as:

\[ \mathrm{selu}(x) = \lambda \begin{cases} x & x > 0 \\ \alpha e^x - \alpha & x \leqslant 0 \end{cases} \]

SELUs meet all four criteria listed above: they take on positive values when [math]\displaystyle{ x \gt 0 }[/math] and negative values when [math]\displaystyle{ x \lt 0 }[/math], they have a saturation region when [math]\displaystyle{ x }[/math] is a large negative value, the value of [math]\displaystyle{ \lambda }[/math] can be set to greater than one to ensure a slope greater than one for positive inputs, and they are continuous at [math]\displaystyle{ x = 0 }[/math].
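The definition can be written down directly in code. The following is a minimal sketch for scalar inputs, not the authors' implementation; the names `selu`, `LAMBDA_01` and `ALPHA_01` are chosen here for illustration, and the constants are the rounded values quoted later in this article.

```python
import math

# Rounded constants quoted later in this article (four decimals).
LAMBDA_01 = 1.0507
ALPHA_01 = 1.6733

def selu(x, lam=LAMBDA_01, alpha=ALPHA_01):
    """Scaled exponential linear unit for a single float x."""
    if x > 0:
        # Linear branch: slope lam > 1 can increase small variances.
        return lam * x
    # Exponential branch: saturates at -lam * alpha for very negative x.
    return lam * (alpha * math.exp(x) - alpha)

for x in (-50.0, -1.0, 0.0, 1.0, 2.0):
    print(x, selu(x))
```

For very negative inputs the output approaches the saturation value [math]\displaystyle{ -\lambda\alpha }[/math], which is the variance-dampening region referred to in the second criterion.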

Figure 1 below gives an intuition for how SELUs normalize activations across layers. As shown, a variance dampening effect occurs when inputs are negative and far away from zero, and a variance increasing effect occurs when inputs are close to zero.

File:snn 946 f1.png

Figure 2 below plots the progression of training error on the MNIST and CIFAR10 datasets when training with SNNs versus FNNs with batch normalization at varying model depths. As shown, FNNs that adopted the SELU activation function exhibited lower and less variable training loss compared to using batch normalization, even as the depth increased to 16 and 32 layers.

File:snn 946 f2.png

Banach Fixed Point Theorem and Contraction Mappings

The underlying theory behind SNNs is the Banach fixed point theorem, which states the following: Let [math]\displaystyle{ (X, d) }[/math] be a non-empty complete metric space with a contraction mapping [math]\displaystyle{ f: X \rightarrow X }[/math]. Then [math]\displaystyle{ f }[/math] has a unique fixed point [math]\displaystyle{ x_f \in X }[/math] with [math]\displaystyle{ f(x_f) = x_f }[/math]. Every sequence [math]\displaystyle{ x_n = f(x_{n-1}) }[/math] with starting element [math]\displaystyle{ x_0 \in X }[/math] converges to the fixed point: [math]\displaystyle{ x_n \underset{n \rightarrow \infty}\rightarrow x_f }[/math].

A contraction mapping is a function [math]\displaystyle{ f: X \rightarrow X }[/math] on a metric space [math]\displaystyle{ X }[/math] with distance [math]\displaystyle{ d }[/math], such that for all points [math]\displaystyle{ \mathbf{u} }[/math] and [math]\displaystyle{ \mathbf{v} }[/math] in [math]\displaystyle{ X }[/math]: [math]\displaystyle{ d(f(\mathbf{u}), f(\mathbf{v})) \leqslant \delta d(\mathbf{u}, \mathbf{v}) }[/math], for some [math]\displaystyle{ 0 \leqslant \delta \lt 1 }[/math]. The strict inequality [math]\displaystyle{ \delta \lt 1 }[/math] matters: with [math]\displaystyle{ \delta = 1 }[/math] distances need not shrink, and a fixed point is no longer guaranteed.

The easiest way to prove that a mapping is a contraction is usually to show that the spectral norm of its Jacobian is less than 1 [REFERENCE], as was done in this paper.
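To illustrate the theorem, the sketch below iterates a simple contraction on the real line. The function [math]\displaystyle{ f(x) = 0.5x + 1 }[/math] is a hypothetical example chosen here (it is not from the paper); it satisfies the contraction condition with [math]\displaystyle{ \delta = 0.5 }[/math], and its unique fixed point is [math]\displaystyle{ x_f = 2 }[/math].

```python
def iterate(f, x0, steps):
    """Apply f repeatedly: x_n = f(x_{n-1}), starting from x0."""
    x = x0
    for _ in range(steps):
        x = f(x)
    return x

def f(x):
    # |f(u) - f(v)| = 0.5 * |u - v|, so f is a contraction with delta = 0.5.
    return 0.5 * x + 1.0

# Two very different starting points end up at the same fixed point.
from_below = iterate(f, -100.0, 60)
from_above = iterate(f, 100.0, 60)
print(from_below, from_above)
```

Each step halves the distance to the fixed point, so the starting value is forgotten geometrically fast. This is the same mechanism that the mean and variance mapping [math]\displaystyle{ g }[/math] is meant to exhibit on [math]\displaystyle{ \Omega }[/math].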

Proving the Self-Normalizing Property

Mean and Variance Mapping Function

[math]\displaystyle{ g }[/math] is derived under the assumption that the [math]\displaystyle{ x_i }[/math]'s are independent and share the same mean [math]\displaystyle{ \mu }[/math] and variance [math]\displaystyle{ \nu }[/math], although they need not be identically distributed [1]. Under this assumption (and recalling the earlier notation of [math]\displaystyle{ \omega }[/math] and [math]\displaystyle{ \tau }[/math]),

\begin{align} \mathrm{E}(z = \mathbf{w}^T \mathbf{x}) = \sum_{i = 1}^n w_i \mathrm{E}(x_i) = \mu \omega \end{align}

\begin{align} \mathrm{Var}(z) = \mathrm{Var}(\sum_{i = 1}^n w_i x_i) = \sum_{i = 1}^n w_i^2 \mathrm{Var}(x_i) = \nu \sum_{i = 1}^n w_i^2 = \nu\tau \textrm{ .} \end{align}

When the weight terms are normalized, [math]\displaystyle{ z }[/math] can be viewed as a weighted sum of [math]\displaystyle{ x_i }[/math]'s. Wide neural net layers with a large number of nodes are common, so [math]\displaystyle{ n }[/math] is usually large, and by the Central Limit Theorem, [math]\displaystyle{ z }[/math] approaches a normal distribution [math]\displaystyle{ \mathcal{N}(\mu\omega, \sqrt{\nu\tau}) }[/math] with mean [math]\displaystyle{ \mu\omega }[/math] and standard deviation [math]\displaystyle{ \sqrt{\nu\tau} }[/math].
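The two moment identities for [math]\displaystyle{ z }[/math] can be checked by simulation. The sketch below is only an illustration under assumptions made here: the inputs are drawn from a uniform distribution (to stress that normality of the [math]\displaystyle{ x_i }[/math]'s is not required), and the values of `n`, `mu`, `nu` and the random seed are arbitrary.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

n = 20                 # number of inputs to the neuron (arbitrary)
mu, nu = 0.5, 2.0      # assumed common mean and variance of each x_i
# A uniform distribution on [mu - h, mu + h] has variance h^2 / 3.
h = math.sqrt(3.0 * nu)

# One arbitrary weight vector; omega and tau as in the Notations section.
w = [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]
omega = sum(w)
tau = sum(wi * wi for wi in w)

def sample_z():
    """Draw independent x_i's and return z = w^T x."""
    return sum(wi * rng.uniform(mu - h, mu + h) for wi in w)

zs = [sample_z() for _ in range(20000)]
empirical_mean = sum(zs) / len(zs)
empirical_var = sum((z - empirical_mean) ** 2 for z in zs) / len(zs)

print("mean:", empirical_mean, "predicted mu * omega:", mu * omega)
print("var: ", empirical_var, "predicted nu * tau:  ", nu * tau)
```

The empirical values should agree with [math]\displaystyle{ \mu\omega }[/math] and [math]\displaystyle{ \nu\tau }[/math] up to sampling noise, even though the weights here are not normalized.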

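Using the above property, the exact form for [math]\displaystyle{ g }[/math] can be obtained using the definitions for mean and variance of continuous random variables, where [math]\displaystyle{ p_N(z; \mu\omega, \sqrt{\nu\tau}) }[/math] denotes the density of [math]\displaystyle{ \mathcal{N}(\mu\omega, \sqrt{\nu\tau}) }[/math]:

\begin{align} \widetilde{\mu}(\mu, \omega, \nu, \tau) = \int_{-\infty}^{\infty} \mathrm{selu}(z) \, p_N(z; \mu\omega, \sqrt{\nu\tau}) \, dz \end{align}

\begin{align} \widetilde{\nu}(\mu, \omega, \nu, \tau) = \int_{-\infty}^{\infty} \mathrm{selu}(z)^2 \, p_N(z; \mu\omega, \sqrt{\nu\tau}) \, dz - \widetilde{\mu}^2 \end{align}

Analytical solutions for the integrals can be obtained as follows:

\begin{align} \widetilde{\mu} = \frac{1}{2} \lambda \left( (\mu\omega) \, \mathrm{erf}\left( \frac{\mu\omega}{\sqrt{2}\sqrt{\nu\tau}} \right) + \alpha e^{\mu\omega + \frac{\nu\tau}{2}} \, \mathrm{erfc}\left( \frac{\mu\omega + \nu\tau}{\sqrt{2}\sqrt{\nu\tau}} \right) - \alpha \, \mathrm{erfc}\left( \frac{\mu\omega}{\sqrt{2}\sqrt{\nu\tau}} \right) + \sqrt{\frac{2}{\pi}} \sqrt{\nu\tau} \, e^{-\frac{(\mu\omega)^2}{2\nu\tau}} + \mu\omega \right) \end{align}

\begin{align} \widetilde{\nu} = \frac{1}{2} \lambda^2 \left( \left( (\mu\omega)^2 + \nu\tau \right) \left( 2 - \mathrm{erfc}\left( \frac{\mu\omega}{\sqrt{2}\sqrt{\nu\tau}} \right) \right) + \alpha^2 \left( -2 e^{\mu\omega + \frac{\nu\tau}{2}} \, \mathrm{erfc}\left( \frac{\mu\omega + \nu\tau}{\sqrt{2}\sqrt{\nu\tau}} \right) + e^{2(\mu\omega + \nu\tau)} \, \mathrm{erfc}\left( \frac{\mu\omega + 2\nu\tau}{\sqrt{2}\sqrt{\nu\tau}} \right) + \mathrm{erfc}\left( \frac{\mu\omega}{\sqrt{2}\sqrt{\nu\tau}} \right) \right) + \sqrt{\frac{2}{\pi}} (\mu\omega) \sqrt{\nu\tau} \, e^{-\frac{(\mu\omega)^2}{2\nu\tau}} \right) - \widetilde{\mu}^2 \end{align}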
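The mapping [math]\displaystyle{ g }[/math] can also be evaluated numerically. The sketch below computes the mean and variance of [math]\displaystyle{ \mathrm{selu}(z) }[/math] for normally distributed [math]\displaystyle{ z }[/math], using Gaussian integrals expressed with the error function. The function name `selu_moment_map` is an assumption made here, the default constants are the rounded values quoted in this article, and the code assumes [math]\displaystyle{ \nu\tau \gt 0 }[/math]. It then applies the map repeatedly with normalized weights to show the pull toward [math]\displaystyle{ (0, 1) }[/math].

```python
import math

def selu_moment_map(mu, nu, omega, tau, lam=1.0507, alpha=1.6733):
    """Return (mean, variance) of selu(z) for z ~ Normal(mu*omega, nu*tau)."""
    m = mu * omega                 # mean of z
    v = nu * tau                   # variance of z, assumed > 0
    s = math.sqrt(v)
    r = math.sqrt(2.0) * s         # common denominator inside erf / erfc
    gauss = math.sqrt(2.0 / math.pi) * s * math.exp(-m * m / (2.0 * v))

    mean = 0.5 * lam * (
        m * math.erf(m / r)
        + alpha * math.exp(m + v / 2.0) * math.erfc((m + v) / r)
        - alpha * math.erfc(m / r)
        + gauss
        + m
    )
    second_moment = 0.5 * lam ** 2 * (
        (m * m + v) * (2.0 - math.erfc(m / r))
        + alpha ** 2 * (
            -2.0 * math.exp(m + v / 2.0) * math.erfc((m + v) / r)
            + math.exp(2.0 * (m + v)) * math.erfc((m + 2.0 * v) / r)
            + math.erfc(m / r)
        )
        + m * gauss
    )
    return mean, second_moment - mean ** 2

# One application at the point (0, 1) with normalized weights.
mean_at_fp, var_at_fp = selu_moment_map(0.0, 1.0, omega=0.0, tau=1.0)
print("g(0, 1) =", (mean_at_fp, var_at_fp))

# Repeated application, starting from a variance that is too large.
mu_k, nu_k = 0.0, 1.5
for _ in range(60):
    mu_k, nu_k = selu_moment_map(mu_k, nu_k, omega=0.0, tau=1.0)
print("after 60 layers:", (mu_k, nu_k))
```

Because the constants are rounded to four decimals, the point [math]\displaystyle{ (0, 1) }[/math] is reproduced only approximately.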

The authors are interested in the fixed point [math]\displaystyle{ (\mu, \nu) = (0, 1) }[/math], as these are the parameters of the standard normal distribution. The authors also propose using normalized weights such that [math]\displaystyle{ \omega = \sum_{i = 1}^n w_i = 0 }[/math] and [math]\displaystyle{ \tau = \sum_{i = 1}^n w_i^2= 1 }[/math], as this gives simpler, cleaner expressions for [math]\displaystyle{ \widetilde{\mu} }[/math] and [math]\displaystyle{ \widetilde{\nu} }[/math] in the calculations in the next steps. This weight scheme can be achieved in expectation in several ways, for example, by drawing each weight from a normal distribution [math]\displaystyle{ \mathcal{N}(0, \frac{1}{n}) }[/math] with variance [math]\displaystyle{ \frac{1}{n} }[/math], or from a uniform distribution [math]\displaystyle{ U(-\sqrt{3/n}, \sqrt{3/n}) }[/math], which also has mean zero and variance [math]\displaystyle{ \frac{1}{n} }[/math].
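Random initialization meets these conditions only on average. The sketch below draws one weight vector with variance [math]\displaystyle{ \frac{1}{n} }[/math] and reports [math]\displaystyle{ \omega }[/math] and [math]\displaystyle{ \tau }[/math]. It then applies `normalize_weights`, a hypothetical helper written for this illustration (it is not described in the paper), which shifts and rescales the vector so that both conditions hold exactly.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 256                 # number of inputs to the neuron (arbitrary)

# Draw weights with mean 0 and variance 1/n (standard deviation 1/sqrt(n)).
w = [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]
omega = sum(w)                   # expected value 0, standard deviation 1
tau = sum(wi * wi for wi in w)   # expected value 1, concentrates as n grows

def normalize_weights(weights):
    """Hypothetical helper: enforce omega = 0 and tau = 1 exactly."""
    mean = sum(weights) / len(weights)
    centered = [wi - mean for wi in weights]
    norm = math.sqrt(sum(ci * ci for ci in centered))
    return [ci / norm for ci in centered]

w_hat = normalize_weights(w)
omega_hat = sum(w_hat)
tau_hat = sum(wi * wi for wi in w_hat)

print("random draw:", omega, tau)
print("normalized: ", omega_hat, tau_hat)
```

Note that for a single neuron [math]\displaystyle{ \omega }[/math] is a sum of [math]\displaystyle{ n }[/math] independent draws with variance [math]\displaystyle{ \frac{1}{n} }[/math], so its standard deviation is 1 regardless of [math]\displaystyle{ n }[/math]; only its expected value is zero.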

At [math]\displaystyle{ \widetilde{\mu} = \mu = 0 }[/math], [math]\displaystyle{ \widetilde{\nu} = \nu = 1 }[/math], [math]\displaystyle{ \omega = 0 }[/math] and [math]\displaystyle{ \tau = 1 }[/math], the constants [math]\displaystyle{ \lambda }[/math] and [math]\displaystyle{ \alpha }[/math] from the SELU function can be solved for: [math]\displaystyle{ \lambda_{01} \approx 1.0507 }[/math] and [math]\displaystyle{ \alpha_{01} \approx 1.6733 }[/math]. These values are used throughout the rest of the paper whenever an expression calls for [math]\displaystyle{ \lambda }[/math] and [math]\displaystyle{ \alpha }[/math].
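These two constants can be reproduced with a short calculation. With [math]\displaystyle{ \omega = 0 }[/math] and [math]\displaystyle{ \tau = 1 }[/math] the input [math]\displaystyle{ z }[/math] is standard normal, and the requirements [math]\displaystyle{ \widetilde{\mu} = 0 }[/math] and [math]\displaystyle{ \widetilde{\nu} = 1 }[/math] reduce to the two equations below, where [math]\displaystyle{ \Phi }[/math] is the standard normal cumulative distribution function. This reduction is worked out here from standard Gaussian integrals rather than quoted from the paper.

\begin{align} \frac{1}{\sqrt{2\pi}} + \alpha \left( e^{1/2} \Phi(-1) - \frac{1}{2} \right) = 0, \qquad \lambda^2 \left( \frac{1}{2} + \alpha^2 \left( e^{2} \Phi(-2) - 2 e^{1/2} \Phi(-1) + \frac{1}{2} \right) \right) = 1 \end{align}

The first equation gives [math]\displaystyle{ \alpha }[/math], and the second then gives [math]\displaystyle{ \lambda }[/math]. A minimal sketch, with variable names chosen for illustration:

```python
import math

def std_normal_cdf(t):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

# Gaussian integrals over the negative half-line, for z ~ Normal(0, 1).
a = math.exp(0.5) * std_normal_cdf(-1.0)   # E[exp(z) * 1{z <= 0}]
b = math.exp(2.0) * std_normal_cdf(-2.0)   # E[exp(2z) * 1{z <= 0}]

# Zero-mean condition, solved for alpha.
alpha_01 = (1.0 / math.sqrt(2.0 * math.pi)) / (0.5 - a)

# Unit-variance condition (the mean is zero, so variance = second moment).
lambda_01 = 1.0 / math.sqrt(0.5 + alpha_01 ** 2 * (b - 2.0 * a + 0.5))

print("alpha_01  =", alpha_01)
print("lambda_01 =", lambda_01)
```

The printed values agree with the rounded constants quoted above.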