# Introduction

The paper seeks to answer why deep neural networks perform so much better than shallow ones. It is not obvious that they should perform any better at all: Funahashi (1989) showed that a neural network with just one hidden layer is a universal function approximator (given sufficiently many neurons), so the class of functions a deep neural network can approximate cannot be larger. Furthermore, having many layers can even cause problems in practice due to vanishing gradients.

As both shallow and deep neural networks can approximate the same class of functions, another measure of comparison is needed. For this we have to consider what neural networks with piecewise linear activations actually do: they split the input space into linear regions. Deep neural networks produce more such regions than shallow networks with the same number of neurons, which allows them to compute more complex function approximations. Intuitively, after the first layer partitions the input space, each subsequent layer identifies pieces of its input with one another, so that the composition of layers replicates its computation across an exponential number of regions of the original input. This is caused by the deep hierarchy, which allows the network to apply the same computation across different regions of the input space.

# Shallow Neural Networks

First, an upper bound on the number of regions a shallow neural network produces is derived. This not only measures the approximation complexity achievable with a shallow neural network, but will also be used to obtain the number of regions of deep neural networks.

The hidden layer of a shallow neural network with $n_0$ inputs and $n_1$ hidden units essentially computes $\mathbf{x} \mapsto g(\mathbf{W}\mathbf{x} + \mathbf{b})$ with input $\mathbf{x}$, weight matrix $\mathbf{W}$, bias vector $\mathbf{b}$, and non-linearity $g$. If $g$ has its non-linearity (a kink or inflection) at 0, each unit behaves distinctly on either side of $\mathbf{W}\mathbf{x} + \mathbf{b} = 0$, which therefore acts as a decision boundary and represents a hyperplane.

Let us consider the arrangement of hyperplanes $H_i := \{\mathbf{x} \in \mathbb{R}^{n_0}: \mathbf{W}_{i,:}\mathbf{x} + \mathbf{b}_i = 0\}$, $i \in [n_1]$. These hyperplanes split the input space into several regions (formally defined as the connected components of $\mathbb{R}^{n_0} \setminus (\cup_i H_i)$).

With $n_1$ hyperplanes (in general position) there will be at most $\sum_{j=0}^{n_0} \binom{n_1}{j}$ regions.
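This bound is straightforward to evaluate; here is a small sketch (the function name is ours):

```python
from math import comb

def max_regions_hyperplanes(n_hyperplanes: int, dim: int) -> int:
    """Maximal number of regions that n_hyperplanes hyperplanes in
    general position cut R^dim into: sum_{j=0}^{dim} C(n_hyperplanes, j).
    (math.comb returns 0 for j > n_hyperplanes, so the sum is safe.)"""
    return sum(comb(n_hyperplanes, j) for j in range(dim + 1))

# A shallow network with n0 = 2 inputs and n1 = 4 hidden units:
print(max_regions_hyperplanes(4, 2))  # 1 + 4 + 6 = 11
```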

# Deep Neural Networks

A hidden layer $l$ of a deep neural network computes a function $h_l$ which maps a set $S_{l-1} \subseteq \mathbb{R}^{n_{l-1}}$ to another set $S_{l} \subseteq \mathbb{R}^{n_l}$. In this mapping there might be subsets $\bar{R}_1, \dots, \bar{R}_k \subseteq S_{l-1}$ that get mapped onto the same subset $R \subseteq S_l$, i.e. $h_l(\bar{R}_1) = \cdots = h_l(\bar{R}_k) = R$. The set of all these subsets is denoted by $P_R^l$.

This allows us to define the number of separate input-space neighbourhoods mapped onto a common neighbourhood $R$: for each subset $R' \in P_R^l$ we add up the number of input-space neighbourhoods mapped onto $R'$, giving the recursive formula $\mathcal{N}_R^l = \sum_{R' \in P_R^l} \mathcal{N}_{R'}^{l-1}$ with base case $\mathcal{N}_R^0 = 1$ for each region $R \subseteq \mathbb{R}^{n_0}$ of the input space. Applying this formula to each distinct linear region computed by the last hidden layer, a set denoted by $P^L$, we get the maximal number of linear regions of the functions computed by an $L$-layer neural network with piecewise linear activations as $\mathcal{N} = \sum_{R \in P^L} \mathcal{N}_R^{L-1} \text{.}$
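The recursion is easy to mimic on a toy folding structure; the region names and the structure below are invented purely for illustration (two folding layers, each overlaying two neighbourhoods, followed by a last hidden layer that cuts the common image into three linear regions):

```python
def N(l, R, preimages):
    """N_R^l: number of input-space neighbourhoods mapped onto R after
    l layers. preimages[l][R] is P_R^l, the list of layer-(l-1) regions
    that h_l maps onto R; the base case is N_R^0 = 1."""
    if l == 0:
        return 1
    return sum(N(l - 1, Rp, preimages) for Rp in preimages[l][R])

# Toy structure: layer 2 folds "A-left" and "A-right" onto "B", and each
# of those already has two input-space preimages under layer 1.
preimages = {
    1: {"A-left": ["in1", "in2"], "A-right": ["in3", "in4"]},
    2: {"B": ["A-left", "A-right"]},
}

# The last hidden layer cuts the common image B into three linear
# regions P^L; each inherits all N_B^2 input-space neighbourhoods.
P_L = ["B1", "B2", "B3"]
total = sum(N(2, "B", preimages) for _ in P_L)
print(total)  # 3 * (2 + 2) = 12 linear regions in the input space
```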

# Space Folding

An intuition of the process of mapping input-space neighbourhoods to common neighbourhoods can be given in terms of space folding. Each such mapping can be seen as folding the input space so that the input-space neighbourhoods are overlayed. Thus, each hidden layer of a deep neural network can be associated with a folding operator and any function computed on the final folded space will be applied to all regions successively folded onto each other. Note that the foldings are encoded in the weight matrix $\mathbf{W}$, bias vector $\mathbf{b}$ and activation function $g$. This allows for foldings separate from the coordinate axes and non-length preserving foldings.
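As a minimal illustration of this intuition, a single absolute-value fold (computable by one rectifier layer, since $|x| = \max\{0, x\} + \max\{0, -x\}$) overlays $x$ and $-x$, so any subsequent computation is replicated on both half-lines:

```python
import math

# One "fold" overlays x and -x: abs(x) = relu(x) + relu(-x),
# so this fold is computable by a single rectifier layer.
relu = lambda z: max(0.0, z)
fold = lambda x: relu(x) + relu(-x)      # equals |x|

# Any function g applied after the fold is replicated on both half-lines.
g = lambda z: math.sin(3 * z) + 0.5 * z  # arbitrary follow-up computation

xs = [0.1 * k for k in range(1, 21)]
assert all(g(fold(x)) == g(fold(-x)) for x in xs)  # mirrored regions agree
```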

# Deep Rectifier Networks

To obtain a lower bound on the maximal number of linear regions computable by a deep rectifier network, a network is constructed in such a way that the number of linear regions mapped onto each other is maximized. The $n$ rectifier units of a layer are partitioned into $n_0$ subsets of cardinality $p = \lfloor n/n_0 \rfloor$ (ignoring remaining units), and each subset processes only one of the $n_0$ inputs. For subset $j$ we select the $j$-th input with a row vector $\mathbf{w}$ whose $j$-th entry is 1 and whose remaining entries are 0. The biases are chosen such that the breakpoints of the $p$ units lie at $0, 1, \dots, p-1$:

$$h_1(\mathbf{x}) = \max \{ 0, \mathbf{w}\mathbf{x} \}, \qquad h_i(\mathbf{x}) = \max \{ 0, 2\mathbf{w}\mathbf{x} - 2(i - 1) \}, \quad 1 \lt i \leq p$$

Next, these activations are summed with alternating signs; note that this calculation can be absorbed into the connection weights to the next layer:

$$\tilde{h}_j(\mathbf{x}) = h_1(\mathbf{x}) - h_2(\mathbf{x}) + h_3(\mathbf{x}) - \cdots + {(-1)}^{p-1} h_p(\mathbf{x})$$

This gives a zigzag function which folds the $p$ segments $[0, 1],\ [1, 2],\ \ldots,\ [p - 1, \infty)$ onto the interval $[0, 1]$, with $(-\infty, 0]$ collapsed onto $0$.
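A quick numeric check of one such fold (the bias choice below places the breakpoints at the integers $0, \dots, p-1$, so that unit-width segments fold onto $[0, 1]$):

```python
def fold(x: float, p: int) -> float:
    """Alternating sum of p rectifier units with breakpoints at
    0, 1, ..., p-1, folding the segments [0,1], [1,2], ... onto [0,1]."""
    h = [max(0.0, x)] + [max(0.0, 2 * x - 2 * (i - 1)) for i in range(2, p + 1)]
    return sum((-1) ** i * hi for i, hi in enumerate(h))  # h1 - h2 + h3 - ...

# Points x, 2 - x, 2 + x, ... lie in different unit segments but land
# on the same output value:
p = 4
print([round(fold(v, p), 3) for v in (0.3, 1.7, 2.3, 3.7)])  # [0.3, 0.3, 0.3, 0.3]
```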

Going from these $n_0$ functions for subsets of rectifiers to the full $n_0$ dimensional function $\tilde{h} = {[\tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_{n_0}]}^{\top}$ gives a total of $p^{n_0}$ hypercubes mapped onto the same output.
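Applying the scalar fold to each coordinate makes the $p^{n_0}$ count concrete; for $n_0 = 2$ and $p = 3$, the nine points below lie in nine different unit squares yet share a single image (the sample points are ours):

```python
from itertools import product

def fold(x, p):
    """Scalar zigzag fold with breakpoints at 0, 1, ..., p-1."""
    h = [max(0.0, x)] + [max(0.0, 2 * x - 2 * (i - 1)) for i in range(2, p + 1)]
    return sum((-1) ** i * hi for i, hi in enumerate(h))

p = 3
# Per coordinate, x, 2 - x and 2 + x fold to the same value, so the
# p^2 = 9 squares below collapse onto one output point.
xs, ys = [0.3, 1.7, 2.3], [0.6, 1.4, 2.6]
images = {(round(fold(x, p), 9), round(fold(y, p), 9))
          for x, y in product(xs, ys)}
print(images)  # {(0.3, 0.6)}: nine squares, one common image
```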

Counting the number of regions produced by the last hidden layer (via the shallow-network bound) and multiplying it by the number of hypercubes mapped onto that layer's input by the preceding layers, we get $\underbrace{\left( \prod_{i=1}^{L-1} {\left\lfloor \frac{n_i}{n_0} \right\rfloor}^{n_0} \right)}_{\text{mapped hypercubes}} \cdot \underbrace{\sum_{j=0}^{n_0} \binom{n_L}{j}}_{\text{last layer (shallow net)}}$ as a lower bound on the maximal number of linear regions of functions computed by a deep rectifier network with $n_0$ inputs and $L$ hidden layers. For layers of width $n$ this lower bound is $\Omega\!\left({\left(\frac{n}{n_0}\right)}^{(L-1)n_0} n^{n_0}\right)$, which makes clear that the number of regions grows exponentially with $L$, versus the polynomial scaling of a shallow model with $nL$ hidden units.
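Plugging in small numbers shows how quickly depth pays off; a sketch comparing the deep lower bound with the shallow upper bound for the same budget of hidden units (function names are ours):

```python
from math import comb

def deep_lower_bound(n0, widths):
    """Lower bound on the maximal number of linear regions of a deep
    rectifier net with n0 inputs and hidden-layer widths n_1, ..., n_L."""
    *hidden, n_last = widths
    cubes = 1
    for n in hidden:               # hypercubes folded by layers 1..L-1
        cubes *= (n // n0) ** n0
    return cubes * sum(comb(n_last, j) for j in range(n0 + 1))

def shallow_upper_bound(n0, n_hidden):
    """Upper bound for a single hidden layer (hyperplane-arrangement bound)."""
    return sum(comb(n_hidden, j) for j in range(n0 + 1))

# n0 = 2 inputs; three hidden layers of width 4 vs one layer of width 12:
print(deep_lower_bound(2, [4, 4, 4]))  # 2^2 * 2^2 * 11 = 176
print(shallow_upper_bound(2, 12))      # 1 + 12 + 66 = 79
```

Already with 12 hidden units total, the deep network's lower bound exceeds the shallow network's upper bound, and the gap widens exponentially with each added layer.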

In fact, it is possible to obtain asymptotic bounds on the number of linear regions per parameter in the neural network models:

• For a deep model, the asymptotic bound is exponential: $\Omega\left(\left(n/n_0\right)^{n_0(L-1)}\frac{n^{n_0-2}}{L}\right)$
• For a shallow model, the asymptotic bound is polynomial: $O(L^{n_0-1}n^{n_0-1})$

# Conclusion

The number of linear regions the input space can be split into grows exponentially with the number of layers of a deep neural network, whereas for a shallow network it grows only polynomially with the number of neurons. This helps explain why deep neural networks can perform so much better than shallow ones. The paper shows this result for deep rectifier networks and deep maxout networks, but the same analysis should be applicable to other types of deep neural networks.

Furthermore, the paper provides a useful intuition in terms of space folding to think about deep neural networks.