# Introduction

The paper seeks to answer the question of why deep neural networks perform so much better than shallow ones. It is not obvious that deep networks should perform any better. For example, Funahashi (1989) showed that a neural network with just one hidden layer is a universal function approximator (given sufficiently many neurons), so the class of functions a deep neural network can approximate cannot be larger. Furthermore, having many layers can in theory cause problems due to vanishing gradients.

As both shallow and deep neural networks can approximate the same class of functions, another method of comparison is needed. For this we have to consider what neural networks with piecewise linear activations actually do: they split the input space into regions on which the computed function is linear. Deep neural networks turn out to produce more of these linear regions than shallow ones with the same number of neurons, which allows them to compute a more complex approximation. Essentially, after the first layer partitions the input space into linear pieces, each subsequent layer recognizes pieces of its own input, so that the composition of layers identifies an exponentially large number of input regions. This is made possible by the deep hierarchy, which allows the network to apply the same computation across different regions of the input space.

# Shallow Neural Networks

First, an upper bound on the number of regions a shallow neural network can produce is derived. This not only gives a measure of the approximation complexity achievable with a shallow neural network, but will also be used to obtain the number of regions for deep neural networks.

The hidden layer of a shallow neural network with $n_0$ inputs and $n$ hidden units essentially computes $h = g(Wx + b)$ with input $x \in \mathbb{R}^{n_0}$, weight matrix $W \in \mathbb{R}^{n \times n_0}$, bias vector $b \in \mathbb{R}^{n}$, and non-linearity $g$. If the non-linearity $g$ has its breakpoint at 0 (as the rectifier $\max\{0, z\}$ does) or an inflection at 0 (as $\tanh$ does), this gives a distinguished behaviour for unit $i$ at $W_{i,:}x + b_i = 0$, which can act as a decision boundary and represents a hyperplane in input space.
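This layer computation can be sketched in a few lines of NumPy; the dimensions and random weights below are illustrative, not from the paper. Each rectifier unit's kink lies exactly on its hyperplane $W_{i,:}x + b_i = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)

n0, n = 2, 5                        # input dimension and number of hidden units
W = rng.standard_normal((n, n0))    # weight matrix
b = rng.standard_normal(n)          # bias vector

def hidden_layer(x):
    """Compute h = g(Wx + b) with the rectifier g(z) = max(0, z).
    Unit i switches behaviour on the hyperplane W[i] @ x + b[i] = 0."""
    return np.maximum(0.0, W @ x + b)

h = hidden_layer(np.array([0.3, -1.2]))
assert h.shape == (n,) and np.all(h >= 0)
```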

Let us consider the arrangement $\mathcal{A} = \{H_1, \dots, H_n\}$ of all those hyperplanes, where $H_i = \{x \in \mathbb{R}^{n_0} : W_{i,:}x + b_i = 0\}$. This arrangement splits the input space into several regions (formally defined as the connected components of $\mathbb{R}^{n_0} \setminus \bigcup_{i} H_i$).

With $n$ hyperplanes in general position there will be at most $\sum_{j=0}^{n_0} \binom{n}{j}$ regions, a classical result from the theory of hyperplane arrangements (Zaslavsky, 1975).
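This bound is easy to evaluate; a minimal sketch with the standard library (function name is mine):

```python
from math import comb

def max_regions_shallow(n_units, n_inputs):
    """Zaslavsky's bound: n hyperplanes in general position split R^{n0}
    into at most sum_{j=0}^{n0} C(n, j) regions."""
    return sum(comb(n_units, j) for j in range(n_inputs + 1))

# Two lines in the plane: at most 1 + 2 + 1 = 4 regions.
assert max_regions_shallow(2, 2) == 4
# Three lines in general position in the plane: 1 + 3 + 3 = 7 regions.
assert max_regions_shallow(3, 2) == 7
```

Note that once $n \gg n_0$, this sum grows only polynomially in $n$ (like $n^{n_0}$), which is the polynomial scaling the shallow model is stuck with.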

# Deep Neural Networks

A hidden layer of a deep neural network computes a function $h_l$ which maps a set $S$ to another set $T$. In this mapping there might be several subsets $\bar R_1, \dots, \bar R_k \subseteq S$ that get mapped onto the same subset $R \subseteq T$, i.e. $h_l(\bar R_1) = \dots = h_l(\bar R_k) = R$. The set of all these subsets is denoted with $P_l^R = \{\bar R_1, \dots, \bar R_k\}$.

This allows us to define $\mathcal{N}_l^R$, the number of separate input-space neighbourhoods mapped onto a common neighbourhood $R$ by the first $l$ layers. For each subset $R'$ that maps to $R$ we have to add up the number of subsets mapping to $R'$, giving the recursive formula $\mathcal{N}_l^R = \sum_{R' \in P_l^R} \mathcal{N}_{l-1}^{R'}$ with $\mathcal{N}_0^R = 1$ for each region $R$ in the input space. Applying this formula for each distinct linear region computed by the last hidden layer, a set denoted with $P^L$, we get the maximal number of linear regions of the functions computed by an $L$-layer neural network with piecewise linear activations as $N = \sum_{R \in P^L} \mathcal{N}_{L-1}^{R}$.
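In the special case where every region of layer $l$ receives the same number of input-space neighbourhoods, the recursion telescopes into a simple product. A hedged sketch of that special case (names are mine, not the paper's):

```python
def count_regions(folds_per_layer, last_layer_regions):
    """Illustration of the recursion N^R_l = sum_{R' in P_l^R} N^{R'}_{l-1}
    when each region of layer l receives exactly folds_per_layer[l-1]
    neighbourhoods: the nested sums collapse into a product."""
    total = 1
    for k in folds_per_layer:
        total *= k
    return total * last_layer_regions

# Two intermediate layers, each folding 4 neighbourhoods onto every one of
# its regions, and a last hidden layer carving its input into 7 regions:
assert count_regions([4, 4], 7) == 112
```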

# Space Folding

An intuition for the process of mapping input-space neighbourhoods to common neighbourhoods can be given in terms of space folding. Each such mapping can be seen as folding the input space so that the input-space neighbourhoods are overlaid. Thus, each hidden layer of a deep neural network can be associated with a folding operator, and any function computed on the final folded space is applied to all regions successively folded onto each other. Note that the foldings are encoded in the weight matrix $W_l$, bias vector $b_l$ and activation function $g_l$ of each layer. This allows for foldings that are not aligned with the coordinate axes as well as non-length-preserving foldings.

# Deep Rectifier Networks

To obtain a lower bound on the maximal number of linear regions computable by a deep rectifier network, a network is constructed in such a way that the number of linear regions mapped onto each other is maximized. Each of the $n$ units in a layer of rectifiers will only process one of the $n_0$ inputs. This gives a partition of the rectifier units into $n_0$ subsets, where each subset has cardinality $p = \lfloor n/n_0 \rfloor$ (ignoring remaining units). For each subset $j$ we select the $j$-th input with a row vector $w$ with the $j$-th entry 1 and the remaining entries 0. The bias values are included in these activation functions for the $p$ units: $h_1(x) = \max\{0, wx\}$ and $h_i(x) = \max\{0, 2wx - 2(i-1)\}$ for $i = 2, \dots, p$. Next, these activation functions are added with alternating signs, $\tilde h(x) = h_1(x) - h_2(x) + h_3(x) - \dots + (-1)^{p-1} h_p(x)$. Note that this calculation can be absorbed into the connection weights to the next layer. This gives us a function which folds $p$ segments onto the interval $[0, 1]$.
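The alternating sum of rectifiers can be checked numerically. A minimal sketch of this one-dimensional zigzag, assuming the weights and biases reconstructed above (the resulting triangle wave maps each unit segment $[i, i+1]$, $i = 0, \dots, p-1$, onto $[0,1]$):

```python
import numpy as np

def fold(x, p):
    """Alternating sum of p rectifier units: h_1 - h_2 + h_3 - ...
    Folds the p segments [0,1], [1,2], ..., [p-1,p] onto [0,1]."""
    h = [np.maximum(0.0, x)]                           # h_1
    h += [np.maximum(0.0, 2 * x - 2 * (i - 1))         # h_i, i >= 2
          for i in range(2, p + 1)]
    signs = [(-1) ** i for i in range(len(h))]         # +1, -1, +1, ...
    return sum(s * hi for s, hi in zip(signs, h))

# All p segments are overlaid: the point i + t maps to either t or 1 - t.
p = 3
for t in [0.0, 0.25, 0.5, 1.0]:
    vals = [fold(i + t, p) for i in range(p)]
    assert all(abs(v - vals[0]) < 1e-9 or abs(v - (1 - vals[0])) < 1e-9
               for v in vals)
```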

Going from these functions for the $n_0$ subsets of rectifiers to the full $n_0$-dimensional function $\tilde h(x) = (\tilde h_1(x), \dots, \tilde h_{n_0}(x))$ gives a total of $p^{n_0}$ hypercubes mapped onto the same output $[0, 1]^{n_0}$.

Counting the number of separate regions produced by the last layer and multiplying this with the number of regions mapping onto this layer, we get $\left(\prod_{i=1}^{L-1} \left\lfloor \frac{n_i}{n_0} \right\rfloor^{n_0}\right) \sum_{j=0}^{n_0} \binom{n_L}{j}$ as the lower bound on the maximal number of linear regions of functions computed by a deep rectifier network with $n_0$ inputs and $L$ hidden layers of widths $n_1, \dots, n_L$. For layers of equal width $n$, we can also denote this lower bound with $\Omega\left(\left(n/n_0\right)^{(L-1)n_0} n^{n_0}\right)$, which makes it clear that this number grows exponentially with $L$, versus the polynomial scaling $O\left(L^{n_0} n^{n_0}\right)$ of a shallow model with $Ln$ hidden units.
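The gap already shows up at small sizes. A hedged sketch comparing the deep lower bound against the shallow upper bound for the same total unit budget (function names are mine; the formulas are the ones reconstructed above):

```python
from math import comb, floor

def regions_deep_lower_bound(widths, n0):
    """Lower bound: prod_{i<L} floor(n_i/n0)^{n0} * sum_{j<=n0} C(n_L, j).
    `widths` lists the hidden-layer widths n_1, ..., n_L."""
    prod = 1
    for n in widths[:-1]:
        prod *= floor(n / n0) ** n0
    return prod * sum(comb(widths[-1], j) for j in range(n0 + 1))

def regions_shallow_upper_bound(n_units, n0):
    """Upper bound for one hidden layer: sum_{j<=n0} C(n, j)."""
    return sum(comb(n_units, j) for j in range(n0 + 1))

n0 = 2
deep = regions_deep_lower_bound([4, 4, 4], n0)   # 3 hidden layers of width 4
shallow = regions_shallow_upper_bound(12, n0)    # same total of 12 units
assert deep > shallow                            # 176 vs. 79 regions
```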

In fact, it is possible to obtain asymptotic bounds on the number of linear regions for the two model classes (with $n_0$ inputs and the same total number $Ln$ of hidden units):

• For a deep model with $L$ layers of width $n$, the bound is exponential in $L$: it can compute functions with $\Omega\left(\left(n/n_0\right)^{(L-1)n_0} n^{n_0}\right)$ linear regions.
• For a shallow model with $Ln$ hidden units, the bound is polynomial in $L$: it computes functions with at most $O\left(L^{n_0} n^{n_0}\right)$ linear regions.

# Conclusion

The number of linear regions the input space can be split into grows exponentially with the number of layers of a deep neural network, whereas it grows only polynomially with the number of neurons of a shallow network. This helps explain why deep neural networks perform so much better than shallow neural networks. The paper showed this result for deep rectifier networks and deep maxout networks, but the same analysis should be applicable to other types of deep neural networks.

Furthermore, the paper provides a useful intuition in terms of space folding to think about deep neural networks.