Learning the Number of Neurons in Deep Networks
Introduction
Due to the availability of large-scale datasets and powerful computation, Deep Learning has made huge breakthroughs in many areas, like Language Models and Computer Vision. In deep neural networks, we need to determine the number of layers and the number of neurons in each layer, i.e, we need to determine the number of parameters, or complexity of the model. Typically, this is determined by errors manually. Currently, this is mostly achieved by manually tuning these hyper-parameters using validation data or building very deep networks. However, building a very deep model is still challenging, especially for the very large datasets, which causes a high cost on memory and speed.
In this paper, we use an approach to automatically select the number of neurons in each layer when we learn the network. Our approach introduces a group sparsity regularizer on the parameters of the network, and each group acts on the parameters of one neuron, rather than trains an initial network as as pre-processing step(training shallow or thin networks to mimic the behaviour of deep ones [Hinton et al., 2014, Romero et al., 2015]). We set those useless parameters to zero, which cancels out the effects of a particular neuron. Therefore, our approach does not need to learn a redundant network successfully and then reduce its parameters, instead, it learns the number of relevant neurons in each layer and the parameters of those neurons simultaneously.
In the experiments of several image recognition datasets, we show the effectiveness of our approach, which reduces the number of parameters by up to 80% compared to the complete model, and has no recognition accuracy loss at the same time. Actually, our approach even yields more effective and faster networks, and occupies less memory.
Related Work
The recent researches tend to build very deep networks. Building very deep networks means we need to learn more parameters, which leads to a significant cost on the memory of the equipment as well as the speed. Even though automatic model selection has developed in the past years by constructive and destructive approaches, there are some drawbacks. For constructive method, it starts a super shallow architecture, and then adds additional parameters [Bello, 1992]. A similar work that adds new layers to the initial shallow networks was successfully employed [Simonyan and Zisserman, 2014] at the process of learning. However, we know shallow networks have fewer parameters, so that it can not handle the non-linearities as effectively as the deep networks [Montufar et al., 2014], so shallow networks may easily get stuck by the bad optima. Therefore, the drawback of this method is that these networks may produce poor initializations for the later processes. For destructive method, it starts by a deep network to reduce a significant number of redundant parameters [Denil et al., 2013, Cheng et al., 2015] while keeping its behaviour unchanged. Even though this technique has shown removing the redundant parameters [LeCun et al., 1990, Hassibi et al., 1993] or the neutrons [Mozer and Smolensky, 1988, Ji et al., 1990, Reed, 1993] has little influence on the output, it requires the analysis of each parameter and neuron by network Hessian, which is very computationally expensive for large architectures. The main motivation of these works was to build a more compact network.
Particularly, building a compact network is a research focus for Convolutional Neural Networks(CNNs). Some works has proposed to decompose the filters of a pre-trained network into low-rank filters, which reduces the number of parameters [Jaderberg et al., 2014b, Denton et al., 2014, Gong et al., 2014]. The issue of this proposal is that we need to successfully train an initial deep network, since it acts as as post-processing step. [Weigend et al., 1991] and [Collins and Kohl, 2014] used direct training to develop regularizers that eliminate some of the parameters of the network. The problem is that the number of layers and neurons each layer is determined manually.
Model Training and Model Selection
In general, a deep network has L layers containing linear operations on their inputs, intertwined with activation functions. The activation function we generally use is Rectified Linear Units(RELU) or sigmoids. Suppose each layer l has $N_{l}$ neurons, and each of them has parameters $\Theta=(\theta_{l})_{1\leqslant{l}\leqslant{L}}$, where $\theta_{l}=({\theta^n _{l}})_{1\leqslant{n}\leqslant{N_{l}}}$ and $\theta^n _{l}=[w_{l}^{n},b_{l}^{n}]$. Given an input $x$, under the linear, on-linear and pooling operations, we obtain the output $\hat{y}=f(x,\theta)$, where $f(*)$ encodes the succession of linear, non-linear and pooling operations.
At the step of training, we have N input-output pairs ${(x_{i},y_{i})}_{1\leqslant{i}\leqslant{N}}$, and the loss function is given by $\ell(y_{i},f(x_{i},\Theta))$, which compares the predicted output with the ground-truth ouoput. Generally, we choose logistic loss for classification and the square loss for regression. Therefore, learning the parameters of the network is equivalent to solving the optimization of the following: $$\displaystyle \min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},f(x_{i},\Theta))+\gamma(\Theta),$$ where $\gamma(\Theta)$ represents a regularizer on the network parameters. Our choice for the regularizer can be $\ell_{2}$-norm(i.e, weight decay) or $\ell_{1}$-norm. $\ell_{2}$-norm usually favours small parameter values, and $\ell_{1}$-norm can only delete those irrelevant parameters, but not the neurons. The goal in this paper is to automatically determine the number of neurons of each layer, but neither of the above techniques achieve this goal. Here, we make use of the group sparsity [Yuan and Lin., 2007] (starting from an overcomplete network and canceling the influence of some neurons). The regularizer, therefore, can be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2},$$ where $P_{l}$ means the size of the vector that includes the parameters of each neuron in layer $l$, and $\beta_{l}$ balances the influence of the penalty. In practice, we found the most effective way to select $\beta$ is a relatively small one for the first few layers, and a larger weight for the remaining layers. The reason we choose a small weight is that it can prevent deleting too much neurons in the first few layers, so that we have enough information for learning the remaining parameters.
The group sparsity helps us effectively remove some of the neurons, and also standard regularizers on the individual parameters are effective for the generalization purpose [Bartlett, 19996, Krogh and Hertz, 1992, Theodoridis, 2015, Collins and Kohli, 2014]. By this idea, we introduce sparse group Lasso, from which the regularizer can be written as $$\gamma(\Theta)=\sum_{l=1}^{L}((1-\alpha)\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2}+\alpha\beta_{l}||\theta_{l}||_{1},$$ where $\alpha\in[0,1]$. We find that if $\alpha=0$, then we have the sparsity group regularizer. In practice, we use both $\alpha=0$ and $\alpha=0.5$ in the experiments.
To find the optimization, in this paper we use proximal gradient descent [Parikh and Boyed, 2014]. This approach iteratively takes a gradient step of size t with respect to the loss. The following is the algorithm for it: We define proximal mapping as $$prox_{t}(x)=\displaystyle \min_{z}\frac{1}{2t}||x-z||_{2}^{2}+h(z)$$ We initialize $x^{(0)}$, and then repeat $$x^{(k)}=prox_{t_{k}}(x^{k-1}-t_{k}\nabla{g}(x^{k-1})), k=1,2,3...$$
Therefore, we can update our parameter by the above method as $$\tilde{\theta}_{l}^{n}=\displaystyle \min_{\theta_{l}^{n}}\frac{1}{2t}||\theta_{l}^{n}-\hat{\theta}_{l}^{n}||_{2}^{2}+\gamma(\Theta),$$ where $\hat{\theta}_{l}^{n}$ is the solution obtained from the general loss gradient. By the derivative of [Simon et al., 2013], we have a closed-form solution for this problem: $$\tilde{\theta}_{l}^{n}=(1-\frac{t(1-\alpha)\beta_{l}\sqrt{P_{l}}}{||S(\hat{\theta}_{l}^{n},t\alpha\beta_{l})||_{2})})_{+}S(\hat{\theta}_{l}^{n},t\alpha\beta_{l}),$$ where + refers to taking the maximum between the argument and 0, and $S(*)$ is $$S(a,b)=sign(a)(|a|-b)_{+}$$ In practice, we use stochastic gradient descent and work with mini-batches, and then update the variables of all the groups according to the closed-form of $\tilde{\theta}_{l}^{n}$. When the learning steps terminate, we remove the neurons whose parameters have gone to zero.