stat946w18/Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers


Introduction

With the recent and ongoing surge in low-power, intelligent agents (such as wearables, smartphones, and IoT devices), there is a growing need for machine learning models that work well in memory- and CPU-constrained environments. Deep learning models have achieved state-of-the-art results on a broad range of tasks; however, they are difficult to deploy in their original forms. For example, AlexNet (Krizhevsky et al., 2012), a model for image classification, contains 61 million parameters and requires 1.5 billion FLOPs for one inference pass. A more accurate model, ResNet-50 (He et al., 2016), has 25 million parameters but requires 4.08 billion FLOPs. Clearly, it would be difficult to deploy and run these models on low-power devices.

In general, model compression can be accomplished using four main, non-exclusive methods (Cheng et al., 2017): weight pruning, quantization, matrix transformations, and weight tying. Ye et al. (2018) explore pruning entire channels in a convolutional neural network. Past work has mostly relied on norm- or error-based heuristics to decide which channels to prune; instead, Ye et al. (2018) show that their approach is "mathematically appealing from an optimization perspective and easy to reproduce". In other words, they argue that the norm-based assumption is neither as informative nor as theoretically justified as their approach, and they provide strong empirical findings in support.

Motivation

Previous work on pruning channel filters (Li et al., 2016; Molchanov et al., 2016) has largely used the L1 norm to determine the importance of a channel. Ye et al. (2018) show that, in the deep linear convolution case, penalizing the per-layer norm is coarse-grained: its impact on the total loss is difficult to measure. Consider the LASSO-penalized loss:

$$\min_{W_1, \dots, W_n} \; \mathbb{E}_{(x, y) \sim D} \left\lVert W_n * \dots * W_1 * x - y \right\rVert^2 + \lambda \sum_{i=1}^n \lVert W_i \rVert_1$$

where [math]\displaystyle{ W_i }[/math] are the layer weights and [math]\displaystyle{ * }[/math] denotes convolution. Ye et al. (2018) show that, by multiplying one layer's weights by some constant [math]\displaystyle{ \alpha }[/math] and dividing an adjacent layer's weights by the same constant, the network's output (and hence the data-fitting term of the loss) is left unchanged while the norm penalty can be reduced; the penalty can therefore be shrunk without making the network any simpler. Furthermore, batch normalization (Ioffe & Szegedy, 2015) makes a layer's output invariant to the scale of its incoming weights, so it is incompatible with this kind of weight-norm regularization. In other words, penalizing the norm of a filter in a deep convolutional network is hard to justify from a theoretical perspective.
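
To make the rescaling argument concrete, here is a brief worked sketch (an illustration consistent with the argument above, not an excerpt from the paper). Replace [math]\displaystyle{ W_i }[/math] with [math]\displaystyle{ \alpha W_i }[/math] and [math]\displaystyle{ W_{i+1} }[/math] with [math]\displaystyle{ \tfrac{1}{\alpha} W_{i+1} }[/math] for some [math]\displaystyle{ \alpha \gt 0 }[/math]. By linearity of convolution, the composition of the two layers is unchanged,

$$\left(\tfrac{1}{\alpha} W_{i+1}\right) * \left(\alpha W_i\right) = W_{i+1} * W_i,$$

so the data-fitting term is untouched, while these two layers contribute

$$\lambda \left( \alpha \lVert W_i \rVert_1 + \tfrac{1}{\alpha} \lVert W_{i+1} \rVert_1 \right)$$

to the penalty. Minimizing over [math]\displaystyle{ \alpha }[/math] gives [math]\displaystyle{ \alpha^* = \sqrt{\lVert W_{i+1} \rVert_1 / \lVert W_i \rVert_1} }[/math] and a penalty of [math]\displaystyle{ 2 \lambda \sqrt{\lVert W_i \rVert_1 \lVert W_{i+1} \rVert_1} \le \lambda \left( \lVert W_i \rVert_1 + \lVert W_{i+1} \rVert_1 \right) }[/math] by the AM-GM inequality, with equality only when the two norms are already equal. The penalty thus reflects how weight magnitudes are distributed across layers rather than how informative any particular filter is.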

Thus, although it does not come with a complete theoretical guarantee on the loss, Ye et al. (2018) develop a pruning technique that they argue is better justified than norm-based pruning.

Method

At a high level, Ye et al. (2018) propose to enforce sparsity not by penalizing the per-filter or per-channel weight norm, but by penalizing the batch normalization scale parameters [math]\displaystyle{ \gamma }[/math]. The reasoning is that with fewer parameters to constrain, and with those parameters acting on normalized values, sparsity is easier to enforce, monitor, and learn. Driving batch normalization scales to zero has the effect of pruning entire channels: if a channel's [math]\displaystyle{ \gamma }[/math] is zero, its output becomes constant (the batch normalization bias term), so the filter that produces that channel can be removed and its constant contribution absorbed into the following layers.
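
The following is a minimal PyTorch-style sketch of this idea (an illustration only: it is not the authors' exact training procedure, and the penalty strength lam and pruning threshold used here are arbitrary placeholders). It adds an L1 penalty on the batch normalization scales to the task loss, then lists the channels whose scale has been driven to (near) zero:

<pre>
import torch
import torch.nn as nn

# A small Conv -> BatchNorm -> ReLU stack used only to illustrate the idea.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(),
)

lam = 1e-4  # sparsity strength (illustrative value)

def gamma_l1_penalty(net):
    # Sum of |gamma| over all batch-norm layers; gamma is stored in .weight.
    return sum(m.weight.abs().sum()
               for m in net.modules()
               if isinstance(m, nn.BatchNorm2d))

# One training step: the sparsity penalty is simply added to the task loss.
x = torch.randn(8, 3, 32, 32)
task_loss = model(x).pow(2).mean()  # placeholder task loss
loss = task_loss + lam * gamma_l1_penalty(model)
loss.backward()

# After training, channels whose gamma is (near) zero can be pruned,
# along with the convolution filters that produce them.
threshold = 1e-3  # illustrative cut-off
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        prunable = (m.weight.abs() < threshold).nonzero(as_tuple=True)[0]
        print(type(m).__name__, "prunable channels:", prunable.tolist())
</pre>

In practice the constant outputs of pruned channels would be folded into the biases of the following layer so that the network's function is preserved; the sketch above only identifies which channels are candidates for removal.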

References

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
  • Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
  • Ye, J., Lu, X., Lin, Z., & Wang, J. Z. (2018). Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124.
  • Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710.
  • Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.
  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456).