Summary of A Probabilistic Approach to Neural Network Pruning
Stefan Vladusic, Waqas Hamed, Justin D'Astous, Ethan O'Farrell
High performance neural networks often have a very large number of training parameters. [Frankle and Carbin] propose the Lottery Ticket Hypothesis that a trained neural network will contain many smaller subnetworks containing the most important logic. When training these subnetworks in isolation after the initialization, they can acheive similar accuracy compared to the original network. [Ramanujan et al.] further proposes that the subnetworks can achieve similar accuracy without having to be further trained. However, finding these lottery tickets inside a large neural network is computationally expensive (NP hard in general).
There has been little research on the theoretical guarantees of pruning. This study, A Probabilistic Approach to Neural Network Pruning by Xin Qian and Diego Klabjan, focuses on the theoretical results from creating subnetworks by pruning weights. It covers pruning random weights in Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs) as well as pruning weights based on their magnitude in FCNs. It does not cover the theoretical implications of pruning entire neurons.
Empirical Network Pruning: The theory on neural network pruning has transitioned from pruning individual connections based on the loss function [Lecun et al] to pruning connections based on its magnitude [Han et al]. Papers began to focus on pruning entire neurons [Hu et al] or convolution channels [Li et al] that could be redundant. Recently, the focus has shifted to instead find a smaller subnetwork inside an overparameterized network with comparable predictive power and prune the rest.
Theoretical Study of Network Pruning: Some research has looked into pruning FCNs with ReLU activation. Their work used the idea that a single ReLU neuron can be approximated using a two-hidden-layer neural network with constant width [Pensia et al, Orseau et al]. Other methods of pruning currently being researched are a greedy optimization based pruning method [Ye et al] and sampling-based pruning algorithms based on sensitivity scores [Baykal et al, Liebenwein et al].
Deep neural networks can become computationally intensive and memory intensive. To tackle that people have employed neural network pruning in recent years. Pruning is simply the act of reducing the weights of a neural network layer. As explained in detail later, pruning can reduce the complexity of the network while keeping the difference in errors of the original neural network and the pruned neural network minimal. Pruning is either performed on individual weights or on entire neurons (whole column in a weight matrix). In the paper, only pruning individual weights has been discussed.
Pruning is a 3-step process that can be summarized as:
1. Train the network.
2. Prune the network.
3. Re-train the network.
The figure [Han et al, 2015] below demonstrates both pruning individual weights/synapses and pruning neurons.
Magnitude based pruning refers to reducing the number of weights in a layer based on the weight’s absolute value. The intuition behind this is that the weights with values close to zero, do not contribute significantly to the predictive ability of the network. To perform this, a mask matrix M, which has the same size as the weight matrix, is used. M has all entries equal to 1, except elements corresponding to weights to be pruned where the entry is instead 0. These weights are selected based on their magnitude. In the pruned network, the mask is multiplied element-wise with the weight matrix before re-training.
Random pruning also uses a mask matrix M. The weights to be pruned are decided randomly selected.
Pruning Fully Connected Networks
Issues raised by Balock et al argue that in general, the community surrounding neural network pruning research suffers from a lack of standardization. Definitions tend to be ambiguous and comparisons are difficult to make between performances of different pruning algorithms. Overall, this sector will see much more research and focus in the near future. This paper is a foundation for theoretical works regarding pruning both FCNs and CNNs. Importantly, since nearly all network architectures can be replaced by some combination of FCNs and CNNs, this paper has application in pruning many different architectures. Applying these results to past works by Frankle & Carbin in 2018 gives a method to determine an upper bound on the number of weights available for pruning in a given layer in order to maintain the desired error margin.
The paper’s analysis assumes that weights in the original network are independent. This assumption is not ideal, especially in CNNs. While the paper’s appendix discusses ways around it, no solution could be found. The authors suggest that an in-depth study of this assumption is warranted in the future. Another topic not explored fully is using gradients of the weights in the original network to determine more optimal pruning algorithms.
 Baykal, C., Liebenwein, L., Gilitschenski, I., Feldman, D., and Rus, D. Sipping neural networks: Sensitivity informed provable pruning of neural networks. arXiv preprint arXiv:1910.05422, 2019b.
 Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. International Conference on Learning Representations, 2018.
 Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations, 2015a.
 Hu, H., Peng, R., Tai, Y.-W., and Tang, C.-K. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
 LeCun, Y., Denker, J. S., Solla, S. A., Howard, R. E., and Jackel, L. D. Optimal brain damage. Advances in Neural Information Processing Systems, volume 2, pp. 598–605, 1989
 Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. International Conference on Learning Representations, 2016.
 Liebenwein, L., Baykal, C., Lang, H., Feldman, D., and Rus, D. Provable filter pruning for efficient neural networks. International Conference on Learning Representations, 2020
 Pensia, A., Rajput, S., Nagle, A., Vishwakarma, H., and Papailiopoulos, D. Optimal lottery tickets via subset sum: Logarithmic over-parameterization is sufficient. Advances in Neural Information Processing Systems, 33, 2020.
 Orseau, L., Hutter, M., and Rivasplata, O. Logarithmic pruning is all you need. Advances in Neural Information Processing Systems, 33, 2020.
 Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., and Rastegari, M. What’s hidden in a randomly weighted neural network? Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11893–11902, 2020
 Ye, M., Gong, C., Nie, L., Zhou, D., Klivans, A., and Liu, Q. Good subnetworks provably exist: Pruning via greedy forward selection. In International Conference on Machine Learning, pp. 10820–10830, 2020a.
 Ye, M., Wu, L., and Liu, Q. Greedy optimization provably wins the lottery: Logarithmic number of winning tickets is enough. arXiv preprint arXiv:2010.15969, 2020b.