Pruning Deep Networks
What Is Pruning, and Why Do We Need It?
With increasing amounts of data and computational power, deep learning models have become larger and deeper in order to learn better from data.
Deploying these large, accurate models to resource-constrained computing environments such as mobile phones and smart cameras poses several key challenges.
To confront these challenges, a growing body of work has emerged on methods for compressing neural network models while limiting any loss in model quality.
Model pruning is a popular approach that reduces a heavy network to a lightweight form by removing redundancy.
How Does It Work?
Let us consider a neural network as a function family f(x; ·). The architecture consists of the configuration of the network's parameters and the set of operations it uses to produce outputs from inputs, including the arrangement of parameters into convolutions, activation functions, pooling, batch normalization, and so on. We define a neural network model as a particular parameterization of an architecture, i.e., f(x; W) for specific parameters W. Neural network pruning takes as input a model f(x; W) and produces a new model f(x; M ⊙ W′). Here W′ is a set of parameters that may differ from W, M is a binary mask that fixes certain parameters to 0, and ⊙ is the elementwise product operator. In practice, rather than using an explicit mask, pruned parameters of W are fixed to zero or removed entirely.
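As a toy illustration of this definition, here is a minimal sketch in PyTorch; the tensor shape and the 0.5 threshold are made up for illustration:

```python
import torch

# Pruning maps f(x; W) to f(x; M ⊙ W'): a binary mask M fixes some
# parameters to zero via an elementwise product.
W = torch.randn(4, 4)             # original parameters W
M = (W.abs() >= 0.5).float()      # binary mask M (0 where a weight is pruned)
W_pruned = M * W                  # elementwise product M ⊙ W
print(f"kept {int(M.sum())} of {M.numel()} parameters")
```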
Pruning From Scratch
This method prunes redundant connections in three steps:
1 — Train the network to learn which connections are important.
2 — Prune the unimportant connections: remove all connections whose weight magnitude is below a threshold. This pruning converts a dense, fully-connected layer to a sparse layer.
3 — Retrain the network to fine-tune the weights of the remaining connections.
The pruning and retraining phases may be repeated iteratively to further reduce network complexity, as sketched in the code below.
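A minimal PyTorch sketch of one prune-and-retrain round, assuming a single fully-connected layer; the toy data, the 0.05 threshold, and the training loop are illustrative stand-ins, not code from the original paper:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(100, 10)                         # stand-in for a trained layer
x, y = torch.randn(32, 100), torch.randn(32, 10)   # toy data

# Step 2: prune connections whose weight magnitude falls below a threshold.
threshold = 0.05                                   # assumed value, tuned in practice
mask = (layer.weight.abs() >= threshold).float()
with torch.no_grad():
    layer.weight.mul_(mask)                        # the dense layer becomes sparse

# Step 3: retrain, re-applying the mask after every update so that
# pruned connections stay at zero.
optimizer = torch.optim.SGD(layer.parameters(), lr=0.01)
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(layer(x), y)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        layer.weight.mul_(mask)
```

In practice the same loop is wrapped around a full model and dataset, and the prune-retrain pair is repeated until the target sparsity is reached.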
The last step is critical. If the pruned network is used without retraining, accuracy is significantly impacted.
During retraining, it is better to retain the weights from the initial training phase for the connections that survived pruning than to re-initialize the pruned layers. CNNs contain fragile, co-adapted features: gradient descent can find a good solution when the network is trained from the start, but not after re-initializing some layers and retraining them. So when we retrain the pruned layers, we should keep the surviving parameters instead of re-initializing them.
Also, neural networks are prone to the vanishing gradient problem as they get deeper, which makes errors introduced by pruning harder to recover from in deep networks. To prevent this, we fix the parameters of the CONV layers and retrain only the FC layers after pruning the FC layers, and vice versa.
Regularization
Choosing the correct regularization affects the performance of pruning and retraining. L1 regularization penalizes non-zero parameters, resulting in more parameters near zero. This gives better accuracy after pruning but before retraining. However, the remaining connections are not as good as with L2 regularization, resulting in lower accuracy after retraining. Overall, L2 regularization gives the best pruning results.
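For concreteness, a sketch of how the two penalties are typically added in PyTorch; the model, data, and coefficients are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 10)                      # stand-in model
x, y = torch.randn(8, 100), torch.randn(8, 10)  # toy data

# L2 regularization is usually applied through the optimizer's weight decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# L1 regularization is added to the loss explicitly, pushing weights toward zero.
l1_lambda = 1e-5                                # assumed coefficient
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = nn.functional.mse_loss(model(x), y) + l1_lambda * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```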
Dropout
Dropout is widely used to prevent over-fitting, and this also applies to retraining. During retraining, however, the dropout ratio must be adjusted to account for the change in model capacity. As the parameters get sparse, the classifier will select the most informative predictors and thus have much less prediction variance, which reduces over-fitting. As pruning already reduced model capacity, the retraining dropout ratio should be smaller.
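Han et al. (2015) make this concrete by scaling the dropout ratio with the square root of the change in connection count, D_r = D_o * sqrt(C_ir / C_io), where C_io and C_ir are the connection counts before and after pruning. A small sketch:

```python
import math

def adjusted_dropout(d_orig: float, conn_orig: int, conn_retrain: int) -> float:
    """Scale the dropout ratio by the square root of the ratio of
    remaining to original connections (Han et al., 2015)."""
    return d_orig * math.sqrt(conn_retrain / conn_orig)

# Pruning a layer from 1,000,000 to 100,000 connections shrinks a
# dropout ratio of 0.5 to roughly 0.16.
print(adjusted_dropout(0.5, 1_000_000, 100_000))  # ≈ 0.158
```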
Pruning Neurons
After pruning connections, neurons with zero input connections or zero output connections may be safely pruned as well, removing all of their remaining connections. The retraining phase arrives at this state automatically: dead neurons end up with both zero input connections and zero output connections, due to gradient descent and regularization. A neuron with zero input connections (or zero output connections) makes no contribution to the final loss, so the gradient is zero for its output connections (or input connections, respectively). Only the regularization term pushes those weights to zero, and thus the dead neurons are removed automatically during retraining.
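A sketch of how such dead neurons can be detected in a fully-connected layer once retraining has driven their weights to zero; the shapes are illustrative:

```python
import torch

def dead_neurons(w_in: torch.Tensor, w_out: torch.Tensor) -> torch.Tensor:
    """A hidden neuron is dead when all of its input connections
    (a row of w_in) or all of its output connections (a column of
    w_out) are zero; such neurons can be removed outright.
    w_in: (n_hidden, n_prev), w_out: (n_next, n_hidden)."""
    no_inputs = (w_in == 0).all(dim=1)
    no_outputs = (w_out == 0).all(dim=0)
    return no_inputs | no_outputs    # True where a neuron is prunable
```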
Differences Between Pruning Methods
Pruning methods vary primarily in their choices regarding sparsity structure, scoring, scheduling, and fine-tuning:
Structure:
Some methods prune individual parameters (unstructured pruning). Doing so produces a sparse neural network which, although smaller in terms of parameter count, may not be arranged in a fashion conducive to speedups with modern libraries and hardware. Other methods consider parameters in groups (structured pruning), removing entire neurons, filters, or channels to exploit hardware and software optimized for dense computation.
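The contrast in a small sketch; the convolution shape and the median-based scoring rule are illustrative:

```python
import torch

w = torch.randn(16, 8, 3, 3)   # a conv weight: 16 filters of shape (8, 3, 3)

# Unstructured: mask individual weights; the tensor keeps its shape
# but becomes sparse.
unstructured = w * (w.abs() >= w.abs().median()).float()

# Structured: score whole filters (here by L1 norm) and drop roughly
# the weakest half, physically shrinking the layer.
scores = w.abs().sum(dim=(1, 2, 3))
structured = w[scores >= scores.median()]
```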
Scoring:
It is common to score parameters based on their absolute values, trained importance coefficients, or contributions to network activations or gradients. Some pruning methods compare scores locally, pruning a fraction of the parameters with the lowest scores within each structural subcomponent of the network.
Others consider scores globally, comparing scores to one another irrespective of the part of the network in which a parameter resides.
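A sketch of the difference; the layer shapes and the pruning fraction are illustrative:

```python
import torch

layers = {"fc1": torch.randn(100, 50), "fc2": torch.randn(10, 100)}
frac = 0.5   # prune the lowest-scoring half

# Local: each layer gets its own threshold, so every layer ends up
# with the same sparsity.
local_masks = {
    name: (w.abs() >= w.abs().flatten().quantile(frac)).float()
    for name, w in layers.items()
}

# Global: one threshold is shared across the network, so sparsity can
# concentrate in the layers with the smallest weights.
threshold = torch.cat([w.abs().flatten() for w in layers.values()]).quantile(frac)
global_masks = {name: (w.abs() >= threshold).float() for name, w in layers.items()}
```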
Scheduling:
Pruning methods differ in the amount of the network to prune at each step. Some methods prune all desired weights at once in a single step. Others prune a fixed fraction of the network iteratively over several steps or vary the rate of pruning according to a more complex function.
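Under an iterative schedule that removes a fixed fraction of the remaining weights at each step (the 20% rate here is illustrative), the overall sparsity compounds:

```python
# Each step prunes 20% of the weights that remain, so after n steps
# the remaining density is 0.8 ** n.
density = 1.0
for step in range(1, 8):
    density *= 0.8
    print(f"step {step}: sparsity {1 - density:.1%}")  # step 7 ≈ 79%
```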
Fine-tuning:
For methods that involve fine-tuning, it is most common to continue to train the network using the trained weights from before pruning. Alternative proposals include rewinding the network to an earlier state and reinitializing the network entirely.
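A sketch of the rewinding variant; the model, the choice of rewind point, and the magnitude-based masks are assumptions for illustration:

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(100, 10)                         # stand-in model
rewind_state = copy.deepcopy(model.state_dict())   # snapshot from early in training

# ... train to convergence, then compute pruning masks, e.g. by magnitude ...
masks = {name: (p.abs() >= 0.05).float()
         for name, p in model.named_parameters() if p.dim() > 1}

# Rewind: restore the early weights, then re-apply the masks so the
# pruned connections stay at zero during the subsequent fine-tuning.
model.load_state_dict(rewind_state)
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in masks:
            p.mul_(masks[name])
```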
How Effective Is Pruning?
It has been shown repeatedly that, at least at high levels of pruning, many pruning methods outperform random pruning. Interestingly, this does not always hold at low levels of pruning.
Similarly, pruning all layers uniformly tends to perform worse than intelligently allocating parameters to different layers or pruning globally.
Lastly, when the number of fine-tuning iterations is held constant and the amount of pruning is large enough, many methods produce pruned models that outperform retraining from scratch with the same sparsity pattern.
Retraining from scratch in this context means training a fresh, randomly initialized model with all weights clamped to zero throughout training, except those that are nonzero in the pruned model.
Another consistent finding is that sparse models tend to outperform dense ones for a fixed number of parameters.
Perhaps most compelling of all are the many results showing that pruned models can obtain higher accuracies than the original models from which they are derived. This demonstrates that sparse models can not only outperform dense counterparts with the same number of parameters, but sometimes also dense models with even more parameters.
References:
Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.