Neural Network Quantization Resources

This list collects resources on neural network quantization. Quantization is moving from research to industry (that is, real applications) nowadays (as of the beginning of 2019). Hopefully this list helps :)

The resources are grouped into sections, which may contain several subsections. The categories should be easy to understand. Recommended materials are marked with ★. Leave comments to collaborate :)

Introductions

Resources that help readers gain a basic understanding of this field.

★ Neural Network Quantization Introduction (2019) pays special attention to the arithmetic behind quantization.
★ Quantization document of Nervana Neural Network Distiller (2018) introduces the key knowledge of quantization.
Making Neural Nets Work With Low Precision mainly talks about TensorFlow Lite, with a brief introduction to quantization.
What I’ve learned about neural network quantization summarizes quantization-related hardware support and software trends in 2017.
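
As a rough illustration of the arithmetic these introductions cover, here is a minimal sketch of affine (scale and zero-point) quantization to unsigned 8-bit in plain Python. The function names `choose_qparams`, `quantize` and `dequantize` are made up for this example and are not taken from any of the linked resources; real frameworks differ in rounding and range handling.

```python
def choose_qparams(x_min, x_max, qmin=0, qmax=255):
    """Pick a scale and zero point so that [x_min, x_max] maps onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that real zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        return 1.0, qmin  # degenerate all-zero range
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to an integer code, saturating at the range limits."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate real value."""
    return scale * (q - zero_point)


scale, zero_point = choose_qparams(-1.0, 3.0)
codes = [quantize(x, scale, zero_point) for x in (-1.0, 0.0, 0.5, 3.0)]
restored = [dequantize(q, scale, zero_point) for q in codes]
```

For values inside the calibrated range, the round-trip error stays within half a quantization step (`scale / 2`); values outside the range saturate at `qmin` or `qmax`.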

Research

Quantization attracted researchers’ attention even in the early days of deep learning.

Binary Network. The most significant advantage of binary networks is that they don’t need multiplication anymore: it is transformed into logic operations.
- Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 (2016): neural networks with binary weights and activations at run-time and when computing the parameters’ gradient at train-time.
- Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations (2016) uses 1 bit for both the forward and backward passes (weight updates remain 32-bit). It gets 51% Top-1 accuracy for quantized AlexNet.
- BinaryConnect: Training Deep Neural Networks with binary weights during propagations (2016).
Ternary Weight Networks (2016): neural networks with weights constrained to +1, 0 and -1.
XNOR Network (2016): the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations.
★ Deep Compression (2016) combines pruning, quantization and encoding to reduce the storage requirement by 35x (AlexNet) to 49x (VGG-19) without affecting accuracy. The paper shows that 8 bits are required for quantized convolution layers to avoid significant accuracy loss, while 4 bits are sufficient for fully connected layers.
Fixed point quantization of deep convolutional networks (Qualcomm, 2016) collects statistics of weights, activations and biases, and then performs an SQNR analysis to figure out the best bit-width for each layer. Their experiments show that, in comparison to equal bit-width settings, the fixed point DCNs with optimized bit-width allocation offer > 20% reduction in model size without any loss in accuracy on the CIFAR-10 benchmark.
Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy shows only a 0.1% accuracy drop with a ternary network, which is pretty impressive.
Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks still seems to be floating point, but the point (exponent) of the quantized tensor is flexible, being learned from the model.
★ TensorRT Calibration uses KL Divergence (2017) to find the best scale which maps FP32 to INT8. The KL divergence compares the distributions of the quantized and non-quantized activation outputs of each operator, to evaluate the information loss caused by quantization. The mapping scale that has the minimal KL divergence is chosen.
High-Accuracy Low-Precision Training aims to maintain accuracy by using stochastic variance-reduced gradient to reduce gradient variance, and to combine this with a novel technique called bit centering to reduce quantization error.
Mixed precision training of CNNs using integer operations (2018) uses a Dynamic Fixed Point technique which achieves or exceeds SOTA results for ResNet-50, GoogLeNet-v1, VGG-16 and AlexNet on ImageNet with a 1.8X performance improvement.
Two-Step Quantization for Low-bit Neural Networks (2018): the two steps are code learning and transformation function learning based on the learned codes. The authors tried their method with different bit-widths, and for binary and ternary weight quantization of AlexNet they outperform SOTA work.
Weighted-Entropy-based Quantization for Deep Neural Networks proposes a scheme which can choose the quantization bits according to the accuracy target. The authors performed experiments not only on image classification but also on segmentation and natural language modeling.
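
To make the Binary Network point above concrete (multiplications replaced by logic operations), here is a minimal Python sketch of a dot product between two {+1, -1} vectors using XNOR and a population count. The helper names `pack_signs` and `binary_dot` are illustrative only and are not taken from any of the listed papers, which work on packed machine words rather than Python integers.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into one integer (bit i is 1 for +1, 0 for -1)."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("values must be +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {+1, -1} vectors of length n, with no per-element multiply."""
    mask = (1 << n) - 1
    # XNOR is 1 where the signs agree (product +1) and 0 where they differ (product -1).
    agree = ~(a_bits ^ b_bits) & mask
    matches = bin(agree).count("1")  # population count
    # Each match contributes +1 and each of the remaining n - matches contributes -1.
    return matches - (n - matches)


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
result = binary_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))  # ordinary dot product for comparison
```

Here `result` and `reference` agree, since counting agreeing sign bits carries the same information as summing the element-wise products.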

Software

Mainly software that enables quantized neural networks.

TensorQuant allows a transparent quantization simulation of existing DNN topologies during training and inference. TensorQuant supports generic quantization methods and allows experimental evaluation of the impact of the quantization on single layers as well as on the full topology.
★ Nervana Neural Network Distiller (2018) is a Python package for neural network compression research.
Nvidia TensorRT (2017) uses calibration to improve the accuracy of quantized networks.
Post-training quantization is supported by TensorFlow, PyTorch, MXNet and so on.
★ Quantization-aware Training (CVPR paper, 2018) simulates quantization arithmetic in the forward pass when training.
MXNet provides example usage of quantization based on MKL-DNN Model Optimization and cuDNN.
MKL-DNN, as the accelerating library for Intel CPUs, provides post-training quantization techniques and sound performance. See Lower Numerical Precision Deep Learning Inference and Training. MKL-DNN has been integrated into most popular frameworks such as TensorFlow, Caffe(2) and MXNet. It also claims support for 16-bit low precision during training.
★ Gemmlowp (2015) is not a full linear algebra library, but focuses on low-precision computing. Gemmlowp is used in TensorFlow Lite (conv and fc) to accelerate quantization arithmetic on CPU. Gemmlowp also provides many quantization utilities, such as SaturatingRoundingDoublingHighMul in Conv and so on.
★ QNNPACK (news, 2018) is a mobile-optimized implementation of quantized neural network operators. QNNPACK is integrated into PyTorch/Caffe2. QNNPACK aims to improve performance for quantized neural networks only, and probably for mobile platforms only. It assumes that the model size is small, and uses specially designed kernels. We observed that QNNPACK outperforms most acceleration libraries dedicated to quantization.
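
The calibration idea mentioned for TensorRT above (pick the FP32-to-INT8 mapping with the smallest KL divergence) can be sketched in plain Python as follows. This is a simplified illustration under assumptions, not TensorRT's actual implementation: the function names, the histogram size, the `eps` guard and the way clipped outliers are folded into the last bin are all choices made for this example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two histograms of equal length; both are normalized here."""
    p_total, q_total = float(sum(p)), float(sum(q))
    kl = 0.0
    for p_count, q_count in zip(p, q):
        if p_count > 0:
            p_prob = p_count / p_total
            q_prob = max(q_count / q_total, eps)  # eps avoids log(0) for empty Q bins
            kl += p_prob * math.log(p_prob / q_prob)
    return kl


def histogram(abs_values, num_bins, max_value):
    """Count absolute values into num_bins equal-width bins over [0, max_value]."""
    hist = [0] * num_bins
    for v in abs_values:
        hist[min(int(v / max_value * num_bins), num_bins - 1)] += 1
    return hist


def quantize_histogram(hist, num_levels):
    """Merge bins into num_levels groups, then spread each group's count evenly
    over its non-empty bins (a coarse model of what quantization does)."""
    n = len(hist)
    out = [0.0] * n
    for level in range(num_levels):
        start, stop = level * n // num_levels, (level + 1) * n // num_levels
        nonzero = sum(1 for c in hist[start:stop] if c > 0)
        if nonzero:
            share = sum(hist[start:stop]) / nonzero
            for i in range(start, stop):
                if hist[i] > 0:
                    out[i] = share
    return out


def calibrate_threshold(values, num_bins=2048, num_levels=128):
    """Return the clipping threshold whose quantized histogram has minimal KL divergence."""
    abs_values = [abs(v) for v in values]
    max_value = max(abs_values)
    if max_value == 0:
        return 0.0
    hist = histogram(abs_values, num_bins, max_value)
    best_kl, best_bins = float("inf"), num_bins
    for i in range(num_levels, num_bins + 1):
        if sum(hist[:i]) == 0:
            continue  # nothing falls below this threshold
        reference = hist[:i]
        reference[-1] += sum(hist[i:])  # fold clipped outliers into the last bin
        candidate = quantize_histogram(hist[:i], num_levels)
        kl = kl_divergence(reference, candidate)
        if kl < best_kl:
            best_kl, best_bins = kl, i
    return best_bins * max_value / num_bins
```

Under these assumptions, a symmetric INT8 scale would then be `threshold / 127`. Activations with a few large outliers tend to get a threshold well below their maximum, trading clipping of rare values for finer resolution on common ones.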