This list collects resources on neural network quantization. Quantization is moving from research into industry (I mean real applications) nowadays (as of the beginning of 2019). Hopefully this list helps :)

The resources are categorized into sections, which may contain several subsections. The categories should be easy to understand. Recommended materials are marked with ★. Leave comments to collaborate :)

Introductions

Resources that help people gain a basic understanding of this field.

Research

Quantization caught researchers' attention even in the early days of deep learning.

Software

Mainly software that enables quantized neural networks.

  • TensorQuant allows a transparent quantization simulation of existing DNN topologies during training and inference. TensorQuant supports generic quantization methods and allows experimental evaluation of the impact of the quantization on single layers as well as on the full topology.
  • Nervana Neural Network Distiller (2018) is a Python package for neural network compression research.
  • Nvidia TensorRT (2017) uses calibration to improve the accuracy of quantized networks.
  • Post-training quantization is supported by TensorFlow, PyTorch, MXNet, and others.
  • Quantization-aware training (CVPR paper, 2018) simulates quantization arithmetic in the forward pass during training (see the fake-quantization sketch after this list).
  • MXNet provides example usage of quantization based on MKL-DNN model optimization and cuDNN.
  • MKL-DNN, the acceleration library for Intel CPUs, provides post-training quantization techniques with solid performance. See Lower Numerical Precision Deep Learning Inference and Training. MKL-DNN has been integrated into most popular frameworks such as TensorFlow, Caffe(2), and MXNet. They also declare support for 16-bit low precision during training.
  • Gemmlowp (2015) is not a full linear algebra library, but focuses on low-precision computing. Gemmlowp is used in TensorFlow Lite (conv and fc) to accelerate quantized arithmetic on CPU. Gemmlowp also provides many quantization utilities, such as SaturatingRoundingDoublingHighMul used in conv, and so on (see the fixed-point sketch after this list).
  • QNNPACK (news, 2018) is a mobile-optimized implementation of quantized neural network operators. QNNPACK is integrated into PyTorch/Caffe2. QNNPACK targets quantized neural networks only, and probably mobile platforms only: it assumes that models are small and designs dedicated kernels for them. We observed that QNNPACK outperforms most dedicated quantization acceleration libraries.
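
To make the quantization-aware training idea above concrete, here is a minimal fake-quantization sketch in NumPy. It only illustrates the quantize-dequantize step applied to a tensor in the forward pass; the function name, the min/max-based scale, and the 8-bit range are assumptions for illustration, not the API of any particular framework.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize to integers, then dequantize back to float (a sketch).

    This illustrates the simulated-quantization step used conceptually in
    quantization-aware training; not taken from any specific framework.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # Derive scale and zero point from the observed min/max, keeping 0 exactly
    # representable (needed e.g. for zero padding in conv layers).
    x_min, x_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    # Round and clamp to the integer range, then map back to float.
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

# The output is still float, but takes at most 256 distinct values, so the
# network "sees" the rounding error during training and can adapt to it.
x = np.random.randn(4, 4).astype(np.float32)
print(fake_quantize(x))
```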
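
As a taste of the gemmlowp-style fixed-point utilities mentioned above, below is a hedged Python sketch of SaturatingRoundingDoublingHighMul: it returns the rounded high 32 bits of twice the product of two int32 values (i.e. round(a*b / 2^31)), saturating the single input pair that would overflow. Treat it as an illustration of the documented behavior, not the reference implementation.

```python
INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1

def saturating_rounding_doubling_high_mul(a: int, b: int) -> int:
    """Sketch of gemmlowp's SaturatingRoundingDoublingHighMul for int32 inputs.

    Computes round(a * b / 2**31), i.e. a fixed-point multiply of two Q0.31
    values, saturating the only case (a == b == INT32_MIN) that overflows.
    """
    if a == INT32_MIN and b == INT32_MIN:
        return INT32_MAX                                # saturate the overflow
    ab = a * b
    nudge = (1 << 30) if ab >= 0 else 1 - (1 << 30)     # round to nearest
    num = ab + nudge
    # Divide by 2**31, truncating toward zero as C++ integer division does.
    q = abs(num) >> 31
    return -q if num < 0 else q
```

In quantized conv/fc kernels of this style, the int32 accumulator is rescaled by the multiplier input_scale * weight_scale / output_scale, expressed as a Q0.31 fixed-point number and applied with a routine like this one followed by a rounding right shift.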