|
|
|
QDQBERT |
|
Overview |
|
The QDQBERT model can be referenced in Integer Quantization for Deep Learning Inference: Principles and Empirical |
|
Evaluation by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius |
|
Micikevicius. |
|
The abstract from the paper is the following: |
|
Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by |
|
taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of |
|
quantization parameters and evaluate their choices on a wide range of neural network models for different application |
|
domains, including vision, speech, and language. We focus on quantization techniques that are amenable to acceleration |
|
by processors with high-throughput integer math pipelines. We also present a workflow for 8-bit quantization that is |
|
able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are |
|
more difficult to quantize, such as MobileNets and BERT-large. |
|
This model was contributed by shangz. |
|
Usage tips |
|
|
|
QDQBERT model adds fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) to (i) linear layer
inputs and weights, (ii) matmul inputs, and (iii) residual add inputs in the BERT model.
|
QDQBERT requires the Pytorch Quantization Toolkit as a dependency. To install it: `pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com`
|
The QDQBERT model can be loaded from any checkpoint of a HuggingFace BERT model (for example google-bert/bert-base-uncased) and
used to perform Quantization Aware Training/Post Training Quantization; a loading sketch follows these tips.
|
A complete example of using the QDQBERT model to perform Quantization Aware Training and Post Training Quantization for

the SQuAD task can be found at transformers/examples/research_projects/quantization-qdqbert/.
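
As a minimal loading sketch (assuming a transformers version that still ships the QDQBERT classes and that pytorch-quantization is installed; the name filter at the end is illustrative), this also surfaces the fake-quantization modules inserted into the network:

```python
import pytorch_quantization.nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor
from transformers import QDQBertModel

# Default quantizer descriptors must be set before the model is created
# (see "Set default quantizers" below).
quant_nn.QuantLinear.set_default_quant_desc_input(QuantDescriptor(num_bits=8, calib_method="max"))
quant_nn.QuantLinear.set_default_quant_desc_weight(QuantDescriptor(num_bits=8, axis=(0,)))

# Initialize QDQBERT from a regular BERT checkpoint.
model = QDQBertModel.from_pretrained("google-bert/bert-base-uncased")

# List the TensorQuantizer modules that were inserted.
for name, module in model.named_modules():
    if name.endswith("_quantizer"):
        print(name)
```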
|
|
|
Set default quantizers |
|
QDQBERT adds fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) to BERT via

TensorQuantizer from the Pytorch Quantization Toolkit. TensorQuantizer is the module

for quantizing tensors, with QuantDescriptor defining how the tensor should be quantized. Refer to the Pytorch

Quantization Toolkit user guide for more details.

Before creating the QDQBERT model, one has to set the default QuantDescriptor defining the default tensor quantizers.
|
Example: |
|
```python
import pytorch_quantization.nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# The default tensor quantizer is set to use Max calibration method
input_desc = QuantDescriptor(num_bits=8, calib_method="max")
# The default tensor quantizer is set to be per-channel quantization for weights
weight_desc = QuantDescriptor(num_bits=8, axis=((0,)))
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
```
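
Here axis=((0,)) requests per-channel quantization along dimension 0 of the weight tensor, so each output channel gets its own scale, which typically preserves accuracy better than a single per-tensor scale. If max calibration proves too sensitive to activation outliers, a histogram-based calibrator is an alternative (a sketch; the option name follows the pytorch-quantization API as commonly documented, so treat it as an assumption):

```python
from pytorch_quantization.tensor_quant import QuantDescriptor

# Collect an activation histogram during calibration; amax is then derived
# when loading calibration results (e.g. percentile-based) instead of
# taking the raw maximum.
input_desc = QuantDescriptor(num_bits=8, calib_method="histogram")
```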
|
|
|
Calibration |
|
Calibration is the process of passing data samples to the quantizer and deciding the best scaling factors for

tensors. After setting up the tensor quantizers, one can use the following example to calibrate the model:
|
```python
# Find the TensorQuantizer and enable calibration
for name, module in model.named_modules():
    if name.endswith("_input_quantizer"):
        module.enable_calib()
        module.disable_quant()  # Use full precision data to calibrate

# Feed data samples
model(x)

# Finalize calibration
for name, module in model.named_modules():
    if name.endswith("_input_quantizer"):
        module.load_calib_amax()
        module.enable_quant()

# If running on GPU, call .cuda() again because the calibration process creates new tensors
model.cuda()

# Keep running the quantized model
```
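
In the snippet above, model(x) stands in for running representative data through the network. A minimal sketch of that feeding step with a hypothetical calib_dataloader (its name and batch format are assumptions):

```python
import torch

# Run representative batches while the calibrators collect statistics
# (quantization itself is disabled at this point).
with torch.no_grad():
    for i, batch in enumerate(calib_dataloader):  # calib_dataloader is hypothetical
        model(batch["input_ids"], batch["attention_mask"], batch["token_type_ids"])
        if i >= 100:  # a limited number of batches is typically enough
            break
```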
|
|
|
Export to ONNX |
|
The goal of exporting to ONNX is to deploy inference with TensorRT. Fake

quantization will be broken into a pair of QuantizeLinear/DequantizeLinear ONNX ops. After setting the static member of

TensorQuantizer to use Pytorch's own fake quantization functions, the fake quantized model can be exported to ONNX by following

the instructions in torch.onnx. Example:
|
```python
import torch
from pytorch_quantization.nn import TensorQuantizer

TensorQuantizer.use_fb_fake_quant = True

# Load the calibrated model
...

# ONNX export
torch.onnx.export(...)
```
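
For concreteness, a hedged sketch of what the export call might look like (the dummy input shapes, file name, and opset choice are assumptions, not part of the original recipe; opset 13 is chosen because ONNX added per-axis QuantizeLinear/DequantizeLinear support there):

```python
import torch

# Dummy inputs: batch size 1, sequence length 128 (assumed shapes);
# the model was moved to GPU by model.cuda() during calibration.
dummy = torch.ones((1, 128), dtype=torch.long, device="cuda")

torch.onnx.export(
    model,                  # the calibrated QDQBERT model
    (dummy, dummy, dummy),  # input_ids, attention_mask, token_type_ids
    "qdqbert.onnx",         # output file name (assumption)
    opset_version=13,       # per-axis QDQ ops require opset >= 13
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    do_constant_folding=True,
)
```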
|
|
|
Resources |
|
|
|
Text classification task guide |
|
Token classification task guide |
|
Question answering task guide |
|
Causal language modeling task guide |
|
Masked language modeling task guide |
|
Multiple choice task guide |
|
|
|
QDQBertConfig |
|
[[autodoc]] QDQBertConfig |
|
QDQBertModel |
|
[[autodoc]] QDQBertModel |
|
- forward |
|
QDQBertLMHeadModel |
|
[[autodoc]] QDQBertLMHeadModel |
|
- forward |
|
QDQBertForMaskedLM |
|
[[autodoc]] QDQBertForMaskedLM |
|
- forward |
|
QDQBertForSequenceClassification |
|
[[autodoc]] QDQBertForSequenceClassification |
|
- forward |
|
QDQBertForNextSentencePrediction |
|
[[autodoc]] QDQBertForNextSentencePrediction |
|
- forward |
|
QDQBertForMultipleChoice |
|
[[autodoc]] QDQBertForMultipleChoice |
|
- forward |
|
QDQBertForTokenClassification |
|
[[autodoc]] QDQBertForTokenClassification |
|
- forward |
|
QDQBertForQuestionAnswering |
|
[[autodoc]] QDQBertForQuestionAnswering |
|
- forward |