# QDQBERT

## Overview
The QDQBERT model can be referenced in [Integer Quantization for Deep Learning Inference: Principles and Empirical
Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius
Micikevicius.
The abstract from the paper is the following:
*Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by
taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of
quantization parameters and evaluate their choices on a wide range of neural network models for different application
domains, including vision, speech, and language. We focus on quantization techniques that are amenable to acceleration
by processors with high-throughput integer math pipelines. We also present a workflow for 8-bit quantization that is
able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are
more difficult to quantize, such as MobileNets and BERT-large.*
This model was contributed by shangz.
## Usage tips

- QDQBERT adds fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) to (i) linear layer
  inputs and weights, (ii) matmul inputs, and (iii) residual add inputs in the BERT model.
- QDQBERT requires the PyTorch Quantization Toolkit. To install it: `pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com`
- The QDQBERT model can be loaded from any checkpoint of a HuggingFace BERT model (for example google-bert/bert-base-uncased) and used to
  perform Quantization Aware Training or Post Training Quantization.
- A complete example of using QDQBERT to perform Quantization Aware Training and Post Training Quantization for
  the SQuAD task can be found at transformers/examples/research_projects/quantization-qdqbert/.

## Set default quantizers

QDQBERT adds fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) to BERT via
`TensorQuantizer` in the PyTorch Quantization Toolkit. `TensorQuantizer` is the module
for quantizing tensors, with `QuantDescriptor` defining how the tensor should be quantized. Refer to the PyTorch
Quantization Toolkit user guide for more details.

Before creating the QDQBERT model, one has to set the default `QuantDescriptor` defining the default tensor quantizers.

Example:
```python
import pytorch_quantization.nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# The default tensor quantizer is set to use the Max calibration method
input_desc = QuantDescriptor(num_bits=8, calib_method="max")
# The default tensor quantizer is set to be per-channel quantization for weights
weight_desc = QuantDescriptor(num_bits=8, axis=((0,)))
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
```

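With the default quantizers set, the QDQBERT model itself can be created from a regular BERT checkpoint, as the usage tips above describe. A minimal sketch, assuming a transformers version that still ships the QDQBert classes:

```python
from transformers import AutoTokenizer, QDQBertModel

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
# Loading BERT weights into QDQBertModel inserts the fake quantization ops
# configured by the default QuantDescriptors above
model = QDQBertModel.from_pretrained("google-bert/bert-base-uncased")
```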
## Calibration

Calibration is the process of passing data samples to the quantizer to determine the best scaling factors for the
tensors. After setting up the tensor quantizers, one can use the following example to calibrate the model:
```python
# Find the TensorQuantizer modules and enable calibration
for name, module in model.named_modules():
    if name.endswith("_input_quantizer"):
        module.enable_calib()
        module.disable_quant()  # Use full precision data to calibrate

# Feeding data samples
model(x)

# Finalize calibration
for name, module in model.named_modules():
    if name.endswith("_input_quantizer"):
        module.load_calib_amax()
        module.enable_quant()

# If running on GPU, it needs to call .cuda() again because new tensors will be created by the calibration process
model.cuda()

# Keep running the quantized model
```

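In the `model(x)` step above, the data samples are typically a few batches of tokenized text from the target dataset. A minimal sketch of that step, reusing the `tokenizer` from the earlier example and a hypothetical `calibration_texts` list:

```python
import torch

# Hypothetical list of representative input strings from the target dataset
calibration_texts = ["Hello, my dog is cute.", "Quantization reduces inference cost."]

with torch.no_grad():
    for text in calibration_texts:
        inputs = tokenizer(text, return_tensors="pt")
        model(**inputs)  # collects amax statistics while quantization is disabled
```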
## Export to ONNX

The goal of exporting to ONNX is to deploy inference with TensorRT. Fake
quantization is broken into a pair of QuantizeLinear/DequantizeLinear ONNX ops. After setting the static member of
`TensorQuantizer` to use PyTorch's own fake quantization functions, the fake-quantized model can be exported to ONNX by
following the instructions in torch.onnx. Example:
```python
from pytorch_quantization.nn import TensorQuantizer

TensorQuantizer.use_fb_fake_quant = True

# Load the calibrated model
...
# ONNX export
torch.onnx.export(...)
```

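Expanded into a runnable call, the export might look like the following sketch; the dummy input shapes, the `qdqbert.onnx` filename, and the `return_dict` tweak are assumptions, while the opset choice follows ONNX's requirement that per-channel QuantizeLinear/DequantizeLinear needs opset 13 or higher:

```python
import torch
from pytorch_quantization.nn import TensorQuantizer

TensorQuantizer.use_fb_fake_quant = True

model.eval()
model.config.return_dict = False  # export plain tuples rather than a ModelOutput dict

# Hypothetical dummy inputs matching BERT's (input_ids, attention_mask) signature
dummy_input_ids = torch.ones(1, 128, dtype=torch.long, device="cuda")
dummy_attention_mask = torch.ones(1, 128, dtype=torch.long, device="cuda")

torch.onnx.export(
    model,
    (dummy_input_ids, dummy_attention_mask),
    "qdqbert.onnx",  # assumed output filename
    input_names=["input_ids", "attention_mask"],
    opset_version=13,  # per-channel QuantizeLinear/DequantizeLinear requires opset >= 13
)
```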
## Resources

- Text classification task guide
- Token classification task guide
- Question answering task guide
- Causal language modeling task guide
- Masked language modeling task guide
- Multiple choice task guide

## QDQBertConfig

[[autodoc]] QDQBertConfig

## QDQBertModel

[[autodoc]] QDQBertModel
    - forward

## QDQBertLMHeadModel

[[autodoc]] QDQBertLMHeadModel
    - forward

## QDQBertForMaskedLM

[[autodoc]] QDQBertForMaskedLM
    - forward

## QDQBertForSequenceClassification

[[autodoc]] QDQBertForSequenceClassification
    - forward

## QDQBertForNextSentencePrediction

[[autodoc]] QDQBertForNextSentencePrediction
    - forward

## QDQBertForMultipleChoice

[[autodoc]] QDQBertForMultipleChoice
    - forward

## QDQBertForTokenClassification

[[autodoc]] QDQBertForTokenClassification
    - forward

## QDQBertForQuestionAnswering

[[autodoc]] QDQBertForQuestionAnswering
    - forward