---
tags:
- int8
- vllm
- llm-compressor
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
pipeline_tag: text-generation
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B
---

# Qwen2.5-0.5B-quantized.w8a16

## Model Overview
- **Model Architecture:** Qwen2
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
- **Intended Use Cases:** Similar to [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B), this is a base language model.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 10/09/2024
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B).
It achieves a score of 43.9 on the OpenLLM v1 benchmark, compared to 44.0 for [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B).

### Model Optimizations

This model was obtained by quantizing the weights of [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) to the INT8 data type.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.

Only the weights of the linear operators within transformer blocks are quantized.
Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps between the INT8 and floating-point representations of the quantized weights.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
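
Concretely, for a weight matrix of shape `(out_features, in_features)`, each output channel stores one scale derived from its largest-magnitude weight. The snippet below is a minimal round-to-nearest sketch of this storage format only; GPTQ itself additionally uses calibration data and second-order information to choose the roundings, but it emits tensors in the same symmetric per-channel INT8 layout.

```python
import numpy as np

def quantize_per_channel_symmetric(w: np.ndarray):
    # One scale per output channel (row), chosen so the INT8 range
    # [-127, 127] exactly covers the channel's largest-magnitude weight.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # The same per-channel linear scaling maps INT8 back to floating point.
    return q.astype(np.float32) * scale

w = np.random.randn(64, 128).astype(np.float32)  # (out_features, in_features)
q, scale = quantize_per_channel_symmetric(w)
print(np.abs(w - dequantize(q, scale)).max())  # error is at most ~scale/2 per channel
```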
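
A quantization run of this kind can be scripted with llm-compressor's one-shot flow. The sketch below is illustrative rather than the exact recipe used for this checkpoint: the `W8A16` scheme, `Linear` targets, and `lm_head` exclusion match the description above, but the calibration dataset and sample counts are placeholder assumptions.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# GPTQ over the linear layers in the transformer blocks: weights go to INT8,
# activations stay at 16 bits (W8A16), and the lm_head is left unquantized.
recipe = GPTQModifier(targets="Linear", scheme="W8A16", ignore=["lm_head"])

oneshot(
    model="Qwen/Qwen2.5-0.5B",
    dataset="open_platypus",       # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,           # placeholder calibration settings
    num_calibration_samples=512,
    output_dir="Qwen2.5-0.5B-quantized.w8a16",
)
```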

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Qwen2.5-0.5B-quantized.w8a16"
number_gpus = 1
max_model_len = 8192

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

prompt = "Give me a short introduction to large language models."

# Load the quantized checkpoint; quantization settings are picked up from the model config.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
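
For instance, the model can be exposed over HTTP with `vllm serve` and queried with the OpenAI Python client. The sketch below assumes a local server on the default port 8000 and, since this is a base model, uses the completions endpoint rather than the chat endpoint.

```python
# Start the server first, e.g.:
#   vllm serve neuralmagic/Qwen2.5-0.5B-quantized.w8a16
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="neuralmagic/Qwen2.5-0.5B-quantized.w8a16",
    prompt="Give me a short introduction to large language models.",
    max_tokens=256,
    temperature=0.7,
)
print(completion.choices[0].text)
```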

## Evaluation

The model was evaluated on the OpenLLM v1 benchmark, composed of MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA.
Evaluation was conducted using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.

### Accuracy

<table>
 <tr>
  <td><strong>Category</strong></td>
  <td><strong>Benchmark</strong></td>
  <td><strong>Qwen2.5-0.5B</strong></td>
  <td><strong>Qwen2.5-0.5B-quantized.w8a16<br>(this model)</strong></td>
  <td><strong>Recovery</strong></td>
 </tr>
 <tr>
  <td rowspan="7"><strong>OpenLLM v1</strong></td>
  <td>MMLU (5-shot)</td>
  <td>47.57</td>
  <td>47.81</td>
  <td>100.5%</td>
 </tr>
 <tr>
  <td>ARC Challenge (25-shot)</td>
  <td>34.90</td>
  <td>34.90</td>
  <td>100.0%</td>
 </tr>
 <tr>
  <td>GSM-8k (5-shot, strict-match)</td>
  <td>34.19</td>
  <td>33.51</td>
  <td>98.0%</td>
 </tr>
 <tr>
  <td>Hellaswag (10-shot)</td>
  <td>51.83</td>
  <td>51.78</td>
  <td>99.9%</td>
 </tr>
 <tr>
  <td>Winogrande (5-shot)</td>
  <td>55.80</td>
  <td>55.49</td>
  <td>99.4%</td>
 </tr>
 <tr>
  <td>TruthfulQA (0-shot, mc2)</td>
  <td>39.90</td>
  <td>39.71</td>
  <td>99.5%</td>
 </tr>
 <tr>
  <td><strong>Average</strong></td>
  <td><strong>44.0</strong></td>
  <td><strong>43.9</strong></td>
  <td><strong>99.6%</strong></td>
 </tr>
</table>

### Reproduction

The results were obtained using the following command:

```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Qwen2.5-0.5B-quantized.w8a16",dtype=auto,max_model_len=4096,add_bos_token=True,tensor_parallel_size=1 \
  --tasks openllm \
  --batch_size auto
```