---
library_name: vllm
language:
- ar
- de
- en
- es
- fr
- hi
- id
- it
- pt
- th
- tl
- vi
base_model:
- meta-llama/Llama-4-Scout-17B-16E-Instruct
pipeline_tag: image-text-to-text
tags:
- facebook
- meta
- pytorch
- llama
- llama4
- neuralmagic
- redhat
- llmcompressor
- quantized
- FP8
license: other
license_name: llama4
---
 
# Llama-4-Scout-17B-16E-Instruct-FP8-dynamic

## Model Overview
- **Model Architecture:** Llama4ForConditionalGeneration
  - **Input:** Text / Image
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Release Date:** 04/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the activations and weights of [Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x.
Weight quantization also reduces disk size requirements by approximately 50%. Quantization was performed with the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
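
As a rough back-of-the-envelope illustration of the memory claim (a sketch only, assuming the commonly cited ~109B total parameters for Llama-4-Scout-17B-16E; real usage also depends on the KV cache and activation memory):

```python
# Illustrative weight-storage estimate only; the ~109B total parameter count
# is an assumption, and KV cache / activation memory are ignored.
total_params = 109e9

bf16_gb = total_params * 2 / 1e9  # 16-bit weights: 2 bytes per parameter
fp8_gb = total_params * 1 / 1e9   # 8-bit weights: 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")  # ~218 GB
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")   # ~109 GB
```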

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the request with the chat template, since this is an instruction-tuned model.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving; see the [documentation](https://docs.vllm.ai/en/latest/) for more details.
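
As a minimal sketch of that serving path (the port, API key, and tensor-parallel size below are placeholder assumptions; adjust them to your deployment), the model can be served with:

```
vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic --tensor-parallel-size 4
```

and then queried with any OpenAI-compatible client, e.g.:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on localhost:8000 by default;
# the API key is a placeholder since no auth is configured in this sketch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```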

## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
#!/usr/bin/env python3
"""
This script loads an LLM and applies FP8 quantization to its weights and
activations. Activations are quantized dynamically, i.e. at actual runtime.
"""

import argparse

from transformers import Llama4ForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor import oneshot
from compressed_tensors.quantization import (
    QuantizationScheme,
    QuantizationArgs,
    QuantizationType,
    QuantizationStrategy,
)


def parse_arguments():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description="Quantize a causal language model")
    parser.add_argument(
        "--model_path",
        type=str,
        required=True,
        help="Path to the pre-trained model",
    )
    parser.add_argument(
        "--quant_path",
        type=str,
        required=True,
        help="Output path for the quantized model",
    )
    return parser.parse_args()


def main():
    """Load and quantize the model."""
    args = parse_arguments()

    print(f"Loading model from {args.model_path}...")
    model = Llama4ForConditionalGeneration.from_pretrained(
        args.model_path,
        device_map="auto",
        torch_dtype="auto",
        trust_remote_code=True,
    )

    # FP8 scheme: static per-channel weight quantization plus dynamic
    # per-token activation quantization, targeting all Linear layers.
    quant_scheme = QuantizationScheme(
        targets=["Linear"],
        weights=QuantizationArgs(
            num_bits=8,
            type=QuantizationType.FLOAT,
            strategy=QuantizationStrategy.CHANNEL,
            symmetric=True,
            observer="mse",
        ),
        input_activations=QuantizationArgs(
            num_bits=8,
            type=QuantizationType.FLOAT,
            strategy=QuantizationStrategy.TOKEN,
            symmetric=True,
            dynamic=True,
        ),
        output_activations=None,
    )

    # Keep the LM head, attention, MoE routers, and vision components
    # in the original precision.
    recipe = QuantizationModifier(
        targets="Linear",
        config_groups={"group_0": quant_scheme},
        ignore=[
            're:.*lm_head',
            're:.*self_attn',
            're:.*router',
            're:.*vision_model',
            're:.*multi_modal_projector',
        ],
    )

    print("Applying quantization...")
    oneshot(
        model=model,
        recipe=recipe,
        trust_remote_code_model=True,
    )

    model.save_pretrained(
        args.quant_path,
        save_compressed=True,
        skip_compression_stats=True,
        disable_sparse_compression=True,
    )
    print(f"Quantized model saved to {args.quant_path}")


if __name__ == "__main__":
    main()
```
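
The script can then be invoked along these lines (the script filename is hypothetical; `--model_path` takes a local path or Hugging Face model id):

```
python quantize_fp8_dynamic.py \
  --model_path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quant_path ./Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
```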

</details>

## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), long-context RULER, and the multimodal MMMU and ChartQA benchmarks.
All evaluations were run with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

<details>
<summary>Evaluation details</summary>

**OpenLLM v1**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto
```

**OpenLLM v2**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=16384,tensor_parallel_size=8,gpu_memory_utilization=0.5,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Long-context RULER**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=524288,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks ruler \
  --metadata='{"max_seq_lengths":[131072]}' \
  --batch_size auto
```

**Multimodal MMMU**
```
lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks mmmu_val \
  --apply_chat_template \
  --batch_size auto
```

**Multimodal ChartQA**
```
export VLLM_MM_INPUT_CACHE_GIB=8
lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks chartqa \
  --apply_chat_template \
  --batch_size auto
```

</details>

### Accuracy

Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score; for example, on GSM8k it is 89.76 / 90.45 × 100 ≈ 99.24%.

| Benchmark | Recovery (%) | meta-llama/Llama-4-Scout-17B-16E-Instruct | RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic<br>(this model) |
| ---------------------------------------------- | :-----------: | :---------------------------------------: | :-----------------------------------------------------------------: |
| ARC-Challenge<br>25-shot | 100.36 | 69.37 | 69.62 |
| GSM8k<br>5-shot | 99.24 | 90.45 | 89.76 |
| HellaSwag<br>10-shot | 99.94 | 85.23 | 85.18 |
| MMLU<br>5-shot | 99.94 | 80.54 | 80.49 |
| TruthfulQA<br>0-shot | 99.17 | 61.41 | 60.90 |
| WinoGrande<br>5-shot | 98.88 | 77.90 | 77.03 |
| **OpenLLM v1<br>Average Score** | **99.59** | **77.48** | **77.16** |
| IFEval<br>0-shot<br>avg of inst and prompt acc | 100.91 | 86.90 | 87.69 |
| Big Bench Hard<br>3-shot | 99.82 | 65.13 | 65.01 |
| Math Lvl 5<br>4-shot | 98.82 | 57.78 | 57.10 |
| GPQA<br>0-shot | 100.53 | 31.88 | 32.05 |
| MuSR<br>0-shot | 102.18 | 42.20 | 43.12 |
| MMLU-Pro<br>5-shot | 99.82 | 55.70 | 55.60 |
| **OpenLLM v2<br>Average Score** | **100.28** | **56.60** | **56.76** |
| RULER<br>seqlen = 131072<br>niah_multikey_1 | 101.36 | 88.20 | 89.40 |
| RULER<br>seqlen = 131072<br>niah_multikey_2 | 100.72 | 83.60 | 84.20 |
| RULER<br>seqlen = 131072<br>niah_multikey_3 | 96.19 | 78.80 | 75.80 |
| RULER<br>seqlen = 131072<br>niah_multiquery | 100.79 | 95.40 | 96.15 |
| RULER<br>seqlen = 131072<br>niah_multivalue | 97.22 | 73.75 | 71.70 |
| RULER<br>seqlen = 131072<br>niah_single_1 | 100.00 | 100.00 | 100.00 |
| RULER<br>seqlen = 131072<br>niah_single_2 | 100.00 | 99.80 | 99.80 |
| RULER<br>seqlen = 131072<br>niah_single_3 | 100.00 | 99.80 | 99.80 |
| RULER<br>seqlen = 131072<br>ruler_cwe | 96.19 | 39.42 | 37.92 |
| RULER<br>seqlen = 131072<br>ruler_fwe | 98.86 | 92.93 | 91.87 |
| RULER<br>seqlen = 131072<br>ruler_qa_hotpot | 100.00 | 48.20 | 48.20 |
| RULER<br>seqlen = 131072<br>ruler_qa_squad | 98.81 | 53.57 | 52.93 |
| RULER<br>seqlen = 131072<br>ruler_qa_vt | 100.35 | 92.28 | 92.60 |
| **RULER<br>seqlen = 131072<br>Average Score** | **99.49** | **80.44** | **80.03** |
| MMMU<br>0-shot | 97.92 | 53.44 | 52.33 |
| ChartQA<br>0-shot<br>exact_match | 100.12 | 65.88 | 65.96 |
| ChartQA<br>0-shot<br>relaxed_accuracy | 99.69 | 88.92 | 88.64 |
| **Multimodal Average Score** | **99.38** | **69.41** | **68.98** |
286