DeepSeek-R1-quantized.w4a16

Model Overview

Model Architecture: DeepseekV3ForCausalLM
- Input: Text
- Output: Text
Model Optimizations:
- Activation quantization: None
- Weight quantization: INT4
Release Date: 04/15/2025
Version: 1.0
Model Developers: Red Hat (Neural Magic)

Model Optimizations

This model was obtained by quantizing weights of DeepSeek-R1 to INT4 data type. This optimization reduces the number of bits used to represent weights from 8 to 4, reducing GPU memory requirements (by approximately 50%). Weight quantization also reduces disk size requirements by approximately 50%.

Deployment

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.

Evaluation

The model was evaluated on the OpenLLM leaderboard task (v1) via lm-evaluation-harness, and on popular reasoning tasks (AIME 2024, MATH-500, GPQA-Diamond) via LightEval. For reasoning evaluations, we estimate pass@1 based on 10 runs with different seeds.

Evaluation details

OpenLLM v1

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto

Reasoning Benchmarks

export MODEL_ARGS="pretrained=RedHatAI/DeepSeek-R1-quantized.w4a16,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":42}"
export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

Accuracy

	Recovery (%)	deepseek/DeepSeek-R1	RedHatAI/DeepSeek-R1-quantized.w4a16 (this model)
ARC-Challenge 25-shot	100.00	72.53	72.53
GSM8k 5-shot	99.76	95.91	95.68
HellaSwag 10-shot	100.07	89.30	89.36
MMLU 5-shot	99.74	87.22	86.99
TruthfulQA 0-shot	100.83	59.28	59.77
WinoGrande 5-shot	101.65	82.00	83.35
OpenLLM v1 Average Score	100.30	81.04	81.28
AIME 2024 pass@1	98.30	78.33	77.00
MATH-500 pass@1	99.84	97.24	97.08
GPQA Diamond pass@1	98.01	73.38	71.92
Reasoning Average Score	98.81	82.99	82.00