DeepSeek-R1-quantized.w4a16

Model Overview

  • Model Architecture: DeepseekV3ForCausalLM
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Activation quantization: None
    • Weight quantization: INT4
  • Release Date: 04/15/2025
  • Version: 1.0
  • Model Developers: Red Hat (Neural Magic)

Model Optimizations

This model was obtained by quantizing the weights of DeepSeek-R1 to the INT4 data type. This optimization reduces the number of bits used to represent each weight from 8 to 4, cutting both GPU memory and disk size requirements by approximately 50%.
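
As a rough illustration of the savings, the sketch below estimates weight memory at 8-bit versus 4-bit precision, assuming DeepSeek-R1's published total of roughly 671B parameters; it ignores activations, KV cache, quantization scales, and any layers kept at higher precision.

# Back-of-the-envelope weight-memory estimate (assumption: ~671B total
# parameters; ignores activations, KV cache, and quantization scales).
total_params = 671e9

fp8_gb = total_params * 8 / 8 / 1e9    # 8 bits per weight -> ~671 GB
int4_gb = total_params * 4 / 8 / 1e9   # 4 bits per weight -> ~336 GB

print(f"8-bit weights: ~{fp8_gb:.0f} GB")
print(f"4-bit weights: ~{int4_gb:.0f} GB ({1 - int4_gb / fp8_gb:.0%} smaller)")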

Deployment

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Build the prompt with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
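
As a minimal sketch, the server can be started with vLLM's built-in serve command and queried with the standard OpenAI Python client; the port, the "EMPTY" api_key placeholder, and the prompt below are illustrative defaults, not part of this model card.

# Assumes a server started with the standard vLLM command, e.g.:
#   vllm serve RedHatAI/DeepSeek-R1-quantized.w4a16 --tensor-parallel-size 8
from openai import OpenAI

# vLLM serves an OpenAI-compatible API on port 8000 by default;
# the api_key is a placeholder since no key is required locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/DeepSeek-R1-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)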

Evaluation

The model was evaluated on the OpenLLM Leaderboard tasks (v1) via lm-evaluation-harness, and on popular reasoning tasks (AIME 2024, MATH-500, GPQA-Diamond) via LightEval. For the reasoning evaluations, we estimate pass@1 based on 10 runs with different seeds.
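
A minimal sketch of that estimator: pass@1 is the mean single-attempt accuracy over the seeded runs, optionally with a standard error to gauge run-to-run spread. The per-run scores below are hypothetical placeholders, not results from this evaluation.

from statistics import mean, stdev

def estimate_pass_at_1(run_accuracies):
    # Mean single-attempt accuracy over independent seeded runs,
    # plus the standard error of that mean.
    m = mean(run_accuracies)
    se = stdev(run_accuracies) / len(run_accuracies) ** 0.5
    return m, se

# Ten hypothetical per-run accuracies (one per seed):
runs = [0.767, 0.800, 0.733, 0.767, 0.800, 0.767, 0.733, 0.800, 0.767, 0.767]
print(estimate_pass_at_1(runs))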

Evaluation details

OpenLLM v1

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto

Reasoning Benchmarks

export MODEL_ARGS="pretrained=RedHatAI/DeepSeek-R1-quantized.w4a16,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":42}"
export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

Accuracy

| Benchmark | Recovery (%) | deepseek/DeepSeek-R1 | RedHatAI/DeepSeek-R1-quantized.w4a16 (this model) |
|---|---|---|---|
| ARC-Challenge (25-shot) | 100.00 | 72.53 | 72.53 |
| GSM8k (5-shot) | 99.76 | 95.91 | 95.68 |
| HellaSwag (10-shot) | 100.07 | 89.30 | 89.36 |
| MMLU (5-shot) | 99.74 | 87.22 | 86.99 |
| TruthfulQA (0-shot) | 100.83 | 59.28 | 59.77 |
| WinoGrande (5-shot) | 101.65 | 82.00 | 83.35 |
| OpenLLM v1 Average | 100.30 | 81.04 | 81.28 |
| AIME 2024 (pass@1) | 98.30 | 78.33 | 77.00 |
| MATH-500 (pass@1) | 99.84 | 97.24 | 97.08 |
| GPQA Diamond (pass@1) | 98.01 | 73.38 | 71.92 |
| Reasoning Average | 98.81 | 82.99 | 82.00 |
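
The Recovery column is the quantized model's score expressed as a percentage of the baseline score, as in the sketch below (checked here against the GSM8k row of the table).

def recovery(quantized_score, baseline_score):
    # Quantized score as a percentage of the unquantized baseline.
    return 100 * quantized_score / baseline_score

print(f"{recovery(95.68, 95.91):.2f}%")  # GSM8k row -> 99.76%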