---
license: apache-2.0
datasets:
- Congliu/Chinese-DeepSeek-R1-Distill-data-110k
- cognitivecomputations/dolphin-r1
- a-m-team/AM-DeepSeek-R1-0528-Distilled
language:
- zh
- en
base_model:
- Zhihu-ai/Zhi-Create-Qwen3-32B
tags:
- qwen3
- eagle3
library_name: transformers
---


# Zhi-Create-Qwen3-32B-Eagle3

This is a speculator (draft) model for [Zhihu-ai/Zhi-Create-Qwen3-32B](https://huggingface.co/Zhihu-ai/Zhi-Create-Qwen3-32B), based on the [EAGLE-3](https://arxiv.org/abs/2503.01840) speculative decoding algorithm.
It was trained with the [SpecForge](https://github.com/sgl-project/SpecForge/) library on a subset of the supervised fine-tuning (SFT) data used for Zhihu-ai/Zhi-Create-Qwen3-32B, covering both thinking and non-thinking modes.

You can easily start a service using [SGLang](https://github.com/sgl-project/sglang).



```bash
pip install "sglang[all]>=0.4.9"

python3 -m sglang.launch_server \
    --model Zhihu-ai/Zhi-Create-Qwen3-32B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path Zhihu-ai/Zhi-Create-Qwen3-32B-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 2 \
    --speculative-num-draft-tokens 8 \
    --tp 2 \
    --port 8000 \
    --dtype bfloat16 \
    --reasoning-parser deepseek-r1 \
    --served-model-name Zhi-Create-Qwen3-32B

# send request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zhi-Create-Qwen3-32B",
        "prompt": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章",
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95
    }'
```

```python
# Alternative: query the same server through the OpenAI-compatible API
from openai import OpenAI

openai_api_key = "empty"
openai_api_base = "http://127.0.0.1:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base
)

def get_answer(messages):
    response = client.chat.completions.create(
        messages=messages,
        model="Zhi-Create-Qwen3-32B",
        max_tokens=4096,
        temperature=0.3,
        top_p=0.95,
        stream=True,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}}  # set False for non-thinking mode
    )
    answer = ""
    reasoning_content_all = ""
    for each in response:
        # `content` carries answer tokens; `reasoning_content` carries thinking tokens.
        each_content = getattr(each.choices[0].delta, "content", None)
        reasoning_content = getattr(each.choices[0].delta, "reasoning_content", None)
        if each_content is not None:
            answer += each_content
            print(each_content, end="", flush=True)
        if reasoning_content is not None:
            reasoning_content_all += reasoning_content
            print(reasoning_content, end="", flush=True)
    return answer, reasoning_content_all

# "Please write an article introducing West Lake vinegar fish in the voice of Lu Xun."
prompt = "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章"
messages = [
    {"role": "user", "content": prompt}
]

answer, reasoning_content_all = get_answer(messages)
```
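
Speculative decoding mainly shows up as higher generation throughput. Below is a minimal sketch for checking tokens per second against the server launched above; the `measure` helper, endpoint, and sampling values are illustrative assumptions, not part of SGLang's API.

```python
import time


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Completion tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return completion_tokens / elapsed_s


def measure(
    prompt: str = "Introduce West Lake vinegar fish in one sentence.",
    base_url: str = "http://127.0.0.1:8000/v1",
) -> float:
    """Send one request to the running server and return a rough tok/s figure."""
    from openai import OpenAI  # lazy import: only needed when a server is up

    client = OpenAI(api_key="empty", base_url=base_url)
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="Zhi-Create-Qwen3-32B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.6,
    )
    elapsed = time.perf_counter() - start
    return tokens_per_second(resp.usage.completion_tokens, elapsed)
```

With the server running, `measure()` returns a rough tokens-per-second figure; comparing runs with and without the `--speculative-*` flags gives a quick sense of the EAGLE-3 speedup.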