Transformers
Safetensors
Chinese
English
llama
qwen3
eagle3
text-generation-inference
Parkerlambert123 committed · verified
Commit 8bed2d6 · 1 Parent(s): ef79d95

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ images/writingbench_score.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,261 @@
- ---
- license: apache-2.0
- ---

---
license: apache-2.0
datasets:
- Congliu/Chinese-DeepSeek-R1-Distill-data-110k
- cognitivecomputations/dolphin-r1
- a-m-team/AM-DeepSeek-R1-0528-Distilled
language:
- zh
- en
base_model:
- Qwen/Qwen3-32B
tags:
- qwen3
library_name: transformers
---

# Zhi-Create-Qwen3-32B-Eagle3

This is a speculator model designed for use with [Zhihu-ai/Zhi-Create-Qwen3-32B](https://huggingface.co/Zhihu-ai/Zhi-Create-Qwen3-32B), based on the [EAGLE-3](https://arxiv.org/abs/2503.01840) speculative decoding algorithm.
It was trained with the [SpecForge](https://github.com/sgl-project/SpecForge/) library on a subset of the supervised fine-tuning (SFT) data used to train Zhihu-ai/Zhi-Create-Qwen3-32B, covering both thinking and non-thinking modes.
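
To use the speculator at inference time, a serving engine loads it as a draft model alongside the target model. The sketch below uses SGLang's EAGLE-3 speculative decoding support; the flag names and tuning values reflect SGLang's speculative decoding options as we understand them, and the draft-model repo id is assumed from this card's name, so verify both against your installed version.

```bash
# Minimal sketch: serve the target model with this repo as the EAGLE-3 draft model.
# Flag names, tuning values, and the draft repo id are assumptions to double-check
# (see `python -m sglang.launch_server --help`).
python -m sglang.launch_server \
    --model-path Zhihu-ai/Zhi-Create-Qwen3-32B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path Zhihu-ai/Zhi-Create-Qwen3-32B-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 4 \
    --speculative-num-draft-tokens 16 \
    --served-model-name Zhi-Create-Qwen3-32B \
    --port 8000
```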

# Zhi-Create-Qwen3-32B

## 1. Introduction

Zhi-Create-Qwen3-32B is a fine-tuned model derived from [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B), with a focus on enhancing creative writing capabilities. Through careful optimization, the model shows promising improvements in creative writing as measured by [WritingBench](https://github.com/X-PLUG/WritingBench): it attains a score of **82.08**, a significant improvement over the base Qwen3-32B model's score of **78.97**.

Additionally, to maintain general capabilities such as knowledge and reasoning, we performed fine-grained data-mixture experiments combining general knowledge, mathematics, code, and other data types. The final evaluation shows that general capabilities remain stable, with no significant decline compared to the base model.

## 2. Training Process

### Data

The model's training corpus comprises three primary data sources: rigorously filtered open-source datasets, synthesized chain-of-thought reasoning corpora, and curated question-answer pairs from Zhihu.

To achieve optimal domain coverage, we carefully balanced the distribution of these datasets through data-mixture optimization experiments. The datasets include [Dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1), [Congliu/Chinese-DeepSeek-R1-Distill-data-110k](https://huggingface.co/datasets/Congliu/Chinese-DeepSeek-R1-Distill-data-110k), and [a-m-team/AM-DeepSeek-R1-0528-Distilled](https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-0528-Distilled), alongside high-quality content from Zhihu. All datasets underwent quality assurance through our Reward Model (RM) filtering pipeline. To preserve the model's foundational knowledge and reasoning capabilities, creative writing data accounts for approximately 23% of the training data, with the remainder consisting of mathematics, code, and general-knowledge data. The chain-of-thought (CoT) reasoning components of the training data were synthesized using [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) and similar models.

The detailed data distribution is shown in the figure below:

![data-distribution](./images/data_distribution.png)

<figcaption style="text-align:center; font-size:0.9em; color:#666">
Figure 1: Training data distribution showing the composition of different data sources, with creative writing data accounting for approximately 23% of the total training corpus, alongside mathematics, code, and general knowledge data.
</figcaption>

### Training

**Supervised Fine-tuning (SFT)**: We employed a curriculum learning strategy for supervised fine-tuning. This approach systematically enhances creative writing capabilities while incorporating diverse domain data to maintain core competencies and mitigate catastrophic forgetting. Using a multi-stage progressive iteration scheme, we selected samples that were insufficiently trained in previous rounds and categorized samples by reasoning complexity and context length, gradually increasing the difficulty of training samples to improve model performance step by step.

**Direct Preference Optimization (DPO)**: We integrated the RAFT (Reward-Ranked Fine-Tuning) method, combining rule-based systems and LLM-as-judge approaches to identify correct and incorrect samples. This enables the construction of DPO preference pairs that address issues such as Chinese-English code-mixing and undesirable repetition, while simultaneously improving the model's reasoning capabilities.

## 3. Evaluation Results

We evaluated the model with WritingBench, a comprehensive framework for assessing the writing capabilities of large language models. Zhi-Create-Qwen3-32B achieves a score of 82.08 (with Claude 3.7 Sonnet as the judge), a substantial improvement over the base Qwen3-32B model's score of 78.97.

The performance comparison across six domains is presented in the figure below:

![writingbench](./images/writingbench_score.png)

<figcaption style="text-align:center; font-size:0.9em; color:#666">
Figure 2: WritingBench performance comparison between Zhi-Create-Qwen3-32B and Qwen3-32B across six domains, evaluated with Claude 3.7 Sonnet as the judge model. The domains are: (D1) Academic & Engineering, (D2) Finance & Business, (D3) Politics & Law, (D4) Literature & Art, (D5) Education, and (D6) Advertising & Marketing.
</figcaption>

## 4. How to Run Locally

Zhi-Create-Qwen3-32B can be deployed on a single 80 GB GPU such as an H20, A800, or H800. For more accessible deployment, we offer quantized versions: the FP8 model (Zhi-Create-Qwen3-32B-FP8) runs on a dual RTX 4090 setup, while the Q4_K_M quantized version can be deployed on a single RTX 4090.
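
As one illustration of the dual RTX 4090 path, the FP8 checkpoint can be sharded across both GPUs with vLLM's tensor parallelism. This is a sketch under assumptions: the repo id Zhihu-ai/Zhi-Create-Qwen3-32B-FP8 is inferred from the model name above, and the context length is capped only to stay conservative on memory.

```bash
# Hypothetical sketch: serve the FP8 checkpoint across two RTX 4090s with vLLM.
# The repo id is assumed; adjust --max-model-len to your memory budget.
vllm serve Zhihu-ai/Zhi-Create-Qwen3-32B-FP8 \
    --served-model-name Zhi-Create-Qwen3-32B-FP8 \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --port 8000
```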
70
+
71
+ ### Transformers
72
+
73
+ ```python
74
+ from transformers import AutoModelForCausalLM, AutoTokenizer
75
+ from transformers.generation import GenerationConfig
76
+
77
+ MODEL_NAME = "Zhihu-ai/Zhi-Create-Qwen3-32B"
78
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
79
+
80
+ # use bf16
81
+ # model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto", trust_remote_code=True, bf16=True).eval()
82
+ # use fp16
83
+ # model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto", trust_remote_code=True, fp16=True).eval()
84
+ # use cpu only
85
+ # model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="cpu", trust_remote_code=True).eval()
86
+ # use auto mode, automatically select precision based on the device.
87
+ model = AutoModelForCausalLM.from_pretrained(
88
+ MODEL_NAME,
89
+ device_map="auto",
90
+ trust_remote_code=True
91
+ ).eval()
92
+
93
+ # Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
94
+ # model.generation_config = GenerationConfig.from_pretrained(MODEL_NAME, trust_remote_code=True)
95
+
96
+ generate_configs = {
97
+ "temperature": 0.6,
98
+ "do_sample": True,
99
+ "top_p": 0.95,
100
+ "max_new_tokens": 4096
101
+ }
102
+
103
+ prompt = "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章"
104
+ messages = [
105
+ {"role": "user", "content": prompt}
106
+ ]
107
+ text = tokenizer.apply_chat_template(
108
+ messages,
109
+ tokenize=False,
110
+ add_generation_prompt=True
111
+ )
112
+
113
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
114
+
115
+ generated_ids = model.generate(
116
+ **model_inputs,
117
+ **generate_configs
118
+ )
119
+ generated_ids = [
120
+ output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
121
+ ]
122
+
123
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
124
+ print(response)
125
+ ```

### vLLM

For instance, you can easily start a service using [vLLM](https://github.com/vllm-project/vllm):

```bash
# install vllm
pip install "vllm>=0.6.4.post1"

# serve the Hugging Face model id
vllm serve Zhihu-ai/Zhi-Create-Qwen3-32B --served-model-name Zhi-Create-Qwen3-32B --port 8000

# or serve a local path
vllm serve /path/to/model --served-model-name Zhi-Create-Qwen3-32B --port 8000

# send a request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zhi-Create-Qwen3-32B",
        "prompt": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章",
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95
    }'
```

### SGLang

You can also easily start a service using [SGLang](https://github.com/sgl-project/sglang):

```bash
# install SGLang
pip install "sglang[all]>=0.4.5" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python

# serve the Hugging Face model id
python -m sglang.launch_server --model-path Zhihu-ai/Zhi-Create-Qwen3-32B --served-model-name Zhi-Create-Qwen3-32B --port 8000

# or serve a local path
python -m sglang.launch_server --model-path /path/to/model --served-model-name Zhi-Create-Qwen3-32B --port 8000

# send a request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zhi-Create-Qwen3-32B",
        "prompt": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章",
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95
    }'
```

Alternatively, query the server with the OpenAI-compatible Python client:

```python
from openai import OpenAI

openai_api_key = "empty"
openai_api_base = "http://127.0.0.1:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base
)

def get_answer(messages):
    response = client.chat.completions.create(
        messages=messages,
        model="Zhi-Create-Qwen3-32B",
        max_tokens=4096,
        temperature=0.3,
        top_p=0.95,
        stream=True,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}}
    )
    answer = ""
    reasoning_content_all = ""
    for each in response:
        if hasattr(each.choices[0].delta, "content"):
            each_content = each.choices[0].delta.content
        else:
            each_content = None
        if hasattr(each.choices[0].delta, "reasoning_content"):
            reasoning_content = each.choices[0].delta.reasoning_content
        else:
            reasoning_content = None
        if each_content is not None:
            answer += each_content
            print(each_content, end="", flush=True)
        if reasoning_content is not None:
            reasoning_content_all += reasoning_content
            print(reasoning_content, end="", flush=True)
    return answer, reasoning_content_all

prompt = "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章"
messages = [
    {"role": "user", "content": prompt}
]

answer, reasoning_content_all = get_answer(messages)
```
225
+
226
+ ### ollama
227
+
228
+ You can download ollama using [this](https://ollama.com/download/)
229
+
230
+ * quantization: Q4_K_M
231
+
232
+ ```bash
233
+ ollama run zhihu/zhi-create-qwen3-32b
234
+ ```
235
+
236
+ * bf16
237
+
238
+ ```bash
239
+ ollama run zhihu/zhi-create-qwen3-32b:bf16
240
+ ```
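
If you prefer calling the local Ollama server over HTTP rather than the interactive CLI, here is a minimal sketch against Ollama's REST API (default port 11434; the model tag matches the `ollama run` commands above):

```bash
# Chat with the locally pulled model through Ollama's HTTP API.
curl http://localhost:11434/api/chat -d '{
    "model": "zhihu/zhi-create-qwen3-32b",
    "messages": [{"role": "user", "content": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章"}],
    "stream": false,
    "options": {"temperature": 0.6, "top_p": 0.95}
}'
```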

## 5. Usage Recommendations

For optimal performance, we recommend setting the temperature between 0.5 and 0.7 (0.6 recommended) and top-p to 0.95 for a good balance between creativity and coherence.
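
For example, these settings map directly onto the OpenAI-compatible server started in Section 4 (host, port, and served model name assumed from the earlier vLLM/SGLang examples):

```bash
# Recommended sampling settings applied to the chat completions endpoint.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zhi-Create-Qwen3-32B",
        "messages": [{"role": "user", "content": "请你以鲁迅的口吻,写一篇介绍西湖醋鱼的文章"}],
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95
    }'
```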
245
+
246
+ ## 6. Citation
247
+
248
+ ```text
249
+ @misc{Zhi-Create-Qwen3-32B,
250
+ title={Zhi-Create-Qwen3-32B: RAFT-Enhanced Direct Preference Optimization and Curriculum Learning for Robust Creative Writing in LLMs},
251
+ author={Jiewu Wang, Xu Chen, Wenyuan Su, Chao Huang, Hongkui Gao, Lin Feng, Shan Wang, Jingjing Wang, Zebin Ou},
252
+ year={2025},
253
+ eprint={},
254
+ archivePrefix={},
255
+ url={https://huggingface.co/Zhihu-ai/Zhi-Create-Qwen3-32B},
256
+ }
257
+ ```
258
+
259
+ ## 7. Contact
260
+
261
+ If you have any questions, please raise an issue or contact us at [ai@zhihu.com](mailto:ai@zhihu.com).
config.json ADDED
@@ -0,0 +1,33 @@
{
  "architectures": [
    "LlamaForCausalLMEagle3"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "draft_vocab_size": 32000,
  "eos_token_id": 151645,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 25600,
  "max_position_embeddings": 40960,
  "max_window_layers": 64,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 1,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.53.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}
images/data_distribution.png ADDED
images/writingbench_score.png ADDED

Git LFS Details

  • SHA256: 1d6feb3f54dbad00c9c24cf3f3b485e8534b443b55cdcbf20113ea978c51ac36
  • Pointer size: 131 Bytes
  • Size of remote file: 128 kB
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:81f91df26b525a30110e6538e33efbd71f820b9cda8070104bbdf7234bd18994
size 3121274856