Text Generation · Transformers · Safetensors · English · qwen2 · conversational · text-generation-inference
michaelzhiluo committed · verified
Commit 7a8188b · 1 Parent(s): 906e60f

Update README.md
Files changed (1)
  1. README.md +114 -115
README.md CHANGED
@@ -1,115 +1,114 @@
- ---
- license: mit
- library_name: transformers
- datasets:
- - AI-MO/NuminaMath-CoT
- - KbsdJames/Omni-MATH
- - RUC-AIBOX/STILL-3-Preview-RL-Data
- - hendrycks/competition_math
- language:
- - en
- base_model:
- - deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- pipeline_tag: text-generation
- ---
-
- <div align="center">
- <span style="font-family: default; font-size: 1.5em;">DeepScaleR-1.5B-Preview</span>
- <div>
- 🚀 Democratizing Reinforcement Learning for LLMs 🌟
- </div>
- </div>
- <br>
- <div align="center" style="line-height: 1;">
- <a href="https://github.com/agentica-project/deepscaler" style="margin: 2px;">
- <img alt="Code" src="https://img.shields.io/badge/DeepScaleR-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
- </a>
- <a href="https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2" target="_blank" style="margin: 2px;">
- <img alt="Blog" src="https://img.shields.io/badge/Notion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
- </a>
- <a href="https://x.com/Agentica_/status/1889006266661617779" style="margin: 2px;">
- <img alt="X.ai" src="https://img.shields.io/badge/Agentica-white?style=for-the-badge&logo=X&logoColor=000&color=000&labelColor=white" style="display: inline-block; vertical-align: middle;"/>
- </a>
- <a href="https://huggingface.co/agentica-org" style="margin: 2px;">
- <img alt="Hugging Face" src="https://img.shields.io/badge/Agentica-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor" style="display: inline-block; vertical-align: middle;"/>
- </a>
- </div>
- </div>
- </div>
-
- ## DeepScaleR Overview
- DeepScaleR-1.5B-Preview is a language model fine-tuned from DeepSeek-R1-Distilled-Qwen-1.5B using distributed reinforcement learning (RL) to scale up to long context lengths. The model achieves 43.1% Pass@1 accuracy on AIME 2024, representing a 15% improvement over the base model (28.8%) and surpassing OpenAI's O1-Preview performance with just 1.5B parameters.
-
- ## Data
- Our training dataset consists of approximately 40,000 unique problem-answer pairs compiled from:
- - AIME problems (1984-2023)
- - AMC problems (prior to 2023)
- - Omni-MATH dataset
- - Still dataset
-
- ## Training Recipe
- We employ Deepseek's Group Relative Policy Optimization (GRPO), a simplified RL algorithm that extends PPO by:
- - Normalizing advantage function over all samples generated from the same prompt.
- - Applying KL divergence regularization on top of PPO's surrogate loss to prevent significant policy drift.
-
- **Reward Function**: Our reward function is simple but effective:
- - 1 for correct answers passing LaTeX/Sympy checks
- - 0 for incorrect or improperly formatted answers
- - Note: No partial rewards (such as PRMs) or intermediate feedback.
-
- **Iterative Context Lengthening**: A key challenge in scaling RL for reasoning is compute cost. Our approach trains models with progressively longer contexts as the model improves, thus saving monetary costs and end2end training time:
- - Initial 8K Context (0-1040 steps):
- - 22.9% -> 33% Pass@1 on AIME 2024
- - Trained on 8 A100-80GB GPUs, BS= (Prompts) * (Samples/Prompt) = 128 * 8 = 1024
- - Extended to 16K (steps 1040-1520):
- - 33% -> 43% Pass@1 on AIME 2024
- - Trained on 32 A100-80GB GPUs, BS= (Prompts) * (Samples/Prompt) = 128 * 16 = 2048
- - Further extended to 24K (step 1520+):
- - 38% -> 43% Pass@1 on AIME 2024
- - Trained on 32 A100-80GB GPUs, BS= (Prompts) * (Samples/Prompt) = 128 * 16 = 2048
- - Significant improvements within <200 steps
-
- A more detailed description of the training recipe can be found in our [blog post](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2).
-
- ## Evaluation
- We report Pass@1 accuracy averaged over 16 samples for each problem.
- | Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg. |
- |-------|-----------|-----------|-----------|--------------|---------------|------|
- | Qwen-2.5-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
- | rStar-Math-7B | 26.7 | 78.4 | 47.5 | - | 47.1 | - |
- | Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
- | Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | <strong>39.7</strong> | 43.3 | 50.9 |
- | DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
- | Still-1.5B | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
- | <strong>DeepScaleR-1.5B-Preview</strong> | <strong>43.1</strong> | <strong>87.8</strong> | <strong>73.6</strong> | 30.2 | <strong>50.0</strong> | <strong>57.0</strong> |
- | O1-Preview | 40.0 | 81.4 | - | - | - | - |
-
- ## Serving DeepScaleR
- Our model can be served using popular high-performance inference systems:
- - vLLM
- - Hugging Face Text Generation Inference (TGI)
- - SGLang
- - TensorRT-LLM
-
- All these systems support the OpenAI Chat Completions API format.
-
- ## License
- This project is released under the MIT License, reflecting our commitment to open and accessible AI development.
- We believe in democratizing AI technology by making our work freely available for anyone to use, modify, and build upon.
- This permissive license ensures that researchers, developers, and enthusiasts worldwide can leverage and extend our work without restrictions, fostering innovation and collaboration in the AI community.
-
- ## Acknowledgement
- - Our training experiments are powered by our heavily modified fork of [Verl](https://github.com/agentica-project/verl), an open-source RLHF library.
- - Our model is trained on top of [`DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
- - Our work is done as part of [Berkeley Sky Computing Lab](https://skycomputing.berkeley.edu/) and [Berkeley AI Research](https://bair.berkeley.edu/).
-
- ## Citation
- ```bibtex
- @misc{deepscaler2025,
- title={DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL},
- author={Michael Luo and Sijun Tan and Justin Wong and Xiaoxiang Shi and William Y. Tang and Manan Roongta and Colin Cai and Jeffrey Luo and Tianjun Zhang and Li Erran Li and Raluca Ada Popa and Ion Stoica},
- year={2025},
- howpublished={\url{https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2}},
- note={Notion Blog}
- year={2025}
- }
 
+ ---
+ license: mit
+ library_name: transformers
+ datasets:
+ - PrimeIntellect/verifiable-coding-problems
+ - likaixin/TACO-verified
+ - livecodebench/code_generation_lite
+ language:
+ - en
+ base_model:
+ - deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
+ pipeline_tag: text-generation
+ ---
+
+ <div align="center">
+ <span style="font-family: default; font-size: 1.5em;">DeepCoder-14B-Preview</span>
+ <div>
+ 🚀 Democratizing Reinforcement Learning for LLMs (RLLM) 🌟
+ </div>
+ </div>
+ <br>
+ <div align="center" style="line-height: 1;">
+ <a href="https://github.com/agentica-project/rllm" style="margin: 2px;">
+ <img alt="Code" src="https://img.shields.io/badge/RLLM-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ <a href="https://www.google.com" target="_blank" style="margin: 2px;">
+ <img alt="Blog" src="https://img.shields.io/badge/Notion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ <a href="https://x.com/Agentica_" style="margin: 2px;">
+ <img alt="X.ai" src="https://img.shields.io/badge/Agentica-white?style=for-the-badge&logo=X&logoColor=000&color=000&labelColor=white" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ <a href="https://huggingface.co/agentica-org" style="margin: 2px;">
+ <img alt="Hugging Face" src="https://img.shields.io/badge/Agentica-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor" style="display: inline-block; vertical-align: middle;"/>
+ </a>
+ </div>
+
+ ## DeepCoder Overview
+ DeepCoder-14B-Preview is a code reasoning LLM fine-tuned from DeepSeek-R1-Distill-Qwen-14B using distributed reinforcement learning (RL) to scale up to long context lengths. The model achieves 60.6% Pass@1 accuracy on LiveCodeBench v5 (8/1/24-2/1/25), representing an 8% improvement over the base model (53%) and matching the performance of OpenAI's o3-mini with just 14B parameters.
+
+ <div style="width: 50%; margin: 0 auto;">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/654037be97949fd2304aab7f/r3-vzkItOCrMf1qldW0Mj.png" style="width: 100%;" />
+ </div>
+
+ ## Data
+ Our training dataset consists of approximately 24K unique problem-test pairs compiled from the sources below (a loading sketch follows the list):
+ - TACO-Verified
+ - PrimeIntellect SYNTHETIC-1
+ - LiveCodeBench v5 (5/1/23-7/31/24)
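+
+ All three sources are hosted on the Hugging Face Hub (see the `datasets` field above). As a hedged sketch, they can be pulled with the `datasets` library; the split names below are assumptions, and some sources need extra arguments (check each dataset card):
+
+ ```python
+ from datasets import load_dataset
+
+ taco = load_dataset("likaixin/TACO-verified", split="train")
+ synthetic1 = load_dataset("PrimeIntellect/verifiable-coding-problems", split="train")
+ # LiveCodeBench is versioned; it may require a config/version-tag argument.
+ lcb = load_dataset("livecodebench/code_generation_lite", split="test")
+
+ print(len(taco), len(synthetic1), len(lcb))
+ ```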
+
+ ## Training Recipe
+
+ Our training recipe relies on an improved version of GRPO (GRPO+) and on iterative context lengthening, a technique introduced in DeepScaleR.
+
+ ### GRPO+
+
+ We enhance the original GRPO algorithm with insights from DAPO to enable more stable training (a loss sketch follows the list):
+
+ - **Offline Difficulty Filtering:** DAPO employs online dynamic sampling, discarding both entirely correct and entirely incorrect samples on the fly. While this helps maintain a more stable effective batch size, it introduces significant runtime overhead due to rejection sampling. Instead, we perform offline difficulty filtering on a subset of coding problems to ensure the training dataset remains within a suitable difficulty range.
+ - **No Entropy Loss:** We observed that including an entropy loss term often led to instability, with entropy growing exponentially and ultimately collapsing training. To mitigate this, we eliminate the entropy loss entirely.
+ - **No KL Loss:** Removing the KL loss frees the LLM from staying within the trust region of the original SFT model. It also obviates computing log probabilities for the reference policy, thereby accelerating training.
+ - **Overlong Filtering (from DAPO):** To preserve long-context reasoning, we mask the loss for truncated sequences. This technique enables DeepCoder to generalize to 64K-context inference despite being trained with a 32K context.
+ - **Clip High (from DAPO):** By increasing the upper bound in GRPO/PPO's surrogate loss, we encourage more exploration and more stable entropy.
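+
+ To make these changes concrete, here is a minimal sketch of the resulting policy loss. This is illustrative pseudocode rather than our training implementation, and the clipping values are assumptions borrowed from DAPO:
+
+ ```python
+ import torch
+
+ def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
+     """GRPO advantage: normalize rewards across samples of the same prompt.
+     rewards: [num_prompts, samples_per_prompt]"""
+     mean = rewards.mean(dim=1, keepdim=True)
+     std = rewards.std(dim=1, keepdim=True)
+     return ((rewards - mean) / (std + 1e-6)).flatten()
+
+ def grpo_plus_loss(logp_new, logp_old, advantages, mask,
+                    eps_low=0.2, eps_high=0.28):
+     """logp_new/logp_old: per-token log-probs [batch, seq_len];
+     advantages: per-sequence values [batch]; mask: 0 for padding AND for
+     every token of truncated sequences (overlong filtering)."""
+     mask = mask.float()
+     ratio = torch.exp(logp_new - logp_old)
+     adv = advantages.unsqueeze(-1)  # broadcast to tokens
+     # Clip High: asymmetric bounds, with a larger upper bound (eps_high)
+     # to encourage exploration on positive-advantage tokens.
+     clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
+     per_token = -torch.min(ratio * adv, clipped * adv)
+     # Note: no KL penalty against a reference policy and no entropy bonus.
+     return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
+ ```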
+
+ ### Iterative Context Lengthening
+
+ Our original `DeepScaleR-1.5B-Preview` scaled long-context training from 8K→16K→24K, achieving 33%→38%→43% on AIME respectively. Similarly, `DeepCoder-14B-Preview` is trained on 16K→32K, achieving 54%→58% on LiveCodeBench (v5). When evaluated at a 64K context, it successfully generalizes to the longer context, reaching 60.6%.
+
+ DeepCoder generalizes better to long contexts than the base distilled model, thanks to DAPO's overlong filtering. However, its longer responses are often truncated when the max length is capped at 16K, which can lower its scores.
+
+ | **Model** | **16K** | **32K** | **64K** |
+ | --- | --- | --- | --- |
+ | **DeepCoder-14B-Preview** | 45.6 | 57.9 | 60.6 |
+ | **DeepSeek-R1-Distill-Qwen-14B** | 50.2 | 53.0 | 53.0 |
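+
+ The 16K/32K/64K columns are the maximum generation lengths allowed at evaluation time. A minimal sketch of sweeping that cap with vLLM (the sampling settings are assumptions, not our evaluation harness):
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(model="agentica-org/DeepCoder-14B-Preview", max_model_len=65536)
+ prompts = ["..."]  # LiveCodeBench problems, formatted as prompts (not shown)
+
+ # Re-run the same prompts under each response-length cap from the table.
+ for cap in (16_384, 32_768, 65_536):
+     params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=cap)
+     outputs = llm.generate(prompts, params)
+     # ...extract code from each output and score against the unit tests...
+ ```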
+
+ A more detailed description of the training recipe can be found in our [blog post](https://www.google.com).
+
+ ## Evaluation
+
+ We evaluate `DeepCoder-14B-Preview` on various coding benchmarks, including LiveCodeBench (LCB v5), Codeforces, and HumanEval+.
+
+ | **Model** | LCB (v5) (8/1/24-2/1/25) | Codeforces Rating | Codeforces Percentile | HumanEval+ |
+ | --- | --- | --- | --- | --- |
+ | **DeepCoder-14B-Preview (ours)** | ***60.6*** | ***1936*** | ***95.3*** | ***92.6*** |
+ | **DeepSeek-R1-Distill-Qwen-14B** | 53.0 | 1791 | 92.7 | 92.0 |
+ | **O1-2024-12-17 (Low)** | 59.5 | **1991** | **96.1** | 90.8 |
+ | **O3-Mini-2025-1-31 (Low)** | **60.9** | 1918 | 94.9 | 92.6 |
+ | **O1-Preview** | 42.7 | 1658 | 88.5 | 89.0 |
+ | **DeepSeek-R1** | 62.8 | 1948 | 95.4 | 92.6 |
+ | **Llama-4-Behemoth** | 49.4 | - | - | - |
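+
+ The LCB numbers above are Pass@1-style accuracies. When multiple samples are drawn per problem, the standard unbiased pass@k estimator (from the Codex paper) applies; a small sketch:
+
+ ```python
+ import numpy as np
+
+ def pass_at_k(n: int, c: int, k: int) -> float:
+     """Unbiased pass@k given n samples per problem, c of them correct."""
+     if n - c < k:
+         return 1.0
+     return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
+
+ print(pass_at_k(16, 7, 1))  # 0.4375 == 7/16, i.e. mean accuracy for k=1
+ ```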
+
+ ## Serving DeepCoder
+ Our model can be served using popular high-performance inference systems:
+ - vLLM
+ - Hugging Face Text Generation Inference (TGI)
+ - SGLang
+ - TensorRT-LLM
+
+ All these systems support the OpenAI Chat Completions API format.
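+
+ For example, with vLLM's OpenAI-compatible server (the port, model id, and sampling settings below are illustrative assumptions):
+
+ ```python
+ # Launch the server first, e.g.:
+ #   vllm serve agentica-org/DeepCoder-14B-Preview
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ response = client.chat.completions.create(
+     model="agentica-org/DeepCoder-14B-Preview",
+     messages=[{"role": "user", "content": "Write a Python function that returns the nth Fibonacci number."}],
+     temperature=0.6,
+ )
+ print(response.choices[0].message.content)
+ ```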
+
+ ## License
+ This project is released under the MIT License, reflecting our commitment to open and accessible AI development.
+ We believe in democratizing AI technology by making our work freely available for anyone to use, modify, and build upon.
+ This permissive license ensures that researchers, developers, and enthusiasts worldwide can leverage and extend our work without restrictions, fostering innovation and collaboration in the AI community.
+
+ ## Acknowledgement
+ - Our training experiments are powered by our heavily modified fork of [Verl](https://github.com/agentica-project/verl), an open-source post-training library.
+ - Our model is trained on top of [`DeepSeek-R1-Distill-Qwen-14B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B).
+ - Our work is done as part of [Berkeley Sky Computing Lab](https://skycomputing.berkeley.edu/) and [Berkeley AI Research](https://bair.berkeley.edu/).
+
+ ## Citation
+ ```bibtex
+ ```