lewtun (HF Staff) committed
Commit 1a5f577 · verified · 1 Parent(s): 911eec1

Update README.md

Files changed (1):
  1. README.md (+9 −6)
README.md CHANGED
```diff
@@ -1,7 +1,7 @@
 ---
 license: apache-2.0
 datasets:
-- open-r1/Mixture-of-Reasons
+- open-r1/Mixture-of-Thoughts
 language:
 - en
 base_model:
@@ -69,7 +69,7 @@ print(outputs[0]["generated_text"])
 
 We use [Lighteval](https://github.com/huggingface/lighteval) to evaluate models on the following benchmarks:
 
-| Model | AIME 2024 | MATH-500 | GPQA-D | LiveCodeBench |
+| Model | AIME 2024 | MATH-500 | GPQA Diamond | LiveCodeBench v5 |
 |-----------------------------|-----------|----------|--------|---------------|
 | OpenR1-Distill-7B | 52.7 | 89.0 | 52.8 | 39.4 |
 | DeepSeek-R1-Distill-Qwen-7B | 51.3 | 93.5 | 52.4 | 37.4 |
@@ -87,17 +87,20 @@ Note that for benchmarks like AIME 2024, it is important to sample many response
 
 ## Training methodology
 
-OpenR1-Distill-7B was trained using supervised fine-tuning (SFT) on the [Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) dataset, which contains reasoning traces distilled from [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1). To optimise the data mixture, we followed the same methodology described in the [Phi-4-reasoning tech report](https://huggingface.co/papers/2504.21318), namely that mixtures can be optimised independently per domain, and then combined into a single dataset. The figure below shows evolution of our experiments on the math and code domains:
+OpenR1-Distill-7B was trained using supervised fine-tuning (SFT) on the [Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) dataset, which contains 350k reasoning traces distilled from [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1). To optimise the data mixture, we followed the same methodology described in the [Phi-4-reasoning tech report](https://huggingface.co/papers/2504.21318), namely that mixtures can be optimised independently per domain, and then combined into a single dataset. The figure below shows the evolution of our experiments on the math and code domains:
 
 <img src="data_mixture.png" alt="Centered Image" style="display: block; margin: 0 auto;">
 
 The individual experiments correspond to the following:
 
-* exp1 - exp3: extending the model's base RoPE frequency from 10k to 100k, 200k, and 300k respectively.
-* exp4 - exp6: scaling the learning rate on the math and code mixtures from 1e-5 to 2e-5, and 4e-5 respectively.
+* exp1 - exp3: extending the model's base RoPE frequency from 10k to 100k, 300k, and 500k respectively. We find there is no significant difference between the scaling factors, and choose 300k in all subsequent experiments.
+* exp4 - exp6: independently scaling the learning rate on the math and code mixtures from 1e-5 to 2e-5, and 4e-5 respectively.
 * exp7 - exp8: measuring the impact of sequence packing (exp7) versus no packing (exp8) on the math mixture.
 * exp9 - exp10: measuring the impact of training on all three mixtures (math, code, and science) versus training on math and code only.
 
+> [!NOTE]
+> We use LiveCodeBench v4 to accelerate evaluation during our ablations as it contains around half the problems of v5, yet is still representative of the full benchmark.
+
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
@@ -117,7 +120,7 @@ The following hyperparameters were used during training:
 
 ### Training results
 
-During training, we monitor progress on AIME 2024, GPQA Diamond, and LiveCodeBench v4 every epoch. We use LiveCodeBench v4 to accelerate evaluation as it contains fewer problems than v5, yet is still representative of the full benchmark. The following plot shows the training results:
+During training, we monitor progress on AIME 2024, GPQA Diamond, and LiveCodeBench v4 every epoch. The following plot shows the training results:
 
 <img src="train_results.png" alt="Centered Image" style="display: block; margin: 0 auto;">
```
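
The updated training-methodology paragraph amounts to standard supervised fine-tuning with an extended RoPE base frequency and sequence packing. The sketch below shows roughly how such a run could be set up with TRL; the base model id (cut off at `base_model:` in this diff), the dataset config name, and every hyperparameter value are illustrative assumptions rather than the settings used for the released checkpoint.

```python
import torch
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Assumed base model: the diff above does not show the `base_model:` value.
base_model = "Qwen/Qwen2.5-Math-7B"

# exp1 - exp3: extend the base RoPE frequency (the README settles on 300k).
config = AutoConfig.from_pretrained(base_model)
config.rope_theta = 300_000

model = AutoModelForCausalLM.from_pretrained(base_model, config=config, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Dataset config name "all" (math + code + science) is assumed; the dataset
# provides chat-formatted `messages` that SFTTrainer can consume directly.
dataset = load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train")

args = SFTConfig(
    output_dir="OpenR1-Distill-7B-sft",
    learning_rate=4e-5,               # placeholder from the exp4 - exp6 sweep, not the final value
    packing=True,                     # sequence packing, as in exp7
    max_seq_length=32_768,            # long reasoning traces need a long context (name varies across TRL versions)
    per_device_train_batch_size=1,    # placeholder
    gradient_accumulation_steps=8,    # placeholder
    num_train_epochs=1,               # placeholder
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,       # `tokenizer=` in older TRL releases
)
trainer.train()
```

The open-r1 repository ships its own training recipes; this snippet only makes the moving parts of the updated paragraph concrete.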
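
The hunk context above also notes that benchmarks like AIME 2024 require sampling many responses per problem: with only 30 problems, a single greedy pass is very noisy. A minimal illustration of averaging pass@1 over k samples (not Lighteval's implementation) is:

```python
# Illustrative pass@1 averaging over k sampled responses per problem.
from statistics import mean

def pass_at_1(correct_per_problem: list[list[bool]]) -> float:
    """correct_per_problem[i][j] is True if sample j for problem i is correct."""
    # Average over samples within each problem, then over problems.
    return mean(mean(samples) for samples in correct_per_problem)

# Example: 2 problems, 4 samples each.
results = [[True, False, True, True], [False, False, True, False]]
print(f"pass@1 = {pass_at_1(results):.3f}")  # (0.75 + 0.25) / 2 = 0.500
```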