lewtun (HF Staff) committed
Commit 1a5f577 · verified · 1 Parent(s): 911eec1

Update README.md

Files changed (1):
  1. README.md (+9 −6)
README.md CHANGED
```diff
@@ -1,7 +1,7 @@
 ---
 license: apache-2.0
 datasets:
-- open-r1/Mixture-of-Reasons
+- open-r1/Mixture-of-Thoughts
 language:
 - en
 base_model:
@@ -69,7 +69,7 @@ print(outputs[0]["generated_text"])
 
 We use [Lighteval](https://github.com/huggingface/lighteval) to evaluate models on the following benchmarks:
 
-| Model | AIME 2024 | MATH-500 | GPQA-D | LiveCodeBench |
+| Model | AIME 2024 | MATH-500 | GPQA Diamond | LiveCodeBench v5 |
 |-----------------------------|-----------|----------|--------|---------------|
 | OpenR1-Distill-7B | 52.7 | 89.0 | 52.8 | 39.4 |
 | DeepSeek-R1-Distill-Qwen-7B | 51.3 | 93.5 | 52.4 | 37.4 |
@@ -87,17 +87,20 @@ Note that for benchmarks like AIME 2024, it is important to sample many response
 
 ## Training methodology
 
-OpenR1-Distill-7B was trained using supervised fine-tuning (SFT) on the [Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) dataset, which contains reasoning traces distilled from [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1). To optimise the data mixture, we followed the same methodology described in the [Phi-4-reasoning tech report](https://huggingface.co/papers/2504.21318), namely that mixtures can be optimised independently per domain, and then combined into a single dataset. The figure below shows evolution of our experiments on the math and code domains:
+OpenR1-Distill-7B was trained using supervised fine-tuning (SFT) on the [Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) dataset, which contains 350k reasoning traces distilled from [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1). To optimise the data mixture, we followed the same methodology described in the [Phi-4-reasoning tech report](https://huggingface.co/papers/2504.21318), namely that mixtures can be optimised independently per domain, and then combined into a single dataset. The figure below shows the evolution of our experiments on the math and code domains:
 
 <img src="data_mixture.png" alt="Centered Image" style="display: block; margin: 0 auto;">
 
 The individual experiments correspond to the following:
 
-* exp1 - exp3: extending the model's base RoPE frequency from 10k to 100k, 200k, and 300k respectively.
-* exp4 - exp6: scaling the learning rate on the math and code mixtures from 1e-5 to 2e-5, and 4e-5 respectively.
+* exp1 - exp3: extending the model's base RoPE frequency from 10k to 100k, 300k, and 500k respectively. We find there is no significant difference between the scaling factors, and choose 300k in all subsequent experiments.
+* exp4 - exp6: independently scaling the learning rate on the math and code mixtures from 1e-5 to 2e-5, and 4e-5 respectively.
 * exp7 - exp8: measuring the impact of sequence packing (exp7) versus no packing (exp8) on the math mixture.
 * exp9 - exp10: measuring the impact of training on all three mixtures (math, code, and science) versus training on math and code only.
 
+> [!NOTE]
+> We use LiveCodeBench v4 to accelerate evaluation during our ablations as it contains around half the problems of v5, yet is still representative of the full benchmark.
+
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
@@ -117,7 +120,7 @@ The following hyperparameters were used during training:
 
 ### Training results
 
-During training, we monitor progress on AIME 2024, GPQA Diamond, and LiveCodeBench v4 every epoch. We use LiveCodeBench v4 to accelerate evaluation as it contains fewer problems than v5, yet is still representative of the full benchmark. The following plot shows the training results:
+During training, we monitor progress on AIME 2024, GPQA Diamond, and LiveCodeBench v4 every epoch. The following plot shows the training results:
 
 <img src="train_results.png" alt="Centered Image" style="display: block; margin: 0 auto;">
```
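
The updated training-methodology paragraph amounts to standard supervised fine-tuning with an extended RoPE base frequency and sequence packing. The sketch below shows roughly how such a run could be set up with TRL; the base model id (cut off at `base_model:` in this diff), the dataset config name, and every hyperparameter value are illustrative assumptions rather than the settings used for the released checkpoint.

```python
import torch
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Assumed base model: the diff above does not show the `base_model:` value.
base_model = "Qwen/Qwen2.5-Math-7B"

# exp1 - exp3: extend the base RoPE frequency (the README settles on 300k).
config = AutoConfig.from_pretrained(base_model)
config.rope_theta = 300_000

model = AutoModelForCausalLM.from_pretrained(base_model, config=config, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Dataset config name "all" (math + code + science) is assumed; the dataset
# provides chat-formatted `messages` that SFTTrainer can consume directly.
dataset = load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train")

args = SFTConfig(
    output_dir="OpenR1-Distill-7B-sft",
    learning_rate=4e-5,               # placeholder from the exp4 - exp6 sweep, not the final value
    packing=True,                     # sequence packing, as in exp7
    max_seq_length=32_768,            # long reasoning traces need a long context (name varies across TRL versions)
    per_device_train_batch_size=1,    # placeholder
    gradient_accumulation_steps=8,    # placeholder
    num_train_epochs=1,               # placeholder
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,       # `tokenizer=` in older TRL releases
)
trainer.train()
```

The open-r1 repository ships its own training recipes; this snippet only makes the moving parts of the updated paragraph concrete.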
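
The hunk context above also notes that benchmarks like AIME 2024 require sampling many responses per problem: with only 30 problems, a single greedy pass is very noisy. A minimal illustration of averaging pass@1 over k samples (not Lighteval's implementation) is:

```python
# Illustrative pass@1 averaging over k sampled responses per problem.
from statistics import mean

def pass_at_1(correct_per_problem: list[list[bool]]) -> float:
    """correct_per_problem[i][j] is True if sample j for problem i is correct."""
    # Average over samples within each problem, then over problems.
    return mean(mean(samples) for samples in correct_per_problem)

# Example: 2 problems, 4 samples each.
results = [[True, False, True, True], [False, False, True, False]]
print(f"pass@1 = {pass_at_1(results):.3f}")  # (0.75 + 0.25) / 2 = 0.500
```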