📖Introduction


LUFFY is a reinforcement learning framework that bridges the gap between zero-RL and imitation learning by incorporating off-policy reasoning traces into the training process. Built upon GRPO, LUFFY combines on-policy rollouts with off-policy demonstrations during advantage estimation and introduces policy shaping via regularized importance sampling to emphasize low-probability yet crucial actions; a minimal sketch of these two ideas follows the highlights below.

Key Highlights:

  • Off-Policy Guidance: Seamlessly integrates external reasoning traces to bootstrap learning from stronger models.
  • Dynamic Balance: Learns when to imitate and when to explore, adapting over the course of training.
  • Policy Shaping: Emphasizes important actions often ignored in standard policy gradients, enabling better generalization.
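
To make the two mechanisms above concrete, here is a minimal, illustrative PyTorch sketch of (a) GRPO-style advantage estimation over a single group that mixes on-policy rollouts with off-policy demonstrations, and (b) a regularized importance-sampling shaping function of the form f(p) = p / (p + γ). The helper names, the value of γ, and the exact loss form are assumptions for exposition, not the reference implementation; see the paper and code for the actual objective.

import torch

def mixed_group_advantages(on_policy_rewards: torch.Tensor,
                           off_policy_rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages over one group that mixes on-policy rollouts
    with off-policy demonstration traces (hypothetical helper)."""
    rewards = torch.cat([on_policy_rewards, off_policy_rewards])  # one shared group
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)    # normalize within the group

def shaped_weight(token_probs: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    """Regularized importance-sampling shaping f(p) = p / (p + gamma).
    Relative to using p directly, the gradient on low-probability tokens is
    amplified, so rare but crucial actions are not washed out."""
    return token_probs / (token_probs + gamma)

# Example: four on-policy rollouts and two off-policy traces scored 0/1 by a verifier.
on_r = torch.tensor([0.0, 1.0, 0.0, 1.0])
off_r = torch.tensor([1.0, 1.0])
print(mixed_group_advantages(on_r, off_r))
print(shaped_weight(torch.tensor([0.01, 0.50, 0.99])))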

Inference

Here’s an example of using LUFFY for inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "Elliott/LUFFY-Qwen-Math-7B-Zero"

question = "Which number is larger, 9.11 or 9.9?"

# Format the question with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate with vLLM.
llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=8192)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
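
The generated reasoning trace usually ends with the final answer wrapped in \boxed{...}, the common convention for math-tuned Qwen models. Below is a small helper for pulling that answer out of the generated text; the function name and regex are assumptions for illustration and are not part of the LUFFY codebase.

import re

def extract_boxed_answer(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in `text`, if any.
    Nested braces deeper than one level are not handled by this simple regex."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

print(extract_boxed_answer(r"... so the larger number is \boxed{9.9}."))  # -> 9.9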

📃Evaluation

LUFFY is evaluated on six competition-level benchmarks, achieving state-of-the-art results among all zero-RL methods. It surpasses both on-policy RL and imitation learning (SFT), especially in generalization:

| Model | AIME 2024 | AIME 2025 | AMC | MATH-500 | Minerva | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Math | 12.9 | 4.2 | 32.6 | 48.8 | 10.7 | 14.8 | 20.7 |
| Qwen2.5-Math-Instruct | 11.4 | 8.8 | 48.3 | 81.2 | 33.1 | 38.8 | 36.9 |
| SimpleRL-Zero | 26.3 | 6.7 | 55.4 | 74.4 | 25.7 | 35.4 | 37.3 |
| OpenReasoner-Zero | 17.2 | 15.0 | 52.3 | 84.6 | 33.8 | 47.1 | 41.7 |
| PRIME-Zero | 17.9 | 14.7 | 55.2 | 79.4 | 38.2 | 42.2 | 41.3 |
| Oat-Zero | 31.7 | 11.0 | 61.6 | 79.2 | 29.8 | 42.5 | 42.6 |
| LUFFY | 29.5 | 23.2 | 66.1 | 88.4 | 33.8 | 56.4 | 49.6 |

LUFFY also generalizes well to out-of-distribution tasks, with an average gain of more than 6.2 points across ARC-C, GPQA, and MMLU-Pro:

| Model | ARC-c | GPQA-diamond | MMLU-Pro | Avg. |
|---|---|---|---|---|
| Qwen2.5-Math-7B-Base | 18.2 | 11.1 | 16.9 | 15.4 |
| Qwen2.5-Math-7B-Instruct | 70.3 | 24.7 | 34.1 | 43.0 |
| SimpleRL-Zero | 30.2 | 23.2 | 34.5 | 29.3 |
| OpenReasoner-Zero | 66.2 | 29.8 | 58.7 | 51.6 |
| PRIME-Zero | 73.3 | 18.2 | 32.7 | 41.4 |
| Oat-Zero | 70.1 | 23.7 | 41.7 | 45.2 |
| LUFFY | 80.5 | 39.9 | 53.0 | 57.8 |

🌻Acknowledgement

LUFFY builds upon veRL and deepscaler, and uses vLLM for inference. We use Math-Verify for math reasoning evaluation. We thank the open-source community for datasets and backbones, including NuminaMath, OpenR1-Math-220k, Qwen2.5-Math, and the DeepSeek-R1 model.

Citation

If you find our model, data, or evaluation code useful, please kindly cite our paper:

@misc{luffy,
      title={Learning to Reason under Off-Policy Guidance}, 
      author={Jianhao Yan and Yafu Li and Zican Hu and Zhi Wang and Ganqu Cui and Xiaoye Qu and Yu Cheng and Yue Zhang},
      year={2025},
      eprint={2504.14945},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.14945}, 
}