Abstract
RLVR is extended to open-ended tasks using rubric-based rewards, improving performance on benchmarks and providing stylistic control in LLMs.
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics written by humans, by LLMs, or through human-LLM collaboration. Implementing rubric-based RL poses practical challenges; we address them with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially in the humanities), outperforming a 671B DeepSeek-V3 model by +2.4% while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
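To make the idea of rubrics as "structured, model-interpretable criteria" concrete, here is a minimal, hypothetical sketch of how rubric-based scoring could be wired into an RL reward: each rubric item is a weighted natural-language criterion, a judge (here an injected callable, standing in for an LLM grader) scores the response per criterion, and the weighted mean becomes the scalar reward. The names (`RubricItem`, `rubric_reward`, `judge`) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of rubric-based reward aggregation (not the authors' code).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    criterion: str   # e.g., "The response avoids generic, 'AI-like' phrasing."
    weight: float = 1.0

def rubric_reward(
    prompt: str,
    response: str,
    rubric: List[RubricItem],
    judge: Callable[[str, str, str], float],  # returns a score in [0, 1] for one criterion
) -> float:
    """Aggregate per-criterion judge scores into a single scalar reward for RL."""
    total_weight = sum(item.weight for item in rubric)
    weighted = sum(item.weight * judge(prompt, response, item.criterion) for item in rubric)
    return weighted / total_weight if total_weight > 0 else 0.0

if __name__ == "__main__":
    # Trivial stand-in judge; a real setup would prompt an LLM grader per criterion.
    rubric = [
        RubricItem("Directly answers the user's question.", weight=2.0),
        RubricItem("Uses a natural, human-like tone.", weight=1.0),
    ]
    dummy_judge = lambda p, r, c: 1.0 if len(r) > 20 else 0.5
    print(rubric_reward("Explain tides.", "Tides arise mainly from the Moon's gravity ...", rubric, dummy_judge))
```

The scalar returned by such a function would then replace the unit-test or answer-matching signal in a standard RLVR training loop; the design choice of weighting criteria lets subjective goals (e.g., tone) be traded off against correctness-style criteria.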
Community
SOTA RL-trained model on open-ended tasks
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains (2025)
- URPO: A Unified Reward & Policy Optimization Framework for Large Language Models (2025)
- RLPR: Extrapolating RLVR to General Domains without Verifiers (2025)
- Posterior-GRPO: Rewarding Reasoning Processes in Code Generation (2025)
- RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback (2025)
- Libra: Assessing and Improving Reward Model by Learning to Think (2025)
- CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment (2025)