---
base_model: Qwen/Qwen2.5-1.5B
datasets:
- math
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
---

# Qwen2.5-1.5B-Intuitor-MATH-1EPOCH
An Intuitor-fine-tuned version of Qwen2.5-1.5B trained on the MATH dataset.

This model is part of the work presented in the paper [**Learning to Reason without External Rewards**](https://huggingface.co/papers/2505.19590).

## Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable.

## Overview
**Intuitor** is a reinforcement learning method that fine-tunes large language models (LLMs) using *self-certainty*—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm we call **Reinforcement Learning from Internal Feedback (RLIF)**.

<p align="center">
<img src="https://raw.githubusercontent.com/sunblaze-ucb/rlif/main/figs/rlif.png" alt="RLIF Overview" width="700"/>
</p>

### 🧭 What is RLIF?
**Reinforcement Learning from Internal Feedback (RLIF)** is a training framework where language models learn *without any external rewards, gold labels, or verifiers*. Instead, models improve by optimizing *intrinsic signals*—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.

Intuitor instantiates RLIF by using **self-certainty**—the model's confidence, measured as the KL divergence between its next-token distribution and a uniform distribution over the vocabulary, averaged over the generated tokens—as an intrinsic reward in the GRPO policy optimization algorithm.
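
As a rough illustration, the sketch below shows how a self-certainty reward and GRPO-style group-relative advantages could be computed from token logits. It is a simplified, hypothetical example rather than the training code from the official repository; it assumes self-certainty is the per-token KL divergence KL(U ‖ π) between a uniform distribution U over the vocabulary and the model's next-token distribution π, averaged over the generated tokens.

```python
import math

import torch
import torch.nn.functional as F

# Illustrative sketch only -- not the official Intuitor training code.
# Assumes: self-certainty(o | q) = mean_i KL(U || pi(. | q, o_<i)),
# i.e. the per-token KL divergence from uniform, averaged over the response.

def self_certainty(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Per-sequence self-certainty from next-token logits.

    logits:        (batch, seq_len, vocab_size) logits for the generated tokens
    response_mask: (batch, seq_len) with 1.0 for generated tokens, 0.0 for padding
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits.float(), dim=-1)  # log pi
    # KL(U || pi) = -log(V) - (1/V) * sum_j log pi_j
    kl_from_uniform = -math.log(vocab_size) - log_probs.mean(dim=-1)
    per_token = kl_from_uniform * response_mask
    return per_token.sum(dim=-1) / response_mask.sum(dim=-1).clamp(min=1.0)

def group_relative_advantages(rewards: torch.Tensor, group_size: int) -> torch.Tensor:
    """GRPO-style advantages: standardize rewards within each group of
    responses sampled for the same prompt (zero mean, unit variance)."""
    grouped = rewards.view(-1, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    return ((grouped - mean) / (std + 1e-6)).view(-1)

# Tiny demo: 2 prompts x 4 sampled responses, 5 generated tokens, toy vocab of 50
logits = torch.randn(8, 5, 50)
mask = torch.ones(8, 5)
rewards = self_certainty(logits, mask)              # shape (8,)
advantages = group_relative_advantages(rewards, 4)  # shape (8,)
print(rewards.shape, advantages.shape)
```

In Intuitor, these group-normalized self-certainty scores play the role that verifier-based rewards play in standard GRPO.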
<p align="center">
<img src="https://raw.githubusercontent.com/sunblaze-ucb/rlif/main/figs/intuitor.png" alt="Intuitor" width="700"/>
</p>

## Code
The official code for "Learning to Reason without External Rewards" and the Intuitor framework is available in the [GitHub repository](https://github.com/sunblaze-ucb/rlif).

## Usage

This model can be loaded and used directly with the Hugging Face `transformers` library. Below is a basic example of text generation using the Qwen2.5 chat template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use torch.float16 if bfloat16 is not supported by your GPU
    device_map="auto",
)
model.eval()  # Set the model to evaluation mode

# Define a conversation using the Qwen2.5 chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve the following math problem: What is the sum of the first 10 prime numbers?"},
]

# Apply the chat template to get the prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize the input and move it to the model's device
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate output
with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs,  # Passes input_ids and attention_mask
        max_new_tokens=256,
        do_sample=False,  # Greedy decoding for deterministic output
        pad_token_id=tokenizer.eos_token_id,  # Important for Qwen2.5
    )

# Decode the generated text, excluding the input prompt
generated_text = tokenizer.batch_decode(
    generated_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(generated_text)
```
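The example uses greedy decoding (`do_sample=False`) so that outputs are reproducible. For more varied solutions, enable sampling by setting `do_sample=True` together with a moderate `temperature` (e.g., 0.7).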
## Benchmarks

Intuitor achieves:

* Comparable performance to GRPO on in-domain math reasoning tasks (GSM8K, MATH500).
* Superior generalization to code generation (LiveCodeBench, CRUXEval).
* Improved instruction following, without needing any gold labels or verifiable test suites.

For detailed results, see Table 1 in the paper. The following checkpoints from the paper are available on the Hugging Face Hub:

| Model Name | Size | Method | Hugging Face Link |
| :--------- | :--- | :----- | :---------------- |
| `sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH` | 1.5B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH) |
| `sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH` | 3B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH) |
| `sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH` | 7B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH) |
| `sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH` | 14B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH) |
| `sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH` | 1.5B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH) |
| `sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH` | 3B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH) |
| `sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH` | 7B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH) |
| `sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH` | 14B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH) |
## Citation

```bibtex
@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}
```