Qwen2.5-1.5B-Intuitor-MATH-1EPOCH
An Intuitor-fine-tuned version of Qwen2.5-1.5B trained on the MATH dataset.
This model is part of the work presented in the paper Learning to Reason without External Rewards.
Abstract
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable.
Overview
Intuitor is a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm we call Reinforcement Learning from Internal Feedback (RLIF).
🧠 What is RLIF?
Reinforcement Learning from Internal Feedback (RLIF) is a training framework where language models learn without any external rewards, gold labels, or verifiers. Instead, models improve by optimizing intrinsic signals—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.
Intuitor instantiates RLIF by using self-certainty—the model's confidence, measured as the KL divergence between its next-token distribution and the uniform distribution over the vocabulary—as the intrinsic reward inside the GRPO policy optimization algorithm. The sketch below illustrates both ingredients.
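As a rough, unofficial sketch of these two ingredients, the snippet below computes a self-certainty score from next-token logits (the mean KL divergence between the uniform distribution over the vocabulary and the model's predicted distribution) and turns the scores of a group of completions sampled for the same prompt into GRPO-style group-normalized advantages. The helper names self_certainty and group_relative_advantages are illustrative only; see the official repository for the exact formulation used in training.
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Illustrative self-certainty: mean KL(U || p) between a uniform distribution
    over the vocabulary and the model's next-token distribution, averaged over
    sequence positions. `logits` has shape [seq_len, vocab_size]."""
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits.float(), dim=-1)
    # For each position: KL(U || p) = -log(V) - (1/V) * sum_v log p_v
    kl_per_position = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_position.mean()

def group_relative_advantages(scores: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative GRPO-style advantages: normalize the self-certainty scores of a
    group of completions sampled for the same prompt, replacing an external reward."""
    return (scores - scores.mean()) / (scores.std() + eps)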
Code
The official code for "Learning to Reason without External Rewards" and the Intuitor framework is available in the project's GitHub repository.
Usage
This model can be loaded and used directly with the Hugging Face transformers library. Below is a basic example of text generation using the Qwen2.5 chat template:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use torch.float16 if bfloat16 is not supported by your GPU
    device_map="auto"
)
model.eval()  # Set model to evaluation mode

# Define a conversation using the Qwen2.5 chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve the following math problem: What is the sum of the first 10 prime numbers?"}
]

# Apply the chat template to get the prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the input and move it to the model's device
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate output
with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs,  # pass input_ids and attention_mask
        max_new_tokens=256,
        do_sample=False,  # greedy decoding for deterministic output (temperature is ignored when sampling is off)
        pad_token_id=tokenizer.eos_token_id  # set the pad token explicitly to avoid warnings
    )

# Decode the generated text, excluding the input prompt
generated_text = tokenizer.batch_decode(generated_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(generated_text)
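For illustration, the loaded model can also score its own answer with the self-certainty measure described in the Overview. The snippet below reuses model, model_inputs, and generated_ids from the example above, together with the illustrative self_certainty helper sketched earlier; it is meant for inspection only and is not part of the official training code.
# Score the generated answer with the model's own self-certainty (illustrative)
with torch.no_grad():
    logits = model(generated_ids).logits[0]  # [total_len, vocab_size]
    prompt_len = model_inputs.input_ids.shape[1]
    # Logits at position i predict token i + 1, so keep only the positions
    # that predict the generated (non-prompt) tokens
    answer_logits = logits[prompt_len - 1 : -1, :]
    print(f"Self-certainty of the generated answer: {self_certainty(answer_logits).item():.4f}")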
Benchmarks
Intuitor achieves:
- Comparable performance to GRPO on in-domain math reasoning tasks (GSM8K, MATH500).
- Superior generalization to code generation (LiveCodeBench, CRUXEval).
- Improved instruction following, without needing any gold labels or verifiable test suites.
For detailed results, see Table 1 in the paper.
Model Name | Size | Method | Hugging Face Link |
---|---|---|---|
sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH | 1.5B | Intuitor | View Model |
sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH | 3B | Intuitor | View Model |
sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH | 7B | Intuitor | View Model |
sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH | 14B | Intuitor | View Model |
sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH | 1.5B | GRPO | View Model |
sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH | 3B | GRPO | View Model |
sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH | 7B | GRPO | View Model |
sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH | 14B | GRPO | View Model |
Citation
@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}
Base model: Qwen/Qwen2.5-1.5B