Qwen2.5-1.5B-Intuitor-MATH-1EPOCH

An Intuitor-fine-tuned version of Qwen2.5-1.5B trained on the MATH dataset.

This model is part of the work presented in the paper Learning to Reason without External Rewards.

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable.

Overview

Intuitor is a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm we call Reinforcement Learning from Internal Feedback (RLIF).

[Figure: RLIF overview]

🧭 What is RLIF?

Reinforcement Learning from Internal Feedback (RLIF) is a training framework where language models learn without any external rewards, gold labels, or verifiers. Instead, models improve by optimizing intrinsic signals—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.

Intuitor instantiates RLIF by using self-certainty (the model's confidence in its own outputs, measured as the average per-token KL divergence between a uniform distribution and the model's predicted token distribution) as the intrinsic reward in the GRPO policy optimization algorithm.
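
As a rough illustration, the snippet below sketches how such a self-certainty score could be computed from the logits of a sampled completion. This is a minimal sketch based on the description above (average per-token KL divergence against a uniform distribution); the function name and exact normalization are illustrative, and the reference implementation is in the official GitHub repository.

import math
import torch
import torch.nn.functional as F

def self_certainty(completion_logits: torch.Tensor) -> torch.Tensor:
    """Illustrative self-certainty score for a single completion.

    completion_logits: (seq_len, vocab_size) logits for the generated tokens.
    Returns a scalar that grows as the per-token distributions become more
    peaked, i.e., as the model becomes more confident.
    """
    log_probs = F.log_softmax(completion_logits, dim=-1)  # (seq_len, vocab_size)
    vocab_size = completion_logits.size(-1)
    # Per-token KL(U || p) = -log(V) - (1/V) * sum_j log p_j
    kl_to_uniform = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_to_uniform.mean()  # average over generated tokens

# In Intuitor, these scores stand in for external rewards in GRPO: for each group
# of sampled completions, the scores are normalized within the group
# ((score - mean) / std) to form the advantages used in the policy update.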

[Figure: Intuitor framework]

Code

The official code for "Learning to Reason without External Rewards" and the Intuitor framework is available on the GitHub repository.

Usage

This model can be loaded and used directly with the Hugging Face transformers library. Below is a basic example for text generation using the Qwen2.5 chat template:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, # Use torch.float16 if bfloat16 is not supported by your GPU
    device_map="auto"
)
model.eval() # Set model to evaluation mode

# Define a conversation using the Qwen2.5 chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve the following math problem: What is the sum of the first 10 prime numbers?"}
]

# Apply chat template to get the prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the input and move to device
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate output
with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs, # Pass input_ids and attention_mask together
        max_new_tokens=256,
        do_sample=False, # Greedy decoding for deterministic output
        pad_token_id=tokenizer.eos_token_id # Qwen2.5 uses the EOS token for padding
    )

# Decode the generated text, excluding the input prompt
generated_text = tokenizer.batch_decode(generated_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(generated_text)
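
If deterministic output is not required (for example, when drawing several candidate solutions), sampling can be enabled instead of greedy decoding. The decoding settings below are illustrative defaults rather than values recommended by the paper, and the snippet reuses model_inputs from the example above.

# Sampled generation (illustrative settings)
with torch.no_grad():
    sampled_ids = model.generate(
        **model_inputs,
        max_new_tokens=256,
        do_sample=True,   # Enable sampling
        temperature=0.7,  # Illustrative value
        top_p=0.9,        # Illustrative value
        pad_token_id=tokenizer.eos_token_id
    )

sampled_text = tokenizer.batch_decode(sampled_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(sampled_text)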

Benchmarks

Intuitor achieves:

  • Comparable performance to GRPO on in-domain math reasoning tasks (GSM8K, MATH500).
  • Superior generalization to code generation (LiveCodeBench, CRUXEval).
  • Improved instruction following, without needing any gold labels or verifiable test suites.

For detailed results, see Table 1 in the paper.

| Model Name | Size | Method | Hugging Face Link |
|---|---|---|---|
| sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH | 1.5B | Intuitor | View Model |
| sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH | 3B | Intuitor | View Model |
| sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH | 7B | Intuitor | View Model |
| sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH | 14B | Intuitor | View Model |
| sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH | 1.5B | GRPO | View Model |
| sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH | 3B | GRPO | View Model |
| sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH | 7B | GRPO | View Model |
| sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH | 14B | GRPO | View Model |

Citation

@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}