---
base_model: Qwen/Qwen2.5-1.5B
datasets:
- math
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
---

# Qwen2.5-1.5B-Intuitor-MATH-1EPOCH
An Intuitor-fine-tuned version of Qwen2.5-1.5B trained on the MATH dataset.

This model is part of the work presented in the paper [**Learning to Reason without External Rewards**](https://huggingface.co/papers/2505.19590).

## Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable.

## Overview
**Intuitor** is a reinforcement learning method that fine-tunes large language models (LLMs) using *self-certainty*—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm we call **Reinforcement Learning from Internal Feedback (RLIF)**.

<p align="center">
<img src="https://raw.githubusercontent.com/sunblaze-ucb/rlif/main/figs/rlif.png" alt="RLIF Overview" width="700"/>
</p>

### 🧭 What is RLIF?
**Reinforcement Learning from Internal Feedback (RLIF)** is a training framework where language models learn *without any external rewards, gold labels, or verifiers*. Instead, models improve by optimizing *intrinsic signals*—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.

Intuitor instantiates RLIF by using **self-certainty**—the model's confidence, measured as the KL divergence between its next-token distribution and a uniform distribution over the vocabulary, averaged over the generated tokens—as an intrinsic reward in the GRPO policy optimization algorithm.
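
As a rough illustration, the sketch below shows how a self-certainty reward and GRPO-style group-relative advantages could be computed from token logits. It is a simplified, hypothetical example rather than the training code from the official repository; it assumes self-certainty is the per-token KL divergence KL(U ‖ π) between a uniform distribution U over the vocabulary and the model's next-token distribution π, averaged over the generated tokens.

```python
import math

import torch
import torch.nn.functional as F

# Illustrative sketch only -- not the official Intuitor training code.
# Assumes: self-certainty(o | q) = mean_i KL(U || pi(. | q, o_<i)),
# i.e. the per-token KL divergence from uniform, averaged over the response.

def self_certainty(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Per-sequence self-certainty from next-token logits.

    logits:        (batch, seq_len, vocab_size) logits for the generated tokens
    response_mask: (batch, seq_len) with 1.0 for generated tokens, 0.0 for padding
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits.float(), dim=-1)  # log pi
    # KL(U || pi) = -log(V) - (1/V) * sum_j log pi_j
    kl_from_uniform = -math.log(vocab_size) - log_probs.mean(dim=-1)
    per_token = kl_from_uniform * response_mask
    return per_token.sum(dim=-1) / response_mask.sum(dim=-1).clamp(min=1.0)

def group_relative_advantages(rewards: torch.Tensor, group_size: int) -> torch.Tensor:
    """GRPO-style advantages: standardize rewards within each group of
    responses sampled for the same prompt (zero mean, unit variance)."""
    grouped = rewards.view(-1, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    return ((grouped - mean) / (std + 1e-6)).view(-1)

# Tiny demo: 2 prompts x 4 sampled responses, 5 generated tokens, toy vocab of 50
logits = torch.randn(8, 5, 50)
mask = torch.ones(8, 5)
rewards = self_certainty(logits, mask)              # shape (8,)
advantages = group_relative_advantages(rewards, 4)  # shape (8,)
print(rewards.shape, advantages.shape)
```

In Intuitor, these group-normalized self-certainty scores play the role that verifier-based rewards play in standard GRPO.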
<p align="center">
<img src="https://raw.githubusercontent.com/sunblaze-ucb/rlif/main/figs/intuitor.png" alt="Intuitor" width="700"/>
</p>

## Code
The official code for "Learning to Reason without External Rewards" and the Intuitor framework is available in the [GitHub repository](https://github.com/sunblaze-ucb/rlif).

## Usage

This model can be loaded and used directly with the Hugging Face `transformers` library. Below is a basic example of text generation using the Qwen2.5 chat template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use torch.float16 if bfloat16 is not supported by your GPU
    device_map="auto",
)
model.eval()  # Set the model to evaluation mode

# Define a conversation using the Qwen2.5 chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve the following math problem: What is the sum of the first 10 prime numbers?"},
]

# Apply the chat template to get the prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize the input and move it to the model's device
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate output
with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs,  # Passes input_ids and attention_mask
        max_new_tokens=256,
        do_sample=False,  # Greedy decoding for deterministic output
        pad_token_id=tokenizer.eos_token_id,  # Important for Qwen2.5
    )

# Decode the generated text, excluding the input prompt
generated_text = tokenizer.batch_decode(
    generated_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(generated_text)
```
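The example uses greedy decoding (`do_sample=False`) so that outputs are reproducible. For more varied solutions, enable sampling by setting `do_sample=True` together with a moderate `temperature` (e.g., 0.7).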
## Benchmarks

Intuitor achieves:

* Comparable performance to GRPO on in-domain math reasoning tasks (GSM8K, MATH500).
* Superior generalization to code generation (LiveCodeBench, CRUXEval).
* Improved instruction following, without needing any gold labels or verifiable test suites.

For detailed results, see Table 1 in the paper. The following checkpoints from the paper are available on the Hugging Face Hub:

| Model Name | Size | Method | Hugging Face Link |
| :--------- | :--- | :----- | :---------------- |
| `sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH` | 1.5B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH) |
| `sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH` | 3B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH) |
| `sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH` | 7B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH) |
| `sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH` | 14B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH) |
| `sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH` | 1.5B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH) |
| `sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH` | 3B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH) |
| `sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH` | 7B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH) |
| `sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH` | 14B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH) |
## Citation

```bibtex
@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}
```