Improve model card: Add transformers library, expand description, links, and usage
This PR significantly enhances the model card for `Qwen2.5-1.5B-Intuitor-MATH-1EPOCH` by:
* **Adding `library_name: transformers`** to the metadata, enabling direct usage via the Transformers library and ensuring the "how to use" button appears correctly.
* **Integrating a detailed overview** of the "Intuitor" method and the "Reinforcement Learning from Internal Feedback (RLIF)" framework, including key highlights and relevant figures from the paper's GitHub repository.
* **Including a direct link** to the official GitHub repository (`https://github.com/sunblaze-ucb/rlif`), providing easy access to the source code and further resources.
* **Adding a comprehensive sample usage snippet** for text generation with the `transformers` library, demonstrating how to interact with the model using its chat template.
* **Incorporating benchmarks and results** from the paper, giving users insight into the model's performance on various reasoning and generation tasks.
* **Adding a direct link to the paper** on the Hugging Face Papers platform ([https://huggingface.co/papers/2505.19590](https://huggingface.co/papers/2505.19590)).
These updates aim to provide a more complete, accurate, and user-friendly model card for the community.

@@ -1,22 +1,118 @@
 ---
 base_model: Qwen/Qwen2.5-1.5B
-license: apache-2.0
 datasets:
-
+- math
+language:
+- en
+license: apache-2.0
 metrics:
-
+- accuracy
 pipeline_tag: text-generation
-
-- en
+library_name: transformers
 ---
 
 # Qwen2.5-1.5B-Intuitor-MATH-1EPOCH
 
-**Description:**
-
 An Intuitor-fine-tuned version of Qwen2.5-1.5B trained on the MATH dataset.
 
-
+This model is part of the work presented in the paper [**Learning to Reason without External Rewards**](https://huggingface.co/papers/2505.19590).
+
+## Abstract
+
+Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable.
+
+## Overview
+
+**Intuitor** is a reinforcement learning method that fine-tunes large language models (LLMs) using *self-certainty*—the model's own internal confidence—as the sole reward. It is built on a novel paradigm we call **Reinforcement Learning from Internal Feedback (RLIF)**.
+
+<p align="center">
+  <img src="https://raw.githubusercontent.com/sunblaze-ucb/rlif/main/figs/rlif.png" alt="RLIF Overview" width="700"/>
+</p>
+
+### 🧭 What is RLIF?
+
+**Reinforcement Learning from Internal Feedback (RLIF)** is a training framework where language models learn *without any external rewards, gold labels, or verifiers*. Instead, models improve by optimizing *intrinsic signals*—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.
+
+Intuitor instantiates RLIF by using **self-certainty**—a model's confidence measured via KL divergence to uniform—as an intrinsic reward in the GRPO policy optimization algorithm.
+
+<p align="center">
+  <img src="https://raw.githubusercontent.com/sunblaze-ucb/rlif/main/figs/intuitor.png" alt="Intuitor" width="700"/>
+</p>
+
+## Code
+
+The official code for "Learning to Reason without External Rewards" and the Intuitor framework is available on the [GitHub repository](https://github.com/sunblaze-ucb/rlif).
+
+## Usage
+
+This model can be loaded and used directly with the Hugging Face `transformers` library. Below is a basic example for text generation using the Qwen2.5 chat template:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model_id = "sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH"
+
+# Load tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,  # Use torch.float16 if bfloat16 is not supported by your GPU
+    device_map="auto"
+)
+model.eval()  # Set model to evaluation mode
+
+# Define a conversation using the Qwen2.5 chat template
+messages = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "Solve the following math problem: What is the sum of the first 10 prime numbers?"}
+]
+
+# Apply chat template to get the prompt string
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+
+# Tokenize the input and move to device
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+# Generate output
+with torch.no_grad():
+    generated_ids = model.generate(
+        **model_inputs,
+        max_new_tokens=256,
+        do_sample=False,  # Greedy decoding for deterministic output
+        pad_token_id=tokenizer.eos_token_id  # Important for Qwen2.5
+    )
+
+# Decode the generated text, excluding the input prompt
+generated_text = tokenizer.batch_decode(generated_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
+print(generated_text)
+```
+
+## Benchmarks
+
+Intuitor achieves:
+
+* Comparable performance to GRPO on in-domain math reasoning tasks (GSM8K, MATH500).
+* Superior generalization to code generation (LiveCodeBench, CRUXEval).
+* Improved instruction following, without needing any gold labels or verifiable test suites.
+
+For detailed results, see Table 1 in the paper.
+
+| Model Name | Size | Method | Hugging Face Link |
+| :--------- | :--- | :----- | :---------------- |
+| `sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH` | 1.5B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH) |
+| `sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH` | 3B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH) |
+| `sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH` | 7B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH) |
+| `sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH` | 14B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH) |
+| `sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH` | 1.5B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH) |
+| `sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH` | 3B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH) |
+| `sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH` | 7B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH) |
+| `sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH` | 14B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH) |
 
 ## Citation
 
@@ -27,5 +123,4 @@ An Intuitor-fine-tuned version of Qwen2.5-1.5B trained on the MATH dataset.
 journal = {arXiv preprint arXiv:2505.19590},
 year = {2025}
 }
-```
-
+```
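
As background for the *self-certainty* reward that the new Overview section describes, here is a minimal, illustrative sketch of how such a score could be computed with `transformers` and turned into GRPO-style group-relative advantages. It is not taken from the official `sunblaze-ucb/rlif` code and is not part of this PR's diff; the exact formulation in the paper may differ, and the helper names `self_certainty` and `group_advantages` are hypothetical.

```python
# Illustrative sketch only -- not the official Intuitor/RLIF implementation.
# Assumptions (mine): self-certainty is approximated as the mean KL divergence
# from a uniform distribution over the vocabulary to the model's next-token
# distribution, averaged over the completion tokens, and GRPO-style advantages
# are obtained by normalizing these scores within a group of sampled completions.
import math

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()


def self_certainty(prompt_ids: torch.Tensor, completion_ids: torch.Tensor) -> torch.Tensor:
    """Mean KL(U || p) over the completion positions; higher means more confident."""
    input_ids = torch.cat([prompt_ids, completion_ids]).unsqueeze(0).to(model.device)
    with torch.no_grad():
        logits = model(input_ids).logits[0].float()  # (seq_len, vocab_size)
    # Logits at position t predict token t+1, so the completion tokens are
    # predicted by positions [len(prompt) - 1, len(prompt) - 1 + len(completion)).
    start = prompt_ids.shape[-1] - 1
    log_p = F.log_softmax(logits[start:start + completion_ids.shape[-1]], dim=-1)
    vocab_size = log_p.shape[-1]
    # KL(U || p) = -log(V) - (1/V) * sum_v log p_v, evaluated per position.
    kl_per_position = -math.log(vocab_size) - log_p.mean(dim=-1)
    return kl_per_position.mean()


def group_advantages(scores: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style group normalization: zero-mean, unit-variance within the group."""
    return (scores - scores.mean()) / (scores.std() + eps)


# Example: score a small group of candidate answers to the same question.
prompt_ids = tokenizer("What is 7 * 8?", return_tensors="pt").input_ids[0]
candidates = ["7 * 8 = 56.", "7 * 8 = 54."]
candidate_ids = [tokenizer(c, return_tensors="pt").input_ids[0] for c in candidates]
scores = torch.stack([self_certainty(prompt_ids, c) for c in candidate_ids])
print("self-certainty:", scores.tolist())
print("group-relative advantages:", group_advantages(scores).tolist())
```

In Intuitor, scores of this kind stand in for the verifiable rewards that GRPO would otherwise derive from gold answers or test cases, which is why no labeled data or external verifier is needed during fine-tuning.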