nielsr (HF Staff) committed (verified)
Commit 9eaf86a · 1 Parent(s): f4f2c8f

Improve model card: Add transformers library, expand description, links, and usage


This PR significantly enhances the model card for `Qwen2.5-1.5B-Intuitor-MATH-1EPOCH` by:

* **Adding `library_name: transformers`** to the metadata, enabling direct usage via the Transformers library and ensuring the "how to use" button appears correctly.
* **Integrating a detailed overview** of the "Intuitor" method and the "Reinforcement Learning from Internal Feedback (RLIF)" framework, including key highlights and relevant figures from the paper's GitHub repository.
* **Including a direct link** to the official GitHub repository (`https://github.com/sunblaze-ucb/rlif`), providing easy access to the source code and further resources.
* **Adding a comprehensive sample usage** snippet for text generation with the `transformers` library, demonstrating how to interact with the model using its chat template.
* **Incorporating benchmarks and results** from the paper, giving users insight into the model's performance on various reasoning and generation tasks.
* **Adding a direct link to the paper** on the Hugging Face Papers platform ([https://huggingface.co/papers/2505.19590](https://huggingface.co/papers/2505.19590)).

These updates aim to provide a more complete, accurate, and user-friendly model card for the community.

Files changed (1)
  1. README.md +105 -10
README.md CHANGED
@@ -1,22 +1,118 @@
---
base_model: Qwen/Qwen2.5-1.5B
- license: apache-2.0
datasets:
- - math
+ - math
+ language:
+ - en
+ license: apache-2.0
metrics:
- - accuracy
+ - accuracy
pipeline_tag: text-generation
- language:
- - en
+ library_name: transformers
---

# Qwen2.5-1.5B-Intuitor-MATH-1EPOCH

- **Description:**
-
An Intuitor-fine-tuned version of Qwen2.5-1.5B trained on the MATH dataset.

- ---
+ This model is part of the work presented in the paper [**Learning to Reason without External Rewards**](https://huggingface.co/papers/2505.19590).
+
+ ## Abstract
+
+ Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable.
+
+ ## Overview
+
+ **Intuitor** is a reinforcement learning method that fine-tunes large language models (LLMs) using *self-certainty*—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm we call **Reinforcement Learning from Internal Feedback (RLIF)**.
+
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/sunblaze-ucb/rlif/main/figs/rlif.png" alt="RLIF Overview" width="700"/>
+ </p>
+
+ ### 🧭 What is RLIF?
+
+ **Reinforcement Learning from Internal Feedback (RLIF)** is a training framework where language models learn *without any external rewards, gold labels, or verifiers*. Instead, models improve by optimizing *intrinsic signals*—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.
+
+ Intuitor instantiates RLIF by using **self-certainty**—the model's confidence in its own output, measured as the KL divergence between its next-token distribution and the uniform distribution over the vocabulary—as the intrinsic reward inside the GRPO policy-optimization algorithm.
+
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/sunblaze-ucb/rlif/main/figs/intuitor.png" alt="Intuitor" width="700"/>
+ </p>
+
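+ As a concrete sketch of this mechanism, the snippet below shows how a self-certainty reward could be computed and converted into GRPO-style group-relative advantages. This is an illustrative reconstruction, not the repository's code: the helpers `self_certainty` and `group_relative_advantages` are hypothetical names, and it assumes self-certainty is the KL divergence from the uniform distribution, averaged over response tokens.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def self_certainty(logits: torch.Tensor) -> torch.Tensor:
+     """Mean KL(U || p) over response tokens (assumed definition).
+
+     logits: (seq_len, vocab_size) logits for one sampled response.
+     Higher values mean the output distribution is further from uniform,
+     i.e. the model is more confident.
+     """
+     log_probs = F.log_softmax(logits, dim=-1)  # log p_j at each position
+     vocab_size = logits.size(-1)
+     # KL(U || p) = -log(V) - (1/V) * sum_j log p_j, computed per position
+     kl_per_token = -torch.log(torch.tensor(float(vocab_size))) - log_probs.mean(dim=-1)
+     return kl_per_token.mean()  # average over the sequence
+
+ def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
+     """GRPO-style advantages: normalize rewards within a group of samples."""
+     return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
+
+ # For a group of G responses sampled for the same prompt:
+ # rewards = torch.stack([self_certainty(l) for l in group_logits])
+ # advantages = group_relative_advantages(rewards)  # drive the policy update
+ ```
+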
+ ## Code
+
+ The official code for "Learning to Reason without External Rewards" and the Intuitor framework is available in the [GitHub repository](https://github.com/sunblaze-ucb/rlif).
+
+ ## Usage
+
+ This model can be loaded and used directly with the Hugging Face `transformers` library. Below is a basic example of text generation using the Qwen2.5 chat template:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_id = "sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH"
+
+ # Load tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,  # Use torch.float16 if bfloat16 is not supported by your GPU
+     device_map="auto"
+ )
+ model.eval()  # Set model to evaluation mode
+
+ # Define a conversation using the Qwen2.5 chat template
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "Solve the following math problem: What is the sum of the first 10 prime numbers?"}
+ ]
+
+ # Apply chat template to get the prompt string
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # Tokenize the input and move to device
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Generate output (greedy decoding; sampling parameters such as
+ # temperature are ignored when do_sample=False)
+ with torch.no_grad():
+     generated_ids = model.generate(
+         **model_inputs,  # passes input_ids and attention_mask
+         max_new_tokens=256,
+         do_sample=False,  # For deterministic output
+         pad_token_id=tokenizer.eos_token_id  # Important for Qwen2.5
+     )
+
+ # Decode the generated text, excluding the input prompt
+ generated_text = tokenizer.batch_decode(generated_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
+ print(generated_text)
+ ```
+
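+ Alternatively, recent versions of `transformers` let you drive the model through the high-level `pipeline` API, which applies the chat template automatically for chat-style input. This is a minimal sketch; the prompt and generation settings are illustrative:
+
+ ```python
+ from transformers import pipeline
+
+ # Build a text-generation pipeline for the model
+ generator = pipeline(
+     "text-generation",
+     model="sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH",
+     device_map="auto",
+ )
+
+ messages = [
+     {"role": "user", "content": "What is 17 * 24?"}
+ ]
+
+ # For chat-style input the pipeline returns the continued conversation
+ result = generator(messages, max_new_tokens=128, do_sample=False)
+ print(result[0]["generated_text"][-1]["content"])  # assistant's reply
+ ```
+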
+ ## Benchmarks
+
+ Intuitor achieves:
+
+ * Comparable performance to GRPO on in-domain math reasoning tasks (GSM8K, MATH500).
+ * Superior generalization to code generation (LiveCodeBench, CRUXEval).
+ * Improved instruction following, without needing any gold labels or verifiable test suites.
+
+ For detailed results, see Table 1 in the paper.
+
+ | Model Name | Size | Method | Hugging Face Link |
+ | :--------- | :--- | :----- | :---------------- |
+ | `sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH` | 1.5B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH) |
+ | `sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH` | 3B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH) |
+ | `sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH` | 7B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH) |
+ | `sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH` | 14B | Intuitor | [View Model](https://huggingface.co/sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH) |
+ | `sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH` | 1.5B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH) |
+ | `sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH` | 3B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH) |
+ | `sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH` | 7B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH) |
+ | `sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH` | 14B | GRPO | [View Model](https://huggingface.co/sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH) |
 
## Citation

@@ -27,5 +123,4 @@ An Intuitor-fine-tuned version of Qwen2.5-1.5B trained on the MATH dataset.
  journal = {arXiv preprint arXiv:2505.19590},
  year = {2025}
}
- ```
-
+ ```