---
license: mit
datasets:
- array/SAT
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2-VL-2B
tags:
- r1
pipeline_tag: image-text-to-text
---
# VisualThinker-R1-Zero
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

<div align="center">
  <img src="https://multimodal-r1.s3.us-west-1.amazonaws.com/TurningPoint_transparent.png" width="20%" alt="TurningPoint" />
</div>
<hr>
<div align="center" style="line-height: 1;">
  <a href="https://www.turningpoint-ai.com/" target="_blank" style="margin: 2px;">
    <img alt="Homepage" src="https://img.shields.io/badge/🐳Homepage-TurningPointAI-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <!-- <a href="https://chat.deepseek.com/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-DeepSeek%20R1-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a> -->
  <a href="https://huggingface.co/turningpoint-ai" target="_blank" style="margin: 2px;">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-TurningPoint%20AI-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <!-- <a href="https://discord.gg/Tc7c45Zzu5" target="_blank" style="margin: 2px;">
    <img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/>
  </a> -->
  <!-- <a href="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/qr.jpeg?raw=true" target="_blank" style="margin: 2px;">
    <img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a> -->
  <a href="https://x.com/TurningPointAI" target="_blank" style="margin: 2px;">
    <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-TurningPoint_AI-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<!-- <div align="center" style="line-height: 1;">
  <a href="https://github.com/deepseek-ai/DeepSeek-R1/blob/main/LICENSE" style="margin: 2px;">
    <img alt="License" src="https://img.shields.io/badge/License-MIT-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div> -->


<p align="center">
  <a href="https://arxiv.org/pdf/2503.05132"><b>Paper Link</b>👁️</a>
</p>


## 🚀 Introduction

DeepSeek-R1 recently demonstrated how reinforcement learning with a simple rule-based reward can enable the autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by ~2%. In addition, we share our failed attempts and insights in trying to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL to instruct models often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at
https://github.com/turningpoint-ai/VisualThinker-R1-Zero
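
As a concrete illustration of what a "simple rule-based reward" can look like, the sketch below combines a format reward (the response closes the `<think>` block opened by the prompt) with an accuracy reward (the final answer contains the ground-truth choice). This is an illustrative sketch only; the function names and exact scoring are assumptions rather than the rewards used in our training. See the project repository for the actual implementation.

```python
import re

# Illustrative rule-based reward sketch (not the exact functions used in training).

def format_reward(response: str) -> float:
    """1.0 if the response closes the <think> block opened by the prompt."""
    return 1.0 if re.search(r"</think>", response) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the text after </think> contains the ground-truth choice."""
    answer_part = response.split("</think>")[-1]
    return 1.0 if ground_truth.lower() in answer_part.lower() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    return format_reward(response) + accuracy_reward(response, ground_truth)

# Example:
# total_reward("...so the sofa is below the picture.</think> The answer is (B) below.", "(B) below")
# returns 2.0
```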


## 🔮 Highlights
1. We are the **first to successfully produce the emergent “aha moment” and increased response length** for multimodal reasoning on just a **non-SFT 2B model**.
2. We showed that **vision-centric** tasks could also benefit from improved reasoning capabilities.  

Similar to DeepSeek-R1, self-reflection behavior is also observed during our RL training on vision-centric reasoning tasks. The model exhibits an emergent ability to rethink and correct its mistakes:

```
. . .
Therefore, dark brown wooden bed with white blanket is not above the doorway.
But wait! I can think of something else.
Maybe it's just higher than above the doorway, but slightly lower than above the doorway.
. . .
```
## ⚙️ Requirements and Installation
* Python >= 3.10
* PyTorch == 2.0.1
* CUDA Version >= 11.7
* Install required packages:
```bash
# install transformers
pip install git+https://github.com/huggingface/transformers
# install qwen-vl utils
pip install qwen-vl-utils
```
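
As an optional sanity check (an illustrative snippet, not part of the release), you can verify the environment before loading the model:

```python
import torch
import transformers

# Illustrative environment check: confirms the source install of transformers
# and that a CUDA device is visible.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))

# The usage example below relies on these classes; an ImportError here
# usually means the transformers install is too old.
from transformers import AutoProcessor, AutoModelForImageTextToText  # noqa: F401
```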

## 💻 Model Downloads and Usage

```python
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load model directly
processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
model = AutoModelForImageTextToText.from_pretrained(
    "turningpoint-ai/VisualThinker-R1-Zero", torch_dtype="auto", device_map="auto"
)
model.eval()

# Prepare image input
image_url = "https://multimodal-r1.s3.us-west-1.amazonaws.com/demo_image.jpg"

# Prepare text input
question = "Considering the relative positions of the sofa and the picture in the image provided, where is the sofa located with respect to the picture? Select from the following choices.\n(A) above or \n(B) below"
prompt = f"A conversation between User and Assistant. The user asks a question about the image, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: {question} \nAssistant: Let me solve this step by step.\n<think>"

# Create message
message = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": "<image>" + prompt},
        ],
    }
]

# Process input
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))
text = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=text,
    images=image,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=1024, do_sample=True)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

# Get output
output_text = batch_output_text[0]
print(output_text)
```
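
Because the prompt already opens a `<think>` block, the generation typically continues the reasoning and then closes the tag before stating the answer. The snippet below is a hypothetical post-processing step under that assumption; sampled outputs may not always follow the format.

```python
# Hypothetical post-processing (assumes the model closes the <think> block opened in the prompt)
if "</think>" in output_text:
    reasoning, answer = output_text.split("</think>", 1)
    print("Reasoning:", reasoning.strip())
    print("Answer:", answer.strip())
else:
    # Format not followed; fall back to the raw output.
    print("Answer:", output_text.strip())
```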


## 🙌 Stay Connected!

We are always open to engaging discussions, collaborations, or even just sharing a virtual coffee. To get in touch or join our team, visit [TurningPoint AI](https://www.turningpoint-ai.com/)'s homepage for contact information.

## 📖 Acknowledgements

We sincerely thank [DeepSeek](https://github.com/deepseek-ai/DeepSeek-R1), [Open-R1](https://github.com/huggingface/open-r1), [QwenVL](https://github.com/QwenLM/Qwen2.5-VL), [Open-R1-Multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal), [R1-V](https://github.com/Deep-Agent/R1-V), [SAT](https://arxiv.org/abs/2412.07755), and [CV-Bench](https://cambrian-mllm.github.io/) for providing open source resources that laid the foundation of our project. 

## 🤝 Contributors

Here are the key contributors from [TurningPoint AI](https://www.turningpoint-ai.com/) to this project:

[Hengguang Zhou](https://hengguangzhou.github.io/)<sup>1</sup><sup>*</sup>, [Xirui Li](https://xirui-li.github.io/)<sup>1</sup><sup>*</sup>, [Ruochen Wang](https://ruocwang.github.io/)<sup>1</sup><sup>†</sup>, [Minhao Cheng](https://cmhcbb.github.io/)<sup>2</sup>, [Tianyi Zhou](https://tianyizhou.github.io/)<sup>3</sup> and [Cho-Jui Hsieh](https://web.cs.ucla.edu/~chohsieh/)<sup>1,4</sup>

<sup>*</sup> Project Leads, <sup>†</sup> Main Advisor  
<sup>1</sup>University of California, Los Angeles, <sup>2</sup>Penn State University, <sup>3</sup>University of Maryland and <sup>4</sup>Google Research

## ✏️ Citation
```
@misc{zhou2025r1zerosahamomentvisual,
      title={R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model}, 
      author={Hengguang Zhou and Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh},
      year={2025},
      eprint={2503.05132},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.05132}, 
}

```