nielsr HF Staff committed on
Commit 47c9e15 · verified · 1 Parent(s): 9d6378f

Enhance model card for Seg-Zero-7B with detailed overview, usage, and training info


This PR significantly enhances the Seg-Zero-7B model card by integrating comprehensive details from the project's official GitHub repository and the paper abstract. Key improvements include:

- **Expanded Model Overview**: A more detailed description drawing from the paper's abstract, highlighting the unique reinforcement learning approach, decoupled architecture, and key performance metrics.
- **Visuals and Features**: Added overview and architecture diagrams, along with explicit lists of Seg-Zero's and the code's highlighted features, directly from the GitHub README.
- **Updated Installation and Usage**: Replaced outdated installation and inference instructions with the latest, more comprehensive details from the GitHub repository, including multi-object inference.
- **Examples Section**: Incorporated visual examples to demonstrate the model's output.
- **Evaluation and Training Guides**: Added dedicated sections for evaluation and training, including dataset links, recommended hardware, and scripts for reproducibility.
- **GRPO Algorithm Explanation**: Included a brief explanation and diagram of the underlying GRPO algorithm.
- **Citation and Acknowledgements**: Added the full BibTeX citations for the relevant papers and an acknowledgement section, ensuring proper attribution.

These updates provide a much richer and more practical resource for users interacting with the Seg-Zero-7B model on the Hugging Face Hub.

Files changed (1)
  1. README.md +171 -15
README.md CHANGED
@@ -1,11 +1,10 @@
-
  ---
  datasets:
  - reasonseg
  language: en
+ library_name: transformers
  license: other
  pipeline_tag: image-segmentation
- library_name: transformers
  tags:
  - vision
  - segmentation
@@ -13,13 +12,44 @@ tags:

  # Seg-Zero-7B

- This model is based on the paper [Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement](https://huggingface.co/papers/2503.06520). It uses a decoupled architecture with a reasoning model and a segmentation model. It's trained via reinforcement learning using GRPO without explicit reasoning data, leading to robust zero-shot generalization and emergent test-time reasoning.
+ This model is based on the paper [Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement](https://huggingface.co/papers/2503.06520).

  Code: https://github.com/dvlab-research/Seg-Zero

- ## Description
- This is a Seg-Zero-7B model. It introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate pixel-level masks.
+ ## Model Overview
+
+ Seg-Zero introduces a novel framework for reasoning segmentation that addresses the limitations of traditional supervised fine-tuning methods, which often struggle with out-of-domain generalization and lack explicit reasoning processes. The framework features a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks.
+
+ Seg-Zero is trained exclusively via reinforcement learning with GRPO, without explicit reasoning data, achieving robust zero-shot generalization and emergent test-time reasoning capabilities. A sophisticated reward mechanism integrating both format and accuracy rewards guides the optimization. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%.
+
+ <div align=center>
+ <img width="98%" src="https://huggingface.co/Ricky06662/Seg-Zero-7B/resolve/main/assets/overview.png"/>
+ </div>
+
+ Seg-Zero demonstrates the following key features:
+ 1. **Emergent Test-Time Reasoning**: It generates a reasoning chain before producing the final segmentation mask.
+ 2. **Reinforcement Learning Only**: Trained exclusively using reinforcement learning, without any explicit supervised reasoning data.
+ 3. **Superior Generalization**: Achieves superior performance on both in-domain and out-of-domain data compared to supervised fine-tuning methods.
+
+ ### Highlight Code Features
+
+ * Built on EasyR1 and veRL, which support model splitting during sampling and are more GPU-memory friendly.
+ * Supports both the Qwen2-VL and Qwen2.5-VL series of models.
+ * Implements rewards commonly used in object detection and object segmentation, including an IoU reward and an L1 reward (a sketch follows below).
+
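To make the reward bullets concrete, here is a minimal sketch of what an IoU reward over predicted boxes and an L1 reward over predicted points could look like. This is an illustration only, with assumed box/point formats and an assumed distance threshold; the actual reward functions live in the Seg-Zero/EasyR1 training code.

```python
# Illustrative sketch: assumed (x1, y1, x2, y2) boxes and (x, y) points,
# not the repository's implementation.
def iou_reward(pred_box, gt_box):
    """IoU between two boxes, used as a soft accuracy reward."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0

def l1_reward(pred_point, gt_point, threshold=100.0):
    """Binary reward based on the L1 distance between predicted and gold points
    (the threshold value here is an assumption)."""
    dist = abs(pred_point[0] - gt_point[0]) + abs(pred_point[1] - gt_point[1])
    return 1.0 if dist < threshold else 0.0

print(iou_reward((40, 60, 220, 340), (50, 70, 230, 350)))  # ~0.84
print(l1_reward((130, 200), (140, 210)))                   # 1.0
```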
+ ## Model Architecture
+
+ Seg-Zero employs a decoupled architecture comprising a reasoning model and a segmentation model, together with a manually designed, sophisticated reward mechanism that integrates both format and accuracy rewards.
+
+ <div align=center>
+ <img width="98%" src="https://huggingface.co/Ricky06662/Seg-Zero-7B/resolve/main/assets/pipeline.png"/>
+ </div>
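In code terms, the decoupled flow boils down to parsing the reasoning model's structured output and handing the positional prompt to the segmentation model. The `<think>`/`<answer>` tag format and helper names below are assumptions for illustration; the exact prompt template and SAM2 interface are defined in the repository.

```python
import json
import re

def parse_positional_prompt(reasoning_output: str):
    """Extract the positional prompt (bbox + points) that follows the
    reasoning chain. The tag format here is assumed for illustration."""
    answer = re.search(r"<answer>(.*?)</answer>", reasoning_output, re.S)
    prompt = json.loads(answer.group(1))
    return prompt["bbox"], prompt["points"]

# Stage 1: the reasoning model (Qwen2.5-VL) emits a chain of thought plus
# a positional prompt for the object it has identified.
reasoning_output = (
    "<think>The two glasses hold beverages and can quench thirst.</think>"
    '<answer>{"bbox": [40, 60, 220, 340], "points": [[130, 200]]}</answer>'
)
bbox, points = parse_positional_prompt(reasoning_output)

# Stage 2: the segmentation model (SAM2) turns the positional prompt into a
# pixel-level mask, e.g. sam2_predictor.predict(box=bbox, point_coords=points).
print(bbox, points)
```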
+
+ ## Examples
+
+ <div align=center>
+ <img width="98%" src="https://huggingface.co/Ricky06662/Seg-Zero-7B/resolve/main/assets/examples.png"/>
+ </div>

  ## Usage

@@ -37,26 +67,152 @@ tokenizer = Qwen2_5_VLForConditionalGeneration.from_pretrained("Ricky06662/Seg-Z
  ```bash
  git clone https://github.com/dvlab-research/Seg-Zero.git
  cd Seg-Zero
- conda create -n seg_zero python=3.11
- conda activate seg_zero
- pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
+ conda create -n visionreasoner python=3.12
+ conda activate visionreasoner
+ pip install torch==2.6.0 torchvision==0.21.0
  pip install -e .
- pip install sam2
- pip install matplotlib
  ```

  ## Inference

+ Download the pretrained model using the following script:
+ ```bash
+ mkdir pretrained_models
+ cd pretrained_models
+ git lfs install
+ git clone https://huggingface.co/Ricky06662/VisionReasoner-7B
+ ```
+
+ > [!TIP]
+ > If you encounter issues connecting to Hugging Face, consider using `export HF_ENDPOINT=https://hf-mirror.com`.
+
+ Then run inference using:
+ ```bash
- python inference_scripts/infer.py
+ python inference_scripts/infer_multi_object.py
+ ```
- The default question is:
- > "the unusual object in the image."
+ The default question is:
+ > "What can I have if I'm thirsty?"
+
- You will get the thinking process in the command line and the mask will be saved in the **inference_scripts** folder. You can also provide your own image_path and text:
+ You will get the thinking process in the command line, like:
+
+ > "The question asks for items that can be consumed if one is thirsty. In the image, there are two glasses that appear to contain beverages, which are the most likely candidates for something to drink. The other items, such as the salad, fruit platter, and sandwich, are not drinks and are not suitable for quenching thirst."
+
+ The mask will be saved in the **inference_scripts** folder.
+
+ <div align=center>
+ <img width="98%" src="https://huggingface.co/Ricky06662/Seg-Zero-7B/resolve/main/assets/test_output_multiobject.png"/>
+ </div>
+
+ You can also provide your own image_path and text by:
+ ```bash
- python inference_scripts/infer.py --image_path "your_image_path" --text "your question text"
+ python inference_scripts/infer_multi_object.py --image_path "your_image_path" --text "your question text"
+ ```
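For programmatic use, the reasoning model can also be loaded directly with transformers. This is a minimal, assumed loading sketch (the dtype and device settings are illustrative); `inference_scripts/infer_multi_object.py` remains the reference implementation for prompting and mask generation.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the Seg-Zero reasoning model (a Qwen2.5-VL fine-tune); settings here
# are illustrative assumptions, not the repository's exact configuration.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Ricky06662/Seg-Zero-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Ricky06662/Seg-Zero-7B")

# Prompted with an image and a question, the model returns a reasoning chain
# plus bbox/point coordinates, which SAM2 then converts into the final mask.
```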
+
+ ## Evaluation
+
+ Evaluation Data: [🤗 ReasonSeg-Test](https://huggingface.co/datasets/Ricky06662/ReasonSeg_test) [🤗 ReasonSeg-Val](https://huggingface.co/datasets/Ricky06662/ReasonSeg_val)
+
+ ```bash
+ bash evaluation_scripts/eval_reasonseg_visionreasoner.sh
+ ```
+ Adjust `--batch_size` in the bash script based on your GPU. You will then see the gIoU in your command line.
+ <div align=center>
+ <img width="98%" src="https://huggingface.co/Ricky06662/Seg-Zero-7B/resolve/main/assets/val_results.png"/>
+ </div>
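As a point of reference, the gIoU reported here is conventionally the mean of per-image IoUs (in contrast to cIoU, which accumulates intersections and unions over the whole set). A minimal sketch of that metric, assuming binary numpy masks:

```python
import numpy as np

def giou(pred_masks, gt_masks):
    """gIoU as the mean of per-image IoUs over binary masks."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

# A prediction covering the full 4x4 image against a 2x2 ground truth: IoU 0.25.
pred = [np.ones((4, 4), dtype=bool)]
gt = [np.pad(np.ones((2, 2), dtype=bool), (0, 2))]
print(giou(pred, gt))  # 0.25
```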
+
+ > [!NOTE]
+ > Results in VisionReasoner are evaluated with a single checkpoint. We recommend [VisionReasoner](https://github.com/dvlab-research/VisionReasoner) for evaluation on more tasks and more benchmarks.
+
+ > [!NOTE]
+ > In Seg-Zero, however, the best results on different benchmarks are obtained with different checkpoints: we evaluated all available checkpoints and reported the best value per benchmark. If you care about the performance, we suggest evaluating all benchmarks with a single model and comparing against the values of our released checkpoint in your environment.
+
+ ## Training
+
+ ### 1. GRPO Training
+
+ > [!NOTE]
+ > The recommended training setup for the 7B model is a server with 4x 80G GPUs or 8x 46G GPUs.
+
+ Training Data: [🤗 MultiObject-1K](https://huggingface.co/datasets/Ricky06662/VisionReasoner_multi_object_1k_840) [🤗 MultiObject-7K](https://huggingface.co/datasets/Ricky06662/VisionReasoner_multi_object_7k_840)
+
+ Download the dataset using this script:
  ```bash
+ python training_scripts/download_dataset.py
  ```

+ > [!TIP]
+ > If you have less GPU memory, try resizing the images and re-calculating the corresponding bbox/point coordinates (a sketch follows below). Remember to change the corresponding resize_size in evaluation and inference.
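Rescaling annotations alongside the image is simple but easy to get wrong; the sketch below shows the coordinate bookkeeping, assuming (x1, y1, x2, y2) boxes, (x, y) points, and (width, height) sizes. The helper name is hypothetical; check the repo's data scripts for the exact resize_size convention.

```python
def resize_annotations(bbox, points, orig_size, new_size):
    """Scale a bbox and its point prompts from orig_size to new_size.
    Assumed formats: bbox (x1, y1, x2, y2); points [(x, y), ...];
    sizes (width, height). Hypothetical helper, for illustration."""
    sx = new_size[0] / orig_size[0]
    sy = new_size[1] / orig_size[1]
    x1, y1, x2, y2 = bbox
    new_bbox = (x1 * sx, y1 * sy, x2 * sx, y2 * sy)
    new_points = [(x * sx, y * sy) for x, y in points]
    return new_bbox, new_points

# Shrinking an 840x840 sample to 560x560 scales every coordinate by 2/3.
print(resize_annotations((60, 90, 300, 450), [(120, 180)], (840, 840), (560, 560)))
```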

+ Download the pretrained model using the following script:
+ ```bash
+ mkdir pretrained_models
+ cd pretrained_models
+ git lfs install
+ git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
+ ```

+ Start training using this script:
+ ```bash
+ bash training_scripts/run_visionreasoner_7b_4x80G.sh
+ ```
+ (Optional) Or you can use:
+ ```bash
+ bash training_scripts/run_visionreasoner_7b_8x46G.sh
+ ```

+ You can try changing the following hyper-parameters if you have large GPU memory:
  ```bash
+ worker.actor.micro_batch_size_per_device_for_update=1 or 2 or 4 or 8 or 16 \
+ worker.actor.micro_batch_size_per_device_for_experience=1 or 2 or 4 or 8 or 16 \
  ```
+ If your GPU has less memory, you can change the following config values; the numbers depend on your GPU memory:
+ ```bash
+ worker.rollout.tensor_parallel_size=[your number between 1-4]
+ worker.rollout.gpu_memory_utilization=[your number between 0-1]
+ worker.rollout.n=[your number between 2-32]
+ ```
+
+ (Optional) If you have 8x 140G GPUs, you can try:
+ ```bash
+ bash training_scripts/run_visionreasoner_7b.sh
+ ```
+
+ ### 2. Merge Checkpoint in Hugging Face Format
+
+ ```bash
+ python3 training_scripts/model_merger.py --local_dir [path_to_your_actor_checkpoint]
+ ```
+
+ ## The GRPO Algorithm
+
+ <div align=center>
+ <img width="48%" src="https://huggingface.co/Ricky06662/Seg-Zero-7B/resolve/main/assets/rl_sample.png"/>
+ </div>
+
+ Seg-Zero generates several samples, calculates their rewards, and then optimizes towards the samples that achieve higher rewards.
+
+ > [!TIP]
+ > To learn more about the GRPO algorithm, you can refer to [Hugging Face's blog](https://huggingface.co/docs/trl/v0.15.2/en/grpo_trainer).
+
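Concretely, GRPO needs no value network: each prompt's sampled completions form a group, the reward function scores them, and each sample's advantage is its reward normalized against the group statistics. A minimal sketch of this group-relative advantage computation (illustrative; the actual trainer lives in EasyR1/veRL):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: each sample's reward normalized by the
    mean and standard deviation of its own rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Four rollouts for one prompt, each scored by format + accuracy rewards.
rewards = [1.8, 0.9, 1.2, 0.1]
print(group_relative_advantages(rewards))
# Samples above the group mean get positive advantages and are reinforced.
```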
+ ## Citation
+
+ ```bibtex
+ @article{liu2025segzero,
+   title   = {Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement},
+   author  = {Liu, Yuqi and Peng, Bohao and Zhong, Zhisheng and Yue, Zihao and Lu, Fanbin and Yu, Bei and Jia, Jiaya},
+   journal = {arXiv preprint arXiv:2503.06520},
+   year    = {2025}
+ }
+
+ @article{liu2025visionreasoner,
+   title   = {VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning},
+   author  = {Liu, Yuqi and Qu, Tianyuan and Zhong, Zhisheng and Peng, Bohao and Liu, Shu and Yu, Bei and Jia, Jiaya},
+   journal = {arXiv preprint arXiv:2505.12081},
+   year    = {2025}
+ }
+ ```
+
+ ## Acknowledgement
+
+ We would like to thank the following repos for their great work:
+
+ * This work is built upon [EasyR1](https://github.com/hiyouga/EasyR1) and [veRL](https://github.com/volcengine/verl).
+ * This work utilizes models from [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct), [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), and [SAM2](https://huggingface.co/facebook/sam2-hiera-large).