---
datasets:
- reasonseg
language: en
library_name: transformers
license: other
pipeline_tag: image-segmentation
tags:
- vision
- segmentation
---
# Seg-Zero-7B

This model is based on the paper [Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement](https://arxiv.org/abs/2503.06520).

Code: https://github.com/dvlab-research/Seg-Zero
## Model Overview
Seg-Zero introduces a novel framework for reasoning segmentation that addresses the limitations of traditional supervised fine-tuning methods, which often struggle with out-of-domain generalization and lack explicit reasoning processes. The framework features a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks.
Seg-Zero is trained exclusively via reinforcement learning with GRPO, without explicit reasoning data, achieving robust zero-shot generalization and emergent test-time reasoning capabilities. A sophisticated reward mechanism integrating both format and accuracy rewards guides the optimization. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%.

Seg-Zero demonstrates the following key features:
- Emergent Test-Time Reasoning: It generates a reasoning chain before producing the final segmentation mask.
- Reinforcement Learning Only: Trained exclusively using reinforcement learning, without any explicit supervised reasoning data.
- Superior Generalization: Achieves superior performance on both in-domain and out-of-domain data compared to supervised fine-tuning methods.
## Highlight Code Features

- Built on EasyR1 and veRL, which support model splitting during sampling and are more GPU-memory friendly.
- Supports both the Qwen2-VL and Qwen2.5-VL series of models.
- Implements rewards commonly used in object detection and object segmentation, including an IoU reward and an L1 reward.
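As a rough illustration of the two rewards named above, the sketch below computes an IoU reward and a thresholded L1 reward on `[x1, y1, x2, y2]` boxes. This is an assumption-laden sketch, not the repository's actual implementation; the threshold value and binarization are hypothetical.

```python
def iou_reward(pred, gt):
    """IoU between predicted and ground-truth boxes, both [x1, y1, x2, y2]."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

def l1_reward(pred, gt, threshold=10):
    """Reward 1.0 when the mean L1 distance of box coordinates is small.

    The threshold here is an illustrative value, not the repo's setting.
    """
    dist = sum(abs(p - g) for p, g in zip(pred, gt)) / 4
    return 1.0 if dist < threshold else 0.0
```

In practice such rewards are computed between the boxes/points the reasoning model emits and the ground-truth annotations, then summed with the format reward during GRPO training.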
## Model Architecture

Seg-Zero employs a decoupled architecture consisting of a reasoning model and a segmentation model. We manually design a sophisticated reward mechanism that integrates both format and accuracy rewards.
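As a loose sketch of how a format reward and an accuracy reward could combine, the snippet below checks whether a response follows a `<think>…</think><answer>…</answer>` template and adds an accuracy term supplied by an external metric. The template, equal weighting, and function names are assumptions for illustration, not the repository's exact reward code.

```python
import re

# Hypothetical template check: reasoning inside <think>, result inside <answer>.
FORMAT_PATTERN = re.compile(r"<think>.*</think>\s*<answer>.*</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_PATTERN.fullmatch(response.strip()) else 0.0

def total_reward(response: str, accuracy: float) -> float:
    """Combine the format check with an accuracy term (e.g. IoU in [0, 1]).

    Equal weighting is an illustrative choice, not the paper's setting.
    """
    return format_reward(response) + accuracy
```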

## Examples

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch

# Load the reasoning model and its processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Ricky06662/Seg-Zero-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Ricky06662/Seg-Zero-7B")
```
## Installation

```bash
git clone https://github.com/dvlab-research/Seg-Zero.git
cd Seg-Zero
conda create -n visionreasoner python=3.12
conda activate visionreasoner
pip install torch==2.6.0 torchvision==0.21.0
pip install -e .
```
## Inference

Download the pretrained models using the following scripts:

```bash
mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Ricky06662/VisionReasoner-7B
```

If you have trouble connecting to Hugging Face, consider using `export HF_ENDPOINT=https://hf-mirror.com`.

Then run inference using:

```bash
python inference_scripts/infer_multi_object.py
```
The default question is "What can I have if I'm thirsty?"

You will see the thinking process in the command line, for example:

> "The question asks for items that can be consumed if one is thirsty. In the image, there are two glasses that appear to contain beverages, which are the most likely candidates for something to drink. The other items, such as the salad, fruit platter, and sandwich, are not drinks and are not suitable for quenching thirst."

The resulting mask will be saved in the inference_scripts folder.
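For a sense of how such a response might be post-processed, the sketch below extracts a positional prompt from an `<answer>` tag, assuming it contains JSON. The tag layout and field name (`bbox`) are hypothetical; check the repo's inference scripts for the actual output format.

```python
import json
import re

def parse_answer(response: str):
    """Extract JSON from an assumed <answer>...</answer> tag, or None."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

# Illustrative response; the bbox values are made up for this example.
reply = (
    '<think>The glasses hold drinks.</think>'
    '<answer>{"bbox": [120, 80, 260, 310]}</answer>'
)
```

The extracted boxes/points would then be passed to the segmentation model (e.g. SAM2) as prompts.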

You can also provide your own image_path and text by:

```bash
python inference_scripts/infer_multi_object.py --image_path "your_image_path" --text "your question text"
```
## Evaluation

Evaluation Data: 🤗 ReasonSeg-Test 🤗 ReasonSeg-Val

```bash
bash evaluation_scripts/eval_reasonseg_visionreasoner.sh
```

Adjust `--batch_size` in the bash script based on your GPU memory. The gIoU will be printed in your command line.

Results in VisionReasoner are all evaluated with a single checkpoint, so we recommend VisionReasoner for evaluation on more tasks and benchmarks. In Seg-Zero, by contrast, the best results on different benchmarks come from different checkpoints: we simply evaluated all available checkpoints and recorded their best values. If you care about performance, we suggest evaluating all benchmarks with a single model and comparing against the values of our released checkpoint in your environment.
## Training

### 1. GRPO Training

The recommended setup for training the 7B model is a server with 4x 80GB GPUs or 8x 46GB GPUs.

Training Data: 🤗 MultiObject-1K 🤗 MultiObject-7K

Download the dataset using this script:

```bash
python training_scripts/download_dataset.py
```
If you have less GPU memory, try resizing the images and re-calculating the corresponding bbox/point coordinates. Remember to change the corresponding resize_size in evaluation and inference as well.
Download the pretrained models using the following scripts:

```bash
mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
```

Start training using this script:

```bash
bash training_scripts/run_visionreasoner_7b_4x80G.sh
```

(Optional) Or you can use:

```bash
bash training_scripts/run_visionreasoner_7b_8x46G.sh
```
If you have more GPU memory, you can try increasing the following hyper-parameters:

```bash
worker.actor.micro_batch_size_per_device_for_update=1 or 2 or 4 or 8 or 16 \
worker.actor.micro_batch_size_per_device_for_experience=1 or 2 or 4 or 8 or 16 \
```

If your GPUs have less memory, you can adjust the following config; the right numbers depend on your GPU memory:

```bash
worker.rollout.tensor_parallel_size=[your number between 1-4]
worker.rollout.gpu_memory_utilization=[your number between 0-1]
worker.rollout.n=[your number between 2-32]
```

(Optional) If you have 8x 140GB GPUs, you can try:

```bash
bash training_scripts/run_visionreasoner_7b.sh
```
### 2. Merge Checkpoint in Hugging Face Format

```bash
python3 training_scripts/model_merger.py --local_dir [path_to_your_actor_checkpoint]
```
## The GRPO Algorithm

Seg-Zero generates several samples, calculates their rewards, and then optimizes towards the samples that achieve higher rewards. To learn more about the GRPO algorithm, you can refer to Hugging Face's blog.
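The "optimize towards higher-reward samples" step can be sketched as GRPO's group-relative advantage: each sampled response is scored, and its advantage is its reward normalized against the mean and standard deviation of its own group. This is a minimal illustrative sketch, not the training code (which operates on batched tensors inside veRL).

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one group.

    `rewards` holds the scalar rewards of all samples drawn for one prompt;
    eps guards against a zero standard deviation when all rewards are equal.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples above the group mean get positive advantages and are reinforced; samples below it are discouraged, without needing a learned value function.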
## Citation

```bibtex
@article{liu2025segzero,
  title   = {Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement},
  author  = {Liu, Yuqi and Peng, Bohao and Zhong, Zhisheng and Yue, Zihao and Lu, Fanbin and Yu, Bei and Jia, Jiaya},
  journal = {arXiv preprint arXiv:2503.06520},
  year    = {2025}
}

@article{liu2025visionreasoner,
  title   = {VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning},
  author  = {Liu, Yuqi and Qu, Tianyuan and Zhong, Zhisheng and Peng, Bohao and Liu, Shu and Yu, Bei and Jia, Jiaya},
  journal = {arXiv preprint arXiv:2505.12081},
  year    = {2025}
}
```
## Acknowledgement

We would like to thank the following repos for their great work:

- This work is built upon EasyR1 and veRL.
- This work utilizes models from Qwen2-VL, Qwen2.5-VL and SAM2.