---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- WaltonFuture/Multimodal-Cold-Start
- WaltonFuture/Multimodal-RL-Data
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

* 🐙 **GitHub Repo:** [waltonfuture/RL-with-Cold-Start](https://github.com/waltonfuture/RL-with-Cold-Start)
* 📜 **Paper (arXiv):** [Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start (arXiv:2505.22334)](https://arxiv.org/abs/2505.22334)

## Introduction

This model is presented in the paper "Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start". We present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both the 3B and 7B scales, with our 7B model showing substantial improvements over its base model (e.g., 66.3%→73.4% on MathVista, 62.9%→70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models.
*Figure: model comparison.*
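The RL stage uses GRPO, which samples a group of responses per prompt and computes advantages relative to the group's own reward statistics rather than a learned value baseline. The snippet below is a minimal sketch of that group-relative advantage computation, not the repository's implementation; the function name, group size, and 0/1 verifiable reward are illustrative assumptions (the actual reward design and training loop live in the GitHub repo).

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used in GRPO (illustrative sketch).

    `rewards` holds one scalar reward per sampled response to the SAME
    prompt (a "group"). GRPO normalizes each reward by the group mean
    and standard deviation instead of training a separate critic.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts for one prompt, scored 1.0 for a correct final
# answer and 0.0 otherwise (a rule-based, verifiable reward).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct rollouts get positive advantage
```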
### ✨ Key Highlights

* **Two-Stage Approach:** Combines supervised fine-tuning (SFT) as a "cold start" for structured chain-of-thought reasoning with reinforcement learning (RL) via GRPO for further refinement.
* **Enhanced Multimodal Reasoning:** Consistently outperforms both SFT-only and RL-only methods on challenging multimodal reasoning benchmarks.
* **State-of-the-Art Performance:** Achieves SOTA performance among open-source MLLMs at both 3B and 7B scales.
* **Significant Improvements:** The 7B model shows substantial gains over its base model (e.g., 73.4% on MathVista, 70.4% on We-Math), while the 3B model is competitive with several 7B models.
* **Practical Guidance:** Provides practical insights for developing advanced multimodal reasoning models.

## Sample Usage

You can easily load and use this model with the Hugging Face `transformers` library. Ensure you have `transformers` and `Pillow` installed:

```bash
pip install transformers Pillow
```

Below is an example demonstrating how to perform multimodal inference:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import torch

# Load the model and processor.
# Replace "WaltonFuture/Qwen2.5VL-3b-RLCS" with "WaltonFuture/Qwen2.5VL-7b-RLCS" for the 7B model.
model_id = "WaltonFuture/Qwen2.5VL-3b-RLCS"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Example image (replace with your image path or a PIL Image object).
# For example, download an image locally:
# import requests
# from io import BytesIO
# image_url = "https://www.ilusionviajera.com/wp-content/uploads/2021/04/paris-eiffel-tower-in-spring.jpg"
# response = requests.get(image_url)
# image = Image.open(BytesIO(response.content)).convert("RGB")
image_path = "path/to/your/image.jpg"  # Replace with your image path
image = Image.open(image_path).convert("RGB")

# Prepare the chat messages in the required multimodal format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail and answer any questions about it. For example, what is the main subject?"},
        ],
    }
]

# Apply the model's chat template to format the input.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Process text and image together; keep pixel_values alongside input_ids.
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate the response.
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens to a human-readable response.
generated = output_ids[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```

## Data Access

Our two-stage datasets are now available on Hugging Face:

| Stage | Data |
| :--------- | :------------------------------------------------------------------------------------------ |
| Cold Start | [Multimodal-Cold-Start](https://huggingface.co/datasets/WaltonFuture/Multimodal-Cold-Start) |
| RL | [Multimodal-RL-Data](https://huggingface.co/datasets/WaltonFuture/Multimodal-RL-Data) |
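Both datasets can be loaded with the Hugging Face `datasets` library. A minimal sketch follows; the split and column names are not documented here, so inspect the returned `DatasetDict` after loading:

```python
from datasets import load_dataset

# Cold-start SFT data with structured chain-of-thought traces.
cold_start = load_dataset("WaltonFuture/Multimodal-Cold-Start")

# Prompt data for the GRPO reinforcement learning stage.
rl_data = load_dataset("WaltonFuture/Multimodal-RL-Data")

print(cold_start)  # shows available splits and columns
print(rl_data)
```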
## Model Access

Our models are now available on Hugging Face:

| Backbone | Our model |
| :------------ | :------------------------------------------------------------------------------------------ |
| Qwen2.5-VL-7b | [Qwen2.5VL-7b-RL-with-Cold-Start](https://huggingface.co/WaltonFuture/Qwen2.5VL-7b-RLCS) |
| Qwen2.5-VL-3b | [Qwen2.5VL-3b-RL-with-Cold-Start](https://huggingface.co/WaltonFuture/Qwen2.5VL-3b-RLCS) |

## Acknowledgment

Our models are built upon the amazing [Qwen2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) family. We thank [EasyR1](https://github.com/hiyouga/EasyR1) and [ms-swift](https://github.com/modelscope/ms-swift) for their training code.

## Citation

If our work has been helpful to you, please consider citing it:

```bibtex
@article{wei2025advancing,
  title={Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start},
  author={Wei, Lai and Li, Yuting and Zheng, Kaipeng and Wang, Chen and Wang, Yue and Kong, Linghe and Sun, Lichao and Huang, Weiran},
  journal={arXiv preprint arXiv:2505.22334},
  year={2025}
}
```