---
license: apache-2.0
language:
- en
base_model:
- google/siglip-base-patch16-256-multilingual
- Qwen/Qwen2.5-3B-Instruct
pipeline_tag: image-text-to-text
tags:
- 2D_Medical_LVLMs
---

# MedM-VL-2D-3B-en

## Introduction

MedM-VL-2D-3B-en is a 2D medical large vision-language model (LVLM) trained on **2D** medical images and **English** medical text. It supports **report generation**, visual question answering (**VQA**), referring expression comprehension (**REC**), referring expression generation (**REG**), and **image classification**.

| | Config |
| :--- | :---: |
| Image encoder | google/siglip-base-patch16-256-multilingual |
| Connector | MLP (2-layer) |
| LLM | Qwen/Qwen2.5-3B-Instruct |
| Image resolution | 256×256 |
| Sequence length | 2048 |
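
As a rough illustration of how the image resolution and sequence length in the table interact, the sketch below estimates how many visual tokens one image might occupy. It assumes that the encoder splits a 256-pixel square image into non-overlapping 16-pixel patches (as the `patch16` in its name suggests) and that the connector passes one token per patch to the LLM. The second point is an assumption and is not stated in this card. The function names are hypothetical and are not part of the MedM-VL codebase.

```python
# Back-of-the-envelope token budget. The derived numbers are assumptions
# for illustration, not figures reported by the model authors.
IMAGE_SIZE = 256        # from the config table (256x256 input)
PATCH_SIZE = 16         # inferred from "patch16" in the encoder name
SEQUENCE_LENGTH = 2048  # from the config table


def visual_token_count(image_size: int, patch_size: int) -> int:
    """Return the number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of patch size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


def text_token_budget(sequence_length: int, visual_tokens: int) -> int:
    """Return tokens left for prompt and answer, assuming one token per patch."""
    if visual_tokens > sequence_length:
        raise ValueError("visual tokens exceed the sequence length")
    return sequence_length - visual_tokens


visual_tokens = visual_token_count(IMAGE_SIZE, PATCH_SIZE)
text_budget = text_token_budget(SEQUENCE_LENGTH, visual_tokens)
print(f"visual tokens per image: {visual_tokens}")
print(f"remaining text tokens:   {text_budget}")
```

Under these assumptions a single image would take 256 of the 2048 positions, leaving 1792 for the prompt and the generated text. If the connector pools or otherwise reduces the patch tokens, the real figure will differ.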

## Evaluation

| Benchmark | Med-Flamingo | LLaVA-Med | RadFM | **MedM-VL-2D-3B-en** |
| :--- | :---: | :---: | :---: | :---: |
| MedMNIST<sub>derma</sub> | 0.012 | 0.258 | 0.051 | **0.786** |
| MedMNIST<sub>organ</sub> | 0.089 | 0.668 | 0.189 | **0.808** |
| MedPix | 0.081 | **0.151** | - | 0.126 |
| MIMIC-CXR | **0.233** | 0.204 | 0.068 | 0.199 |
| PathVQA | 0.334 | 0.378 | 0.248 | **0.634** |
| SAMed<sub>identify</sub> | - | 0.458 | - | **0.693** |
| SAMed<sub>refer</sub> | - | 0.086 | - | **0.235** |
| SLAKE<sub>identify</sub> | - | 0.272 | - | **0.727** |
| SLAKE<sub>refer</sub> | - | 0.041 | - | **0.313** |
| SLAKE<sub>vqa</sub> | 0.215 | 0.337 | 0.817 | **0.841** |

## Quickstart

For usage instructions, please refer to the [MedM-VL](https://github.com/MSIIP/MedM-VL) repository.

## Citation

```bibtex
@article{shi2025medm,
  title={MedM-VL: What Makes a Good Medical LVLM?},
  author={Shi, Yiming and Yang, Shaoshuai and Zhu, Xun and Wang, Haoyu and Li, Miao and Wu, Ji},
  journal={arXiv preprint arXiv:2504.04323},
  year={2025}
}
```