MedM-VL-2D-3B-en

Introduction

MedM-VL-2D-3B-en is a 2D medical large vision-language model (LVLM) trained on 2D medical images and English medical text. It supports report generation, visual question answering (VQA), referring expression comprehension (REC), referring expression generation (REG), and image classification.

Config
Image encoder: google/siglip-base-patch16-256-multilingual
Connector: MLP (2-layer)
LLM: Qwen/Qwen2.5-3B-Instruct
Image resolution: 256 × 256
Sequence length: 2048
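
As a rough illustration of how these settings interact, the sketch below estimates how many of the 2048 sequence positions a single image would occupy. It assumes a patch size of 16 (read off the "patch16" part of the encoder checkpoint name), one token per patch with no extra class token, and a connector that maps tokens one-to-one without pooling. None of these are stated in this card, and the helper names are hypothetical.

```python
# Assumptions (not stated in the card): patch size 16, one token per patch,
# and a connector that keeps the token count unchanged.
IMAGE_SIZE = 256        # from the config: 256 x 256 input
PATCH_SIZE = 16         # assumed from "patch16" in the encoder name
SEQUENCE_LENGTH = 2048  # from the config


def image_token_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side


def remaining_text_tokens(sequence_length: int, image_tokens: int,
                          num_images: int = 1) -> int:
    """Positions left for text after reserving room for the image tokens."""
    return sequence_length - num_images * image_tokens


image_tokens = image_token_count(IMAGE_SIZE, PATCH_SIZE)
text_budget = remaining_text_tokens(SEQUENCE_LENGTH, image_tokens)
print(image_tokens, text_budget)  # 256 1792
```

Under these assumptions, one image takes 256 positions and leaves 1792 for the prompt and the generated text. If the connector pools or adds tokens, the budget changes accordingly.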

Evaluation

Benchmark           Med-Flamingo  LLaVA-Med  RadFM  MedM-VL-2D-3B-en
MedMNIST (derma)    0.012         0.258      0.051  0.786
MedMNIST (organ)    0.089         0.668      0.189  0.808
MedPix              0.081         0.151      -      0.126
MIMIC-CXR           0.233         0.204      0.068  0.199
PathVQA             0.334         0.378      0.248  0.634
SAMed (identify)    -             0.458      -      0.693
SAMed (refer)       -             0.086      -      0.235
SLAKE (identify)    -             0.272      -      0.727
SLAKE (refer)       -             0.041      -      0.313
SLAKE (vqa)         0.215         0.337      0.817  0.841
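
To make the table easier to summarize, the following sketch finds the top-scoring model on each benchmark. It uses only the numbers reported above and assumes that higher is better for every benchmark, which this card does not state since the metrics are not named. Entries shown as "-" are treated as not reported, and the variable and function names are illustrative.

```python
# Scores copied from the evaluation table; None marks a "-" entry.
# Assumption: higher is better on every benchmark (metrics are not named).
MODELS = ["Med-Flamingo", "LLaVA-Med", "RadFM", "MedM-VL-2D-3B-en"]
SCORES = {
    "MedMNIST derma": [0.012, 0.258, 0.051, 0.786],
    "MedMNIST organ": [0.089, 0.668, 0.189, 0.808],
    "MedPix": [0.081, 0.151, None, 0.126],
    "MIMIC-CXR": [0.233, 0.204, 0.068, 0.199],
    "PathVQA": [0.334, 0.378, 0.248, 0.634],
    "SAMed identify": [None, 0.458, None, 0.693],
    "SAMed refer": [None, 0.086, None, 0.235],
    "SLAKE identify": [None, 0.272, None, 0.727],
    "SLAKE refer": [None, 0.041, None, 0.313],
    "SLAKE vqa": [0.215, 0.337, 0.817, 0.841],
}


def best_models(row):
    """Return the model(s) with the highest reported score in one row."""
    reported = [(score, model) for score, model in zip(row, MODELS)
                if score is not None]
    top = max(score for score, _ in reported)
    return [model for score, model in reported if score == top]


leaders = {bench: best_models(row) for bench, row in SCORES.items()}
wins = [bench for bench, models in leaders.items()
        if "MedM-VL-2D-3B-en" in models]
print(len(wins), "of", len(SCORES))  # 8 of 10
```

Under that assumption, MedM-VL-2D-3B-en has the highest reported score on 8 of the 10 benchmarks. The exceptions are MedPix (LLaVA-Med) and MIMIC-CXR (Med-Flamingo).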

Quickstart

Please refer to the MedM-VL project for usage instructions.

Citation

@article{shi2025medm,
  title={MedM-VL: What Makes a Good Medical LVLM?},
  author={Shi, Yiming and Yang, Shaoshuai and Zhu, Xun and Wang, Haoyu and Li, Miao and Wu, Ji},
  journal={arXiv preprint arXiv:2504.04323},
  year={2025}
}