---
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
tags:
- multimodal
- vision-language-model
- llava
- instruction-tuned
- phi-4
- vqa
base_model: microsoft/Phi-4-mini-instruct
---
# Model Card for LLaVA_MORE-phi_4-finetuning
🔥 **LLaVA-MORE**: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning 🔥
This model is part of the **LLaVA-MORE** family of Multimodal Large Language Models (MLLMs), presented in the paper [LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning](https://huggingface.co/papers/2503.15621).
LLaVA-MORE integrates recent language models with diverse visual backbones. A unified training protocol, applied consistently across all architectures, ensures fair comparisons and makes it possible to systematically explore the trade-offs between model size, architecture, and performance. This model, `LLaVA_MORE-phi_4-finetuning`, uses **Phi-4-mini-instruct** as its LLM backbone and is finetuned on the [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) dataset.
It is designed for multimodal reasoning, generation, and instruction following, and provides insights into the design of more effective MLLMs.
## Citation
If you make use of our work, please cite our paper:
```bibtex
@inproceedings{cocchi2025llava,
  title={{LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning}},
  author={Cocchi, Federico and Moratelli, Nicholas and Caffagni, Davide and Sarto, Sara and Baraldi, Lorenzo and Cornia, Marcella and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
  year={2025}
}
```
## Model Details
### Model Description
This is a checkpoint from the LLaVA-MORE family of MLLMs. It pairs the **Phi-4-mini-instruct** language model with a CLIP visual backbone (`openai/clip-vit-large-patch14-336`, as recorded in the checkpoint's `config.json`; see the short sketch after the list below to confirm this programmatically) and has been finetuned on the `LLaVA-Instruct-665K` dataset. The project provides a reproducible evaluation framework to guide future model development by systematically studying the impact of different LLMs and visual encoders, as well as factors such as image resolution and pre-training datasets.
- **Developed by:** Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia and Rita Cucchiara (AImageLab, University of Modena and Reggio Emilia).
- **Model type:** Multimodal Large Language Model (MLLM) / Vision-Language Model
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from model:** `microsoft/Phi-4-mini-instruct`
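As a quick check of which backbones a given checkpoint pairs, you can inspect its `config.json` directly. The sketch below uses `huggingface_hub`; the `mm_vision_tower` field name follows the LLaVA-style configuration and is an assumption to verify against the actual file.

```python
# Minimal sketch: download and inspect the checkpoint's config.json to confirm
# the LLM / visual-backbone pairing. Field names follow the LLaVA-style config
# (e.g. "mm_vision_tower"); check the downloaded file if they differ.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="aimagelab/LLaVA_MORE-phi_4-finetuning",
    filename="config.json",
)
with open(config_path) as f:
    config = json.load(f)

print(config.get("model_type"))        # e.g. "llava_phi"
print(config.get("mm_vision_tower"))   # e.g. "openai/clip-vit-large-patch14-336"
```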
### Model Sources
- **Repository:** [https://github.com/aimagelab/LLaVA-MORE](https://github.com/aimagelab/LLaVA-MORE)
- **Paper:** [https://huggingface.co/papers/2503.15621](https://huggingface.co/papers/2503.15621)
- **Project Website:** [https://aimagelab.ing.unimore.it/imagelab](https://aimagelab.ing.unimore.it/imagelab)
- **Hugging Face Collection:** [LLaVA-MORE Models](https://huggingface.co/collections/aimagelab/llava-more-66aa6c49167e190bf27e7be4)
- **Hugging Face Demo:** [https://huggingface.co/spaces/aimagelab/LLaVA-MORE](https://huggingface.co/spaces/aimagelab/LLaVA-MORE)
## Uses
### Direct Use
This model is intended for various multimodal reasoning, generation, and instruction-following tasks. It can be used to process visual inputs in conjunction with textual prompts to generate informative and relevant text responses. Typical applications include visual question answering, image captioning, and conversational AI involving images.
### Out-of-Scope Use
This model is not intended for generating harmful content, engaging in misinformation, or being deployed in applications without proper human oversight. As an AI model, it may hallucinate or produce factually incorrect information. It should not be used in safety-critical applications without thorough domain-specific evaluation and mitigation strategies.
## Bias, Risks, and Limitations
Given that the model is trained on large datasets, it may inherit biases present in the data, leading to biased outputs. Potential risks include generating offensive, inaccurate, or harmful content. Like all generative models, it may also hallucinate or provide factually incorrect information.
### Recommendations
Users should be aware of the inherent biases and limitations of MLLMs. It is recommended to apply human review to outputs, especially in sensitive applications. Further research and evaluation are needed to fully understand and mitigate potential societal impacts.
## How to Get Started with the Model
To get started with inference, you can use the `transformers`-based `run_llava.py` script from the project's GitHub repository, or call the model from Python directly (a minimal sketch follows the script example below).
First, install the necessary packages as described in the [GitHub Installation section](https://github.com/aimagelab/LLaVA-MORE#installation):
```bash
conda create -n more python==3.8.16
conda activate more
pip install -r requirements.txt # Refer to the GitHub repo for the exact requirements.txt
```
**Using the `run_llava.py` script (recommended for full functionality):**
```bash
cd ~/LLaVA-MORE # Navigate to the cloned LLaVA-MORE repository
source activate more
export PYTHONPATH=.
model_path=aimagelab/LLaVA_MORE-phi_4-finetuning # Adjust to the specific model path
model_architecture=llava_phi # Based on config.json
conversation=phi_4 # This might vary based on tokenizer config, check original LLaVA-MORE code for best match
export HF_TOKEN=hf_read_token # Replace with your Hugging Face read token if needed
export TOKENIZER_PATH=$model_path
python -u src/llava/eval/run_llava.py --model-path $model_path --model-architecture $model_architecture --conv-mode $conversation
```
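If you prefer to call the model from Python rather than through the CLI, the sketch below mirrors the quick-start pattern of the upstream LLaVA codebase that LLaVA-MORE extends. The module path, the `eval_model` helper, and the exact argument set are assumptions based on that upstream API and on the CLI flags above, so check `src/llava/eval/run_llava.py` in the LLaVA-MORE repository for the actual interface.

```python
# Minimal Python sketch (run from the LLaVA-MORE repo root with the "more" env
# active and PYTHONPATH set as above). The imports and the eval_model helper
# follow the upstream LLaVA quick-start API; the LLaVA-MORE fork may expose a
# slightly different module path or argument set, so adjust as needed.
from llava.eval.run_llava import eval_model
from llava.mm_utils import get_model_name_from_path

model_path = "aimagelab/LLaVA_MORE-phi_4-finetuning"

args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "model_architecture": "llava_phi",   # mirrors the --model-architecture CLI flag
    "query": "What is shown in this image?",
    "conv_mode": "phi_4",                # mirrors the --conv-mode CLI flag
    "image_file": "path/to/image.jpg",   # local path or URL to an input image
    "sep": ",",
    "temperature": 0.0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 256,
})()

eval_model(args)  # prints the generated answer
```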
## Training Details
### Training Data
The LLaVA-MORE models are trained in two stages (see the sketch after this list for fetching the corresponding annotation files):
- **Pretraining:** On the [LCS-558K dataset](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).
- **Finetuning:** On the [LLaVA-Instruct-665K dataset](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K).
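Both stages use JSON annotation files hosted in the linked dataset repositories. The sketch below fetches them with `huggingface_hub`; the file names are assumptions based on the upstream LLaVA releases, so verify them in each repository's file listing before running.

```python
# Sketch: fetch the stage-specific annotation files. The file names below are
# assumptions based on the upstream LLaVA releases -- check each dataset repo
# ("Files and versions" tab) for the exact names before running.
from huggingface_hub import hf_hub_download

pretrain_json = hf_hub_download(
    repo_id="liuhaotian/LLaVA-Pretrain",
    filename="blip_laion_cc_sbu_558k.json",  # assumed name of the LCS-558K annotations
    repo_type="dataset",
)
finetune_json = hf_hub_download(
    repo_id="liuhaotian/LLaVA-Instruct-150K",
    filename="llava_v1_5_mix665k.json",      # assumed name of the 665K instruction mix
    repo_type="dataset",
)
print(pretrain_json, finetune_json)
```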
### Training Procedure
The training employs a unified protocol consistently applied across all architectures to ensure fair comparisons and enhance reproducibility. The project publicly releases the source code and bash scripts for distributed training on HPC facilities with a SLURM scheduler. More details on the training procedure and hyperparameters can be found in the [Training section of the GitHub repository](https://github.com/aimagelab/LLaVA-MORE#training).
## Evaluation
### Benchmarks and Comparisons on Multimodal Instruction Datasets from the Literature
The table below presents the performance of LLaVA-MORE variants, including this model, compared to other LLaVA versions across various multimodal datasets. For the most up-to-date and complete evaluation results, please refer to the [Performance section in the GitHub repository](https://github.com/aimagelab/LLaVA-MORE#performance).
| Model Name | Text-VQA* | Science-QA | AI2D | SEED-vid | SEED-all | SEED-img | MMMU | MMBench-Cn | MMBench-En | POPE | GQA | MME-P | MME-C |
|----------------------|:----------: |:------------:|:------:|:----------:|:----------:|:----------:|:------:|:------------:|:------------:|:------:|:-----:|:--------:|:-------:|
| LLaVA-v1.5-7B | 58.2 | 69.0 | 56.4 | 42.0 | 61.6 | 66.8 | 34.2 | 56.5 | 65.3 | 85.6 | 62.4 | 1474.3 | 314.6 |
| LLaVA-v1.5-LLaMA3-8B | 57.6 | 74.2 | 60.7 | 42.0 | 64.3 | 70.1 | 37.3 | 65.4 | 70.3 | 85.4 | 63.5 | 1544.4 | 330.3 |
| **LLaVA-v1.5-LLaMA3_1-8B** | 58.4 | 76.3 | 61.8 | 42.4 | 64.1 | 69.8 | 39.4 | **68.2** | 72.4 | 85.1 | 63.6 | 1531.5 | **353.3** |
| **LLaVA-v1.5-LLaMA3_1-8B-S2** | 60.9 | 76.7 | 62.2 | 42.3 | 64.2 | 69.9 | 38.7 | 65.8 | 71.1 | 86.5 | 64.5 | **1563.8** | 293.2 |
| **LLaVA-v1.5-LLaMA3_1-8B-siglip** | 62.1 | **77.5** | 63.6 | **46.1** | 65.8 | 71.0 | 39.8 | **68.2** | **73.1** | 86.1 | 64.6 | 1531.0 | 315.4 |
| **LLaVA-v1.5-LLaMA3_1-8B-S2-siglip** | 63.5 | 77.1 | 62.7 | 44.7 | 65.5 | 71.0 | **40.0** | 68.0 | 71.8 | 86.0 | 64.9 | 1541.4 | 336.4 |
| **LLaVA-v1.5-Phi_4-4B** | 54.0 | 71.3 | 61.1 | 42.3 | 63.5 | 69.1 | 38.8 | 64.2 | 69.2 | 85.9 | 62.1 | 1372.2 | 281.1 |
| **LLaVA-v1.5-gemma_2-9B** | 60.7 | 75.4 | 64.8 | 44.1 | 64.5 | 69.9 | 37.9 | 65.9 | 71.9 | **86.8** | 64.2 | 1522.5 | 307.5 |
| **LLaVA-v1.5-gemma_2-9B-siglip2** | **66.7** | 76.2 | **65.3** | 46.0 | **67.5** | **73.1** | 38.7 | 68.0 | 72.0 | 86.1 | **65.6** | 1510.9 | 308.2 |
| **LLaVA-v1.5-Distill-LLaMA-8B** | 56.3 | 74.5 | 58.8 | 43.5 | 63.5 | 68.6 | 38.1 | 66.8 | 61.3 | 85.1 | 63.0 | 1495.1 | 295.0 |
\* TextVQA results are computed with OCR tokens in the input prompt. **Models in bold are LLaVA-MORE variants.**
## Checkpoints
For a complete list of all LLaVA-MORE checkpoints, you can refer to the [Hugging Face model collection](https://huggingface.co/collections/aimagelab/llava-more-66aa6c49167e190bf27e7be4).
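To enumerate those checkpoints programmatically, a small sketch using `huggingface_hub.get_collection` (available in recent `huggingface_hub` versions) is shown below.

```python
# Sketch: list all model checkpoints in the LLaVA-MORE Hugging Face collection.
from huggingface_hub import get_collection

collection = get_collection("aimagelab/llava-more-66aa6c49167e190bf27e7be4")
for item in collection.items:
    if item.item_type == "model":
        print(item.item_id)
```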
## Acknowledgments
We thank the [LLaVA](https://github.com/haotian-liu/LLaVA.git) team for open-sourcing a modular codebase to extend and train different models within the LLaVA family. We are also happy users of the [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval.git) library, which has significantly reduced the evaluation time of our checkpoints across different datasets.
We also thank [CINECA](https://www.hpc.cineca.it/systems/hardware/leonardo/) for the availability of high-performance computing resources used to train LLaVA-MORE. This work is supported by the PNRR-M4C2 project [FAIR - Future Artificial Intelligence Research](https://fondazione-fair.it/) and by the PNRR project [ITSERR - Italian Strengthening of ESFRI RI Resilience](https://www.itserr.it/).