FUSION: Full Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Abstract
We introduce FUSION, a family of multimodal large language models (MLLMs) built on a fully integrated vision-language alignment paradigm. Unlike existing methods that rely primarily on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, which incorporates textual information into vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding, which recursively aggregates visual features conditioned on the textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop a Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method that prioritizes high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales (3B and 8B) and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks, and continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under the same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FUSION
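The released implementation lives in the linked repository; purely as an illustration, the PyTorch sketch below shows one plausible reading of the two core components named in the abstract. All module names, shapes, and the loss form are assumptions inferred from the abstract, not the paper's actual code.

```python
# Illustrative sketch only -- module names, shapes, and the loss form are
# guesses based on the abstract, not the released FUSION implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedVisionEncoder(nn.Module):
    """Text-Guided Unified Vision Encoding (sketch): condition vision tokens
    on the question text via cross-attention inside the vision pathway."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vision_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # vision_tokens: (B, N_v, D); text_tokens: (B, N_t, D)
        guided, _ = self.cross_attn(vision_tokens, text_tokens, text_tokens)
        return self.norm(vision_tokens + guided)

class RecursiveAlignmentDecoder(nn.Module):
    """Context-Aware Recursive Alignment Decoding (sketch): repeatedly
    re-aggregate visual features conditioned on the current textual context."""
    def __init__(self, dim: int = 768, num_heads: int = 8, num_rounds: int = 3):
        super().__init__()
        self.num_rounds = num_rounds
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, context: torch.Tensor, vision_tokens: torch.Tensor):
        # context: (B, N_c, D) textual hidden states from the LLM decoder
        for _ in range(self.num_rounds):
            update, _ = self.attn(context, vision_tokens, vision_tokens)
            context = context + update
        return context

def dual_supervised_mapping_loss(v2t, txt, t2v, vis):
    """Dual-Supervised Semantic Mapping Loss (sketch): supervise the
    cross-modal mapping in both directions to reduce the modality gap."""
    return F.mse_loss(v2t, txt) + F.mse_loss(t2v, vis)

if __name__ == "__main__":
    v, t = torch.randn(2, 196, 768), torch.randn(2, 16, 768)
    enc, dec = TextGuidedVisionEncoder(), RecursiveAlignmentDecoder()
    fused = dec(t, enc(v, t))
    print(fused.shape)  # torch.Size([2, 16, 768])
```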
Community
GitHub Code: https://github.com/starriver030515/FUSION
Model: starriver030515/FUSION-Model
Dataset: starriver030515/FUSION-Data
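For quick experimentation, here is a minimal, unverified sketch of fetching the released artifacts from the Hub; the repo IDs are taken from the links above, but the exact loading pipeline (model class, processor, chat template) should be checked against the GitHub README.

```python
# Minimal sketch; assumes FUSION-Model and FUSION-Data are standard Hub repos.
# The actual inference code is documented in the GitHub repository and is not
# reproduced here.
from huggingface_hub import snapshot_download
from datasets import load_dataset

weights_dir = snapshot_download(repo_id="starriver030515/FUSION-Model")
print("weights downloaded to:", weights_dir)

# The synthesized QA data; a large download, and a config/split name may be
# required depending on how the dataset repo is organized.
fusion_data = load_dataset("starriver030515/FUSION-Data")
```

The table below compares FUSION against recent MLLMs across standard benchmarks.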
Model | # Vis Tok. | MMB_EN | MMB_CN | VizWiz | POPE | MM-Vet | MME_P | MME_C | Seed-Image | HallB | LLaVA_W | MMStar | MME-RW | RWQA | CV-Bench | MMVP | AI2D | MathVista | MMMU | SQA | TextVQA | OCRBench | ChartQA | DocVQA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
**≤4B Model Comparison** | ||||||||||||||||||||||||
Qwen2.5VL 3B | - | 79.1 | 78.1 | - | 85.9 | 61.4 | 1592.4 | 607.5 | 74.0 | 46.6 | - | 56.3 | 53.1 | 65.4 | - | - | 81.4 | 61.2 | 51.2 | 79.3 | - | 82.8 | 84.0 | 93.9 |
InternVL2 4B | - | 78.5 | 73.9 | - | 84.6 | 50.5 | 1532.8 | 531.8 | 73.2 | 42.4 | - | 53.9 | 52.1 | 60.5 | - | - | 79.0 | 58.5 | 48.3 | 96.0 | 74.7 | 78.4 | 81.5 | 89.2 |
DeepSeek-VL2-Tiny | - | 74.6 | 72.1 | - | - | 52.5 | 1548.3 | 357.1 | 72.3 | 39.6 | - | 45.9 | - | 64.2 | - | - | 71.6 | 53.6 | 40.7 | - | 80.7 | 80.5 | 81.0 | 86.9 |
MM1.5 3B | - | - | - | - | 88.1 | 41.0 | 1478.4 | 319.6 | 72.4 | - | 73.0 | - | - | 56.9 | - | - | 65.7 | 44.4 | 37.1 | 85.8 | 76.5 | 65.7 | 74.2 | 87.5 |
Phi 3.5-Vision | - | 75.5 | 64.2 | 58.2 | 82.2 | 46.5 | 1473.4 | 412.1 | 69.9 | 53.3 | 68.8 | 49.0 | - | 53.5 | 69.3 | 67.7 | 77.4 | - | 43.3 | 89.0 | 61.1 | 59.8 | 72.0 | 75.9 |
Florence-VL 3B | 576 | 71.6 | 60.8 | 59.1 | 88.3 | 51.0 | 1498.7 | 403.9 | 70.6 | 58.1 | 71.1 | 44.9 | - | 60.4 | 70.2 | 64.7 | 73.8 | 52.2 | 41.8 | 84.6 | 69.1 | 63.0 | 70.7 | - |
FUSION 3B (ours) | 780 | 79.5 | 71.7 | 64.6 | 88.9 | 57.2 | 1595.9 | 416.5 | 74.6 | 51.4 | 84.7 | 52.4 | 41.5 | 65.1 | 76.4 | 76.0 | 78.9 | 54.3 | 44.7 | 87.1 | 71.8 | 60.0 | 75.7 | 70.9 |
FUSION-X 3B (ours) | 620 | 80.3 | 74.8 | 66.1 | 88.7 | 60.3 | 1582.1 | 440.0 | 75.3 | 51.9 | 85.2 | 50.9 | 41.7 | 63.7 | 78.3 | 78.1 | 79.2 | 54.9 | 44.2 | 87.3 | 73.9 | 63.7 | 75.8 | 71.1 |
FUSION-L 3B (ours) | 308 | 77.6 | 70.8 | 65.3 | 88.3 | 56.7 | 1573.7 | 406.8 | 74.1 | 48.7 | 77.6 | 44.7 | 39.5 | 61.8 | 76.2 | 77.0 | 77.3 | 48.6 | 43.4 | 85.6 | 71.4 | 56.9 | 67.7 | 63.5 |
**≥7B Model Comparison** | ||||||||||||||||||||||||
Qwen2VL 7B | - | 83.0 | 80.5 | - | 88.4 | 62.0 | 1639.2 | 637.1 | 76.0 | 50.6 | - | 60.7 | 57.4 | 70.1 | - | - | 83.0 | 58.2 | 54.1 | 85.5 | 84.3 | 86.6 | 83.0 | 94.5 |
InternVL2 8B | - | 81.7 | 81.2 | - | 86.9 | 54.2 | 1639.7 | 575.3 | 75.4 | 45.2 | - | 61.5 | 53.5 | 64.4 | - | - | 83.6 | 58.3 | 52.6 | 96.3 | 77.4 | 79.4 | 83.3 | 91.6 |
LLaVA-OneVision 8B | - | 81.7 | 78.0 | - | 87.2 | 58.8 | 1626.0 | 483.0 | 74.8 | 47.5 | 86.9 | 60.9 | 57.5 | 65.5 | - | - | 81.6 | 56.1 | 47.7 | 96.6 | 78.5 | 69.7 | 78.8 | 87.5 |
MM1.5 7B | - | - | - | - | 88.6 | 42.2 | 1514.9 | 346.4 | 73.4 | - | 74.2 | - | - | 62.5 | - | - | 72.2 | 47.6 | 41.8 | 89.6 | 76.5 | 63.5 | 88.1 | 78.2 |
Cambrian 8B | 576 | 75.9 | 67.9 | - | 87.4 | 48.0 | 1547.1 | - | 74.7 | 48.7 | 71.0 | 50.0 | - | 64.2 | 72.2 | 51.3 | 73.0 | 49.0 | 42.7 | 80.4 | 71.7 | 62.4 | 73.3 | 77.8 |
Florence-VL 8B | 576 | 76.2 | 69.5 | 59.1 | 89.9 | 56.3 | 1560.0 | 381.1 | 74.9 | 57.3 | 74.2 | 50.0 | - | 64.2 | 73.4 | 73.3 | 74.2 | 55.5 | 43.7 | 85.9 | 74.2 | 63.4 | 74.7 | - |
Eagle 8B | 1024 | 75.9 | - | - | - | - | 1559.0 | - | 76.3 | - | - | - | - | 66.5 | - | 71.6 | 76.1 | 52.7 | 43.8 | 84.3 | 77.1 | 62.6 | 80.1 | 86.6 |
FUSION 8B (ours) | 780 | 80.5 | 74.9 | 59.5 | 89.3 | 60.0 | 1592.3 | 396.1 | 77.2 | 52.6 | 86.9 | 52.4 | 46.0 | 65.2 | 78.7 | 78.7 | 80.4 | 56.6 | 43.1 | 89.2 | 77.3 | 63.8 | 80.3 | 78.6 |
FUSION-X 8B (ours) | 620 | 82.0 | 76.2 | 62.9 | 88.8 | 60.0 | 1607.5 | 337.2 | 78.2 | 51.4 | 88.0 | 52.7 | 44.7 | 66.1 | 79.2 | 79.9 | 81.4 | 59.4 | 42.2 | 90.3 | 74.7 | 66.6 | 79.8 | 77.8 |
FUSION-L 8B (ours) | 308 | 80.0 | 73.6 | 59.9 | 88.5 | 57.3 | 1601.7 | 338.9 | 75.9 | 46.7 | 82.1 | 49.3 | 42.3 | 65.1 | 78.2 | 76.7 | 79.2 | 55.2 | 41.8 | 88.3 | 72.8 | 59.5 | 73.0 | 66.0 |
With only 630 vision tokens, FUSION-X outperforms Cambrian-1 and Florence-VL, matches LLaVA-OneVision, and nearly reaches the performance of top models such as InternVL2 and Qwen2VL. Even with around 300 vision tokens, FUSION-L retains 95% of its original performance, staying on par with Florence-VL.
Notably, FUSION-X 3B achieves the highest MMBench score among models under 4B parameters, even surpassing Qwen2.5VL 3B!
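As a quick sanity check on the retention figure, the snippet below recomputes the FUSION-L 3B vs. FUSION-X 3B ratio from a handful of scores copied out of the table (an arbitrary benchmark subset); on this subset the ratio comes out slightly above the quoted ~95%.

```python
# Scores copied from the table above (FUSION-X 3B vs. FUSION-L 3B).
fusion_x = {"MMB_EN": 80.3, "POPE": 88.7, "Seed-Image": 75.3, "AI2D": 79.2, "SQA": 87.3}
fusion_l = {"MMB_EN": 77.6, "POPE": 88.3, "Seed-Image": 74.1, "AI2D": 77.3, "SQA": 85.6}
retention = sum(fusion_l[k] / fusion_x[k] for k in fusion_x) / len(fusion_x)
print(f"average retention on this subset: {retention:.1%}")  # ~98.1%
```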
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering (2025)
- BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries (2025)
- The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer (2025)
- Aligning Vision to Language: Text-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning (2025)
- Breaking the Encoder Barrier for Seamless Video-Language Understanding (2025)
- Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs (2025)
- Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection (2025)