DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Abstract
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value cache into latent vectors to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Code and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
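To make the dynamic tiling idea concrete, here is a minimal, hypothetical sketch of how such a strategy could work: pick the tile grid whose aspect ratio best matches the input image, resize the image to fit that grid, and slice it into fixed-size tiles alongside a downscaled global view. The tile size (384) and the tile-count cap are illustrative assumptions, not values stated in the abstract; see the paper for the actual configuration.

```python
from PIL import Image

TILE = 384       # assumed per-tile resolution fed to the vision encoder
MAX_TILES = 9    # assumed upper bound on the number of local tiles

def best_grid(w: int, h: int) -> tuple[int, int]:
    """Choose a (cols, rows) grid whose aspect ratio is closest to the image's."""
    candidates = [(c, r) for c in range(1, MAX_TILES + 1)
                  for r in range(1, MAX_TILES + 1) if c * r <= MAX_TILES]
    return min(candidates, key=lambda g: abs(g[0] / g[1] - w / h))

def tile_image(img: Image.Image) -> list[Image.Image]:
    """Split a high-resolution image into fixed-size tiles plus a global view."""
    cols, rows = best_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    # A downscaled global thumbnail is commonly prepended so the model keeps
    # whole-image context alongside the high-resolution tiles.
    return [img.resize((TILE, TILE))] + tiles
```

Selecting the grid by aspect-ratio match keeps resizing distortion small while letting the vision encoder always operate at one fixed per-tile resolution.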
Community
Give me the text from this picture
Guys, how do I input videos here?
Can I use a normal PyTorch tensor as input, or is there a setup like InternVL2's, which takes a video, turns it into frames, and then processes those frames (1 frame per second)?
Thanks guys, your work is very nice.
Also, how do I find your Discord or any place where you guys chat?
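On the video question above: this page doesn't document a native video API, but the workaround the commenter describes, sampling roughly one frame per second and passing the frames as multiple images, is a common approach. Below is a minimal sketch using OpenCV under that assumption; the actual multi-image inference call should come from the repo's README, not from this snippet.

```python
import cv2  # pip install opencv-python

def sample_frames(video_path: str, fps: float = 1.0) -> list:
    """Sample frames from a video at roughly `fps` frames per second,
    returning them as RGB arrays (one per sampled timestamp)."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(int(round(native_fps / fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # OpenCV decodes to BGR; convert to RGB for typical VLM preprocessors.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

frames = sample_frames("clip.mp4", fps=1.0)
# Feed `frames` to the model as a multi-image conversation; consult the
# DeepSeek-VL2 repository for the actual inference API.
```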
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow (2025)
- EVEv2: Improved Baselines for Encoder-Free Vision-Language Models (2025)
- Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI (2025)
- Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments (2025)
- MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing (2025)
- PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models (2025)
- SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend