DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Abstract
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value cache into latent vectors to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Code and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
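To make the dynamic tiling idea concrete, here is a minimal, hypothetical sketch of how such a strategy could work: pick the tile grid whose aspect ratio best matches the input image, resize the image to fit that grid, and slice it into fixed-size tiles alongside a downscaled global view. The tile size (384) and the tile-count cap are illustrative assumptions, not values stated in the abstract; see the paper for the actual configuration.

```python
from PIL import Image

TILE = 384       # assumed per-tile resolution fed to the vision encoder
MAX_TILES = 9    # assumed upper bound on the number of local tiles

def best_grid(w: int, h: int) -> tuple[int, int]:
    """Choose a (cols, rows) grid whose aspect ratio is closest to the image's."""
    candidates = [(c, r) for c in range(1, MAX_TILES + 1)
                  for r in range(1, MAX_TILES + 1) if c * r <= MAX_TILES]
    return min(candidates, key=lambda g: abs(g[0] / g[1] - w / h))

def tile_image(img: Image.Image) -> list[Image.Image]:
    """Split a high-resolution image into fixed-size tiles plus a global view."""
    cols, rows = best_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    # A downscaled global thumbnail is commonly prepended so the model keeps
    # whole-image context alongside the high-resolution tiles.
    return [img.resize((TILE, TILE))] + tiles
```

Selecting the grid by aspect-ratio match keeps resizing distortion small while letting the vision encoder always operate at one fixed per-tile resolution.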
Community
Give me the text from this picture
Guys, how do I input videos here?
Can I use a normal PyTorch tensor as input, or is there a setup like InternVL2's, which takes a video, turns it into frames, and then processes those frames (1 frame per second)?
Thanks guys, your work is very nice.
Also, how do I find your Discord or any place where you guys chat?
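On the video question above: this page doesn't document a native video API, but the workaround the commenter describes, sampling roughly one frame per second and passing the frames as multiple images, is a common approach. Below is a minimal sketch using OpenCV under that assumption; the actual multi-image inference call should come from the repo's README, not from this snippet.

```python
import cv2  # pip install opencv-python

def sample_frames(video_path: str, fps: float = 1.0) -> list:
    """Sample frames from a video at roughly `fps` frames per second,
    returning them as RGB arrays (one per sampled timestamp)."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(int(round(native_fps / fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # OpenCV decodes to BGR; convert to RGB for typical VLM preprocessors.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

frames = sample_frames("clip.mp4", fps=1.0)
# Feed `frames` to the model as a multi-image conversation; consult the
# DeepSeek-VL2 repository for the actual inference API.
```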
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow (2025)
- EVEv2: Improved Baselines for Encoder-Free Vision-Language Models (2025)
- Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI (2025)
- Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments (2025)
- MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing (2025)
- PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models (2025)
- SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend