arxiv:2412.10302

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Published on Dec 13, 2024

Abstract

We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters, respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Code and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
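
The abstract does not spell out how dynamic tiling picks a layout for a given aspect ratio. The sketch below is one plausible reading, not the authors' implementation: it enumerates tile grids and keeps the one that leaves the least padding after an aspect-preserving resize. The function name `choose_tile_grid`, the 384-pixel tile size, the cap of 9 tiles, and the tie-break toward fewer tiles are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def choose_tile_grid(width, height, tile=384, max_tiles=9):
    """Pick a (cols, rows) tile grid for an image of the given size.

    Hypothetical helper: `tile` and `max_tiles` are illustrative assumptions,
    not values taken from the abstract.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    best_key, best_grid = None, None
    for cols, rows in product(range(1, max_tiles + 1), repeat=2):
        if cols * rows > max_tiles:
            continue
        canvas_w, canvas_h = cols * tile, rows * tile
        # Largest aspect-preserving scale that still fits inside the canvas.
        # Fractions keep the comparison exact, so ties are resolved reliably.
        scale = min(Fraction(canvas_w, width), Fraction(canvas_h, height))
        used_area = (width * scale) * (height * scale)
        padding = canvas_w * canvas_h - used_area
        # Assumed tie-break: among equal padding, prefer fewer tiles.
        key = (padding, cols * rows)
        if best_key is None or key < best_key:
            best_key, best_grid = key, (cols, rows)
    return best_grid


if __name__ == "__main__":
    for size in [(384, 384), (768, 384), (384, 1152), (1920, 1080)]:
        print(size, "->", choose_tile_grid(*size))
```

A wide image ends up with more columns than rows and a tall image with the reverse, which is the behaviour the abstract's phrase "different aspect ratios" suggests. The released code should be treated as the reference for the actual selection rule.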

Community

Give me the text from this picture

Guys, how do I input videos here?

Can I pass a normal PyTorch tensor as input, or should I use something like the InternVL2 framework, which turns a video into frames (1 frame per second) and then processes those frames here?

Thanks guys, your work is very nice.

Also, how can I find your Discord or wherever you guys chat?
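
The 1-frame-per-second idea in the question above can be sketched independently of any model. The snippet below only computes which frame indices to keep. The function name `sample_frame_indices` is hypothetical, and whether DeepSeek-VL2 accepts the resulting frames as a multi-image input is not stated on this page, so that part should be checked against the repository.

```python
def sample_frame_indices(total_frames, native_fps, target_fps=1.0):
    """Return indices of frames to keep when downsampling to target_fps.

    Hypothetical helper: decoding the video and feeding the frames to a
    model are out of scope here.
    """
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")

    # Distance between kept frames, measured in source frames.
    # Clamped to 1 so an index is never emitted twice.
    step = max(native_fps / target_fps, 1.0)

    indices = []
    k = 0
    while True:
        idx = int(round(k * step))
        if idx >= total_frames:
            break
        indices.append(idx)
        k += 1
    return indices


if __name__ == "__main__":
    # A 3-second clip at 30 fps keeps one frame per second.
    print(sample_frame_indices(90, 30.0))
```

The selected frames could then be decoded with any video library and passed on as ordinary images.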
