HeartofSheep's Collections

VLMs

updated Jul 3

  • DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

    Paper • 2503.12797 • Published Mar 17 • 32

  • CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

    Paper • 2503.12329 • Published Mar 16 • 26

  • GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

    Paper • 2503.10639 • Published Mar 13 • 53

  • SmolVLM: Redefining small and efficient multimodal models

    Paper • 2504.05299 • Published Apr 7 • 198

  • Kimi-VL Technical Report

    Paper • 2504.07491 • Published Apr 10 • 134

  • Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

    Paper • 2505.04921 • Published May 8 • 185

  • Seed1.5-VL Technical Report

    Paper • 2505.07062 • Published May 11 • 150

  • MMaDA: Multimodal Large Diffusion Language Models

    Paper • 2505.15809 • Published May 21 • 96

  • OmniGen2: Exploration to Advanced Multimodal Generation

    Paper • 2506.18871 • Published Jun 23 • 75