Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models Paper • 2411.04996 • Published Nov 7, 2024 • 52
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents Paper • 2410.10594 • Published Oct 14, 2024 • 27
UI Agent Collection: a collection of algorithmic agents for user interfaces/interactions, program synthesis, and robotics • 357 items • Updated about 23 hours ago • 52
GUICourse: From General Vision Language Models to Versatile GUI Agents Paper • 2406.11317 • Published Jun 17, 2024 • 1
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images Paper • 2403.11703 • Published Mar 18, 2024 • 17
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Paper • 2406.18521 • Published Jun 26, 2024 • 30
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published May 31, 2024 • 24
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation Paper • 2405.14598 • Published May 23, 2024 • 14
RoHM: Robust Human Motion Reconstruction via Diffusion Paper • 2401.08570 • Published Jan 16, 2024 • 1
MultiBooth: Towards Generating All Your Concepts in an Image from Text Paper • 2404.14239 • Published Apr 22, 2024 • 9
Chameleon: Mixed-Modal Early-Fusion Foundation Models Paper • 2405.09818 • Published May 16, 2024 • 132
What matters when building vision-language models? Paper • 2405.02246 • Published May 3, 2024 • 104