Submitted by minghaowu · 50 · The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks · 10 authors · 1
Submitted by longlian · 39 · Describe Anything: Detailed Localized Image and Video Captioning · 11 authors · 2
Submitted by Neph0s · 17 · BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation · 6 authors · 1
Submitted by zhangysk · 15 · IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs · 20 authors · 1
Submitted by yueyang2000 · 14 · CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning · 6 authors · 1
Submitted by Kaiyue · 13 · Personalized Text-to-Image Generation with Auto-Regressive Models · 4 authors · 2
Submitted by chenjoya · 11 · LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale · 6 authors · 1
Submitted by Zilence006 · 11 · Vidi: Large Multimodal Models for Video Understanding and Editing · 22 authors · 1
Submitted by thomasschmied · 10 · LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities · 5 authors · 1
Submitted by zhoutianyi · 8 · WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents · 7 authors · 3
Submitted by sayakpaul · 6 · From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning · 9 authors · 1
Submitted by theFoxofSky · 5 · RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild · 8 authors · 1
Submitted by ziqipang · 3 · MR. Video: "MapReduce" is the Principle for Long Video Understanding · 2 authors · 1
Submitted by QiYao-Wang · 3 · IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property · 23 authors · 1
Submitted by j-min · 3 · CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting · 4 authors · 1
Submitted by yoyolicoris · 1 · DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions · 7 authors · 1