Packing Input Frame Context in Next-Frame Prediction Models for Video Generation Paper • 2504.12626 • Published 6 days ago • 45
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models Paper • 2504.13122 • Published 5 days ago • 21
FlexIP: Dynamic Control of Preservation and Personality for Customized Image Generation Paper • 2504.07405 • Published 13 days ago • 11
ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration Paper • 2504.08591 • Published 11 days ago • 18
MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft Paper • 2504.08388 • Published 11 days ago • 39
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation Paper • 2504.08736 • Published 11 days ago • 47
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model Paper • 2504.08685 • Published 11 days ago • 120
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning Paper • 2504.09641 • Published 9 days ago • 15
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published 8 days ago • 239
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding Paper • 2504.09925 • Published 8 days ago • 38
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning Paper • 2504.07960 • Published 12 days ago • 45
DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion Paper • 2504.04010 • Published 18 days ago • 10
FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis Paper • 2504.04842 • Published 15 days ago • 32
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 130