VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning Paper • 2504.07960 • Published 10 days ago • 40
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step Paper • 2504.01956 • Published 18 days ago • 39
view article Article Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM Mar 12 • 392
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer Paper • 2503.07027 • Published Mar 10 • 28
microsoft/Phi-4-multimodal-instruct Automatic Speech Recognition • Updated 11 days ago • 634k • 1.31k
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation Paper • 2502.13128 • Published Feb 18 • 41
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation Paper • 2502.07870 • Published Feb 11 • 44
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices Paper • 2502.04363 • Published Feb 5 • 12
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces Paper • 2501.12909 • Published Jan 22 • 70
FashionComposer: Compositional Fashion Image Generation Paper • 2412.14168 • Published Dec 18, 2024 • 16