MM-IFEngine: Towards Multimodal Instruction Following Paper • 2504.07957 • Published 12 days ago • 34
GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography Paper • 2504.07083 • Published 13 days ago • 23
RelightVid: Temporal-Consistent Diffusion Model for Video Relighting Paper • 2501.16330 • Published Jan 27 • 2
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference Paper • 2502.18411 • Published Feb 25 • 73
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning Paper • 2501.03226 • Published Jan 6 • 45
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models Paper • 2412.07674 • Published Dec 10, 2024 • 20
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models Paper • 2412.01824 • Published Dec 2, 2024 • 66
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models Paper • 2410.17637 • Published Oct 23, 2024 • 37
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction Paper • 2410.17247 • Published Oct 22, 2024 • 48
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree Paper • 2410.16268 • Published Oct 21, 2024 • 69
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output Paper • 2407.03320 • Published Jul 3, 2024 • 96
Make-it-Real: Unleashing Large Multimodal Model's Ability for Painting 3D Objects with Realistic Materials Paper • 2404.16829 • Published Apr 25, 2024 • 5
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model Paper • 2401.16420 • Published Jan 29, 2024 • 56
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation Paper • 2401.04092 • Published Jan 8, 2024 • 22
Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases Paper • 2312.15011 • Published Dec 22, 2023 • 18
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want Paper • 2312.03818 • Published Dec 6, 2023 • 34
GPT4Point: A Unified Framework for Point-Language Understanding and Generation Paper • 2312.02980 • Published Dec 5, 2023 • 10