Caption Anything: Interactive Image Description with Diverse Multimodal Controls Paper • 2305.02677 • Published May 4, 2023
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning Paper • 2307.16525 • Published Jul 31, 2023
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos Paper • 2411.19772 • Published Nov 29, 2024
TIIF-Bench: How Does Your T2I Model Follow Your Instructions? Paper • 2506.02161 • Published Jun 2, 2025