ImageBind: One Embedding Space To Bind Them All
Abstract
We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models and extends their zero-shot capabilities to new modalities simply by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
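The emergent applications mentioned in the abstract all operate on a single shared, normalized embedding space. The sketch below is not the official ImageBind API; it uses hypothetical, pre-computed per-modality embeddings to illustrate how cross-modal retrieval and composing modalities with arithmetic reduce to cosine similarity in that shared space.

```python
# Minimal sketch, assuming each modality encoder outputs vectors in one shared
# embedding space. The embeddings here are random stand-ins for illustration;
# in practice they would come from the per-modality encoders.
import torch
import torch.nn.functional as F

def l2norm(x: torch.Tensor) -> torch.Tensor:
    """L2-normalize embeddings so dot products are cosine similarities."""
    return F.normalize(x, dim=-1)

dim = 1024
image_emb = l2norm(torch.randn(1, dim))   # e.g. an image of a beach
audio_emb = l2norm(torch.randn(1, dim))   # e.g. the sound of waves
text_bank = l2norm(torch.randn(5, dim))   # candidate text embeddings

# Cross-modal retrieval: rank candidates from one modality (text) by cosine
# similarity to a query from another modality (audio).
retrieval_scores = audio_emb @ text_bank.T          # shape (1, 5)
best_text_idx = retrieval_scores.argmax(dim=-1)

# Composing modalities with arithmetic: sum embeddings from two modalities and
# re-normalize; the result acts as a joint "image + audio" query.
composed_query = l2norm(image_emb + audio_emb)
composed_scores = composed_query @ text_bank.T

print(best_text_idx.item(), composed_scores.argmax(dim=-1).item())
```

Because every modality is bound to images during training, such queries work even for modality pairs (e.g. audio and text) that were never observed together.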
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval (2025)
- Data-Efficient Generalization for Zero-shot Composed Image Retrieval (2025)
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning (2025)
- Occlusion-aware Text-Image-Point Cloud Pretraining for Open-World 3D Object Recognition (2025)
- LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps (2025)
- The Power of One: A Single Example is All it Takes for Segmentation in VLMs (2025)
- MMRL: Multi-Modal Representation Learning for Vision-Language Models (2025)