---
license: apache-2.0
language:
- en
base_model:
- ibm-granite/granite-vision-3.3-2b
library_name: transformers
---

# granite-vision-3.3-2b-embedding

**Model Summary:**
Granite-vision-3.3-2b-embedding is an efficient embedding model based on [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b). The model is specifically designed for multimodal document retrieval, enabling queries on documents with tables, charts, infographics, and complex layouts. It generates ColBERT-style multi-vector representations of pages. By removing the need for OCR-based text extraction, granite-vision-3.3-2b-embedding can help simplify and accelerate RAG pipelines.

**Evaluations:**
We evaluated granite-vision-3.3-2b-embedding alongside other leading ColBERT-style multimodal embedding models in the 1B-4B parameter range using two benchmarks: [ViDoRe V2](https://github.com/illuin-tech/vidore-benchmark/) and [Real-MM-RAG-Bench](https://huggingface.co/collections/ibm-research/real-mm-rag-bench-67d2dc0ddf2dfafe66f09d34), both of which specifically target complex multimodal document retrieval tasks.

## **NDCG@5 - ViDoRe V2**

| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
|---|---|---|---|---|---|
| ESG Restaurant Human | 51.1 | 68.4 | 65.8 | 62.4 | 62.3 |
| Economics Macro Multilingual | 49.9 | 56.5 | 55.4 | 47.4 | 48.3 |
| MIT Biomedical | 59.7 | 63.6 | 63.5 | 58.1 | 60.0 |
| ESG Restaurant Synthetic | 57.0 | 57.4 | 56.6 | 51.1 | 54.0 |
| ESG Restaurant Synthetic Multilingual | 55.7 | 57.4 | 57.2 | 47.6 | 53.5 |
| MIT Biomedical Multilingual | 56.5 | 61.1 | 62.5 | 50.5 | 53.6 |
| Economics Macro | 51.6 | 59.8 | 60.2 | 60.9 | 60.0 |
| **Avg (ViDoRe V2)** | **54.5** | **60.6** | **60.2** | **54.0** | **56.0** |

## **NDCG@5 - REAL-MM-RAG**

| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
|---|---|---|---|---|---|
| FinReport | 55 | 66 | 78 | 65 | 70 |
| FinSlides | 68 | 79 | 81 | 55 | 74 |
| TechReport | 78 | 86 | 88 | 83 | 84 |
| TechSlides | 90 | 93 | 92 | 91 | 93 |
| **Avg (REAL-MM-RAG)** | **73** | **81** | **85** | **74** | **80** |

- **Release Date**: June 11th, 2025
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Supported Input Format:** Currently, the model supports English instructions and images (PNG, JPEG) as input formats.

**Intended Use:**
The model is intended for enterprise applications that involve retrieval of visual and textual data. In particular, it is well suited for multimodal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever or alongside a text-based retriever.
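As noted in the model summary, each page is encoded as a ColBERT-style bag of vectors rather than a single embedding, and a query is scored against a page by late interaction: each query vector is matched to its most similar page vector (MaxSim), and the maxima are summed. The following is a minimal, self-contained sketch of that scoring using random tensors; the shapes follow the model card (128-dimensional vectors, 729 per page, as described under Model Architecture below), while the number of query vectors is an arbitrary illustrative choice.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: match each query vector to its most
    similar page vector (MaxSim), then sum the maxima over the query vectors.
    query_emb: [num_query_vectors, dim]; page_emb: [num_page_vectors, dim]."""
    sim = query_emb @ page_emb.T              # [num_query_vectors, num_page_vectors]
    return sim.max(dim=-1).values.sum()       # MaxSim per query vector, summed

# Illustrative shapes only: 128-dim vectors, 729 vectors per page,
# and 16 query vectors chosen arbitrarily for the example.
query = F.normalize(torch.randn(16, 128), dim=-1)
pages = [F.normalize(torch.randn(729, 128), dim=-1) for _ in range(3)]

scores = torch.stack([maxsim_score(query, p) for p in pages])
print(f"Late-interaction scores: {scores.tolist()} -> best page index: {int(scores.argmax())}")
```

In practice this scoring is exposed through `processor.score`, as shown in the Usage section below.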
### Usage

First, install the required packages:

```shell
pip install -q torch torchvision torchaudio
pip install transformers==4.50
```

Then run the code:

```python
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-vision-3.3-2b-embedding"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# ─────────────────────────────────────────────
# Inputs: Image + Text
# ─────────────────────────────────────────────
image_url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
print("\nFetching image...")
image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")

text = "A photo of a tiger"
print("Image and text inputs ready.")

# Process both inputs
print("Processing inputs...")
image_inputs = processor.process_images([image])
text_inputs = processor.process_queries([text])

# Move to the correct device
image_inputs = {k: v.to(device) for k, v in image_inputs.items()}
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

# ─────────────────────────────────────────────
# Run Inference
# ─────────────────────────────────────────────
with torch.no_grad():
    print("🔍 Getting image embedding...")
    img_emb = model(**image_inputs)

    print("✍️ Getting text embedding...")
    txt_emb = model(**text_inputs)

# ─────────────────────────────────────────────
# Score the similarity
# ─────────────────────────────────────────────
print("Scoring similarity...")
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)

print("\n" + "=" * 50)
print(f"📊 Similarity between image and text: {similarity.item():.4f}")
print("=" * 50)
```

### Use granite-vision-3.3-2b-embedding for MM RAG

For an example of MM-RAG using granite-vision-3.3-2b-embedding, refer to [this notebook](https://github.com/ibm-granite/granite-vision-models/blob/main/cookbooks/GraniteVisionEmbedding_MM-RAG_Notebook.ipynb).

**Model Architecture:**
The architecture of granite-vision-3.3-2b-embedding follows the [ColPali](https://arxiv.org/abs/2407.01449) approach and consists of the following components:

(1) Vision-language model: [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b).

(2) Projection layer: a linear layer that projects the hidden dimension of the vision-language model down to 128 and outputs 729 embedding vectors per image.

Scoring is computed using a MaxSim-based late-interaction mechanism.
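As a concrete illustration of this retrieval setup, the sketch below embeds a handful of page images and ranks them against a text query, reusing the API from the Usage section above. The page paths and the query are hypothetical placeholders, and it assumes, as with other ColPali-style processors, that `processor.score` returns a `[num_queries, num_pages]` score tensor.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-vision-3.3-2b-embedding"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Hypothetical paths: replace with your own document-page images.
page_files = ["page_01.png", "page_02.png", "page_03.png"]
pages = [Image.open(p).convert("RGB") for p in page_files]
query = "What was the total revenue in 2024?"  # example query

# Embed the pages and the query, then move inputs to the target device.
page_inputs = {k: v.to(device) for k, v in processor.process_images(pages).items()}
query_inputs = {k: v.to(device) for k, v in processor.process_queries([query]).items()}
with torch.no_grad():
    page_emb = model(**page_inputs)
    query_emb = model(**query_inputs)

# Assumption: processor.score returns a [num_queries, num_pages] late-interaction
# score tensor, as in other ColPali-style processors.
scores = processor.score(query_emb, page_emb, batch_size=1, device=device)
ranking = scores[0].argsort(descending=True).tolist()
print("Pages ranked by relevance:", [page_files[i] for i in ranking])
```

In a real RAG pipeline, the page embeddings would typically be computed once offline and stored, so that only the query embedding and the scoring step run at query time.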
**Training Data:**
Our training data is drawn entirely from DocFM, a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance) reports.

**Infrastructure:**
We trained granite-vision-3.3-2b-embedding on IBM's cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.

**Ethical Considerations and Limitations:**
The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. Granite-vision-3.3-2b-embedding is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.

Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-3.3-2b-embedding with ethical intentions and in a responsible way.

**Resources**
- 📄 Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
- 📄 Real-MM-RAG-Bench paper (ACL 2025) [here](https://arxiv.org/abs/2502.12342)
- 📄 ViDoRe V2 paper [here](https://www.arxiv.org/pdf/2505.17166)
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources