|
|
|
# Image tasks with IDEFICS
|
[[open-in-colab]] |
|
While individual tasks can be tackled by fine-tuning specialized models, an alternative approach |
|
that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. |
|
For instance, large language models can handle such NLP tasks as summarization, translation, classification, and more. |
|
This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can |
|
solve image-text tasks with a large multimodal model called IDEFICS. |
|
IDEFICS is an open-access vision and language model based on Flamingo, |
|
a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image |
|
and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, |
|
create stories grounded in multiple images, and so on. IDEFICS comes in two variants - 80 billion parameters |
|
and 9 billion parameters, both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed |
|
versions of the model adapted for conversational use cases. |
|
This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, |
|
being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether |
|
this approach suits your use case better than fine-tuning specialized models for each individual task. |
|
In this guide, you'll learn how to: |
|
- Load IDEFICS and load the quantized version of the model |
|
- Use IDEFICS for:
  - Image captioning
  - Prompted image captioning
  - Few-shot prompting
  - Visual question answering
  - Image classification
  - Image-guided text generation
|
- Run inference in batch mode |
|
- Run IDEFICS instruct for conversational use |
|
Before you begin, make sure you have all the necessary libraries installed. |
|
|
|
```bash
pip install -q bitsandbytes sentencepiece accelerate transformers
```
|
|
|
To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory. |
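If you are not sure which option your hardware can handle, a quick sanity check (a minimal sketch using standard PyTorch APIs) is to print the total memory of the GPU you plan to use:

```py
import torch

# total memory of the first visible GPU, in GiB
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0: {total_gb:.1f} GB")
else:
    print("No CUDA device available")
```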
|
|
|
## Loading the model
|
Let's start by loading the model's 9 billion parameter checkpoint:
|
|
|
```py
checkpoint = "HuggingFaceM4/idefics-9b"
```
|
|
|
Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. |
|
The IDEFICS processor wraps a [LlamaTokenizer] and IDEFICS image processor into a single processor to take care of |
|
preparing text and image inputs for the model. |
|
|
|
```py
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
```
|
|
|
Setting device_map to "auto" will automatically determine how to load and store the model weights in the most optimized |
|
manner given existing devices. |
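If you want to see where the weights ended up, the placement chosen by Accelerate is recorded on the model as hf_device_map, which is populated whenever device_map is used:

```py
# maps each top-level module to the device it was assigned to (GPU index, "cpu", or "disk")
print(model.hf_device_map)
```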
|
### Quantized model
|
If high-memory GPU availability is an issue, you can load the quantized version of the model. To load the model and the |
|
processor in 4-bit precision, pass a BitsAndBytesConfig to the from_pretrained method, and the model will be compressed
|
on the fly while loading. |
|
|
|
```py
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    device_map="auto"
)
```
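To check how much memory the quantized weights actually occupy, you can query the model's footprint; get_memory_footprint is a standard Transformers method, and the exact number will depend on your setup:

```py
# parameter and buffer memory in bytes; with 4-bit loading expect roughly a quarter
# of the bfloat16 footprint (plus some quantization overhead)
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GB")
```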
|
|
|
Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for. |
|
## Image captioning
|
Image captioning is the task of predicting a caption for a given image. A common application is to help visually impaired
people navigate different situations, for instance, by exploring image content online.
|
To illustrate the task, get an image to be captioned, e.g.: |
|
|
|
Photo by Hendo Wang. |
|
IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the
model, only the preprocessed input image. Without a text prompt, the model will start generating text from the
BOS (beginning-of-sequence) token, thus creating a caption.

As image input to the model, you can use either an image object (PIL.Image) or a URL from which the image can be retrieved; a short PIL.Image sketch follows the example below.
|
|
|
```py
prompt = [
    "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
]

inputs = processor(prompt, return_tensors="pt").to("cuda")
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

```text
A puppy in a flower bed
```
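If you prefer to pass a PIL.Image instead of a URL, a minimal sketch looks like this; it downloads the same photo with requests, an extra dependency not otherwise used in this guide:

```py
import requests
from PIL import Image

url = "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80"
image = Image.open(requests.get(url, stream=True).raw)

# a PIL image can appear in the prompt anywhere a URL would
inputs = processor([image], return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```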
|
|
|
It is a good idea to include the bad_words_ids in the call to generate to avoid errors arising when increasing
max_new_tokens: otherwise the model will try to generate a new `<image>` or `<fake_token_around_image>` token when
no image is being generated.

You can set it on-the-fly as in this guide, or store it in the GenerationConfig as described in the Text generation strategies guide.
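As a minimal sketch of the second option, you can attach the banned token ids to the model's generation config once, after which they no longer need to be passed to every generate call:

```py
# store the banned token ids on the model; subsequent generate calls pick them up automatically
model.generation_config.bad_words_ids = bad_words_ids

generated_ids = model.generate(**inputs, max_new_tokens=10)
```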
|
|
|
## Prompted image captioning
|
You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take |
|
another image to illustrate: |
|
|
|
Photo by Denys Nevozhai. |
|
Textual and image prompts can be passed to the model's processor as a single list to create appropriate inputs. |
|
|
|
```py
prompt = [
    "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
    "This is an image of ",
]

inputs = processor(prompt, return_tensors="pt").to("cuda")
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

```text
This is an image of the Eiffel Tower in Paris, France.
```
|
|
|
## Few-shot prompting
|
While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with |
|
other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning.
|
By providing examples in the prompt, you can steer the model to generate results that mimic the format of given examples. |
|
Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model |
|
that in addition to learning what the object in an image is, we would also like to get some interesting information about it. |
|
Then, let's see if we can get the same response format for an image of the Statue of Liberty:
|
|
|
Photo by Juan Mayobre. |
|
|
|
prompt = ["User:", |
|
"https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", |
|
"Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n", |
|
"User:", |
|
"https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80", |
|
"Describe this image.\nAssistant:" |
|
] |
|
inputs = processor(prompt, return_tensors="pt").to("cuda") |
|
bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids |
|
generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids) |
|
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) |
|
print(generated_text[0]) |
|
User: Describe this image. |
|
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. |
|
User: Describe this image. |
|
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall. |
|
|
|
Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, |
|
feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.). |
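If you build such prompts regularly, a small helper can assemble an n-shot prompt from example pairs. The function below is a hypothetical convenience written for this guide, not part of the library:

```py
def build_few_shot_prompt(examples, query_image_url, instruction="Describe this image."):
    """Assemble an IDEFICS prompt list from (image_url, description) example pairs."""
    prompt = []
    for image_url, description in examples:
        # each shot: a User turn with the image, followed by the expected Assistant answer
        prompt += ["User:", image_url, f"{instruction}\nAssistant: {description}\n"]
    # the final query has the same structure, but the Assistant answer is left for the model to complete
    prompt += ["User:", query_image_url, f"{instruction}\nAssistant:"]
    return prompt
```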
|
## Visual question answering
|
Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image |
|
captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer
|
service (questions about products based on images), and image retrieval. |
|
Let's get a new image for this task: |
|
|
|
Photo by Jarritos Mexican Soda. |
|
You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions: |
|
|
|
```py
prompt = [
    "Instruction: Provide an answer to the question. Use the image to answer.\n",
    "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
    "Question: Where are these people and what's the weather like? Answer:"
]

inputs = processor(prompt, return_tensors="pt").to("cuda")
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

```text
Instruction: Provide an answer to the question. Use the image to answer.
Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.
```
|
|
|
## Image classification
|
IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing |
|
labeled examples from those specific categories. Given a list of categories and using its image and text understanding |
|
capabilities, the model can infer which category the image likely belongs to. |
|
Say, we have this image of a vegetable stand: |
|
|
|
Photo by Peter Wendt. |
|
We can instruct the model to classify the image into one of the categories that we have: |
|
|
|
```py
categories = ['animals', 'vegetables', 'city landscape', 'cars', 'office']
prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
          "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
          "Category: "
]

inputs = processor(prompt, return_tensors="pt").to("cuda")
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

```text
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables
```
|
|
|
In the example above, we instruct the model to classify the image into a single category; however, you can also prompt the model to do rank classification, as sketched below.
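A minimal sketch of rank classification, assuming the same model, processor, and categories as above: score each candidate label by the loss the model assigns to the prompt with that label appended, then rank by that score (a crude whole-sequence approximation of the label likelihood):

```py
import torch

image_url = "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80"

scores = {}
for category in categories:
    prompt = [
        "Instruction: Classify the following image into a single category.\n",
        image_url,
        f"Category: {category}",
    ]
    inputs = processor(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        # the language-modeling loss is the average negative log-likelihood of the sequence
        outputs = model(**inputs, labels=inputs["input_ids"])
    scores[category] = outputs.loss.item()

# lower loss means the model finds the labeled sequence more likely
print(sorted(scores, key=scores.get))
```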
|
## Image-guided text generation
|
For more creative applications, you can use image-guided text generation to generate text based on an image. This can be |
|
useful for creating product descriptions, ads, scene descriptions, and so on.
|
Let's prompt IDEFICS to write a story based on a simple image of a red door: |
|
|
|
Photo by Craig Tidball. |
|
|
|
prompt = ["Instruction: Use the image to write a story. \n", |
|
"https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80", |
|
"Story: \n"] |
|
inputs = processor(prompt, return_tensors="pt").to("cuda") |
|
bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids |
|
generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids) |
|
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) |
|
print(generated_text[0]) |
|
Instruction: Use the image to write a story. |
|
Story: |
|
Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world. |
|
|
|
One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat. |
|
The little girl ran inside and told her mother about the man. |
|
Her mother said, “Don’t worry, honey. He’s just a friendly ghost.” |
|
The little girl wasn’t sure if she believed her mother, but she went outside anyway. |
|
When she got to the door, the man was gone. |
|
The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep. |
|
He was wearing a long black coat and a top hat. |
|
The little girl ran |
|
|
|
Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost. |
|
|
|
For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help |
|
you significantly improve the quality of the generated output. Check out Text generation strategies |
|
to learn more. |
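For example, a minimal sketch that swaps the beam search used above for nucleus sampling, reusing the inputs and bad_words_ids from the story prompt (do_sample, temperature, and top_p are standard generate parameters; the values here are only illustrative):

```py
generated_ids = model.generate(
    **inputs,
    bad_words_ids=bad_words_ids,
    max_new_tokens=200,
    do_sample=True,    # sample instead of beam search
    temperature=0.7,   # values below 1.0 sharpen the next-token distribution
    top_p=0.9,         # nucleus sampling: restrict sampling to the top probability mass
)
```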
|
|
|
## Running inference in batch mode
|
All of the earlier sections illustrated using IDEFICS for a single example. In a very similar fashion, you can run inference
|
for a batch of examples by passing a list of prompts: |
|
|
|
```py
prompts = [
    [
        "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
        "This is an image of ",
    ],
    [
        "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
        "This is an image of ",
    ],
    [
        "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
        "This is an image of ",
    ],
]

inputs = processor(prompts, return_tensors="pt").to("cuda")
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")
```

```text
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.
```
|
|
|
## IDEFICS instruct for conversational use
|
For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: |
|
HuggingFaceM4/idefics-80b-instruct and HuggingFaceM4/idefics-9b-instruct. |
|
These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction |
|
fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings. |
|
Usage and prompting for the conversational use case are very similar to those for the base models:
|
|
|
```py
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

prompts = [
    [
        "User: What is in this image?",
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "<end_of_utterance>",

        "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",

        "\nUser:",
        "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
        "And who is that?<end_of_utterance>",

        "\nAssistant:",
    ],
]

# --batched mode
inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
# --single sample mode
# inputs = processor(prompts[0], return_tensors="pt").to(device)

# Generation args
exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")
```
|
|
|
|