|
|
|
# Image tasks with IDEFICS
|
[[open-in-colab]] |
|
While individual tasks can be tackled by fine-tuning specialized models, an alternative approach |
|
that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. |
|
For instance, large language models can handle such NLP tasks as summarization, translation, classification, and more. |
|
This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can |
|
solve image-text tasks with a large multimodal model called IDEFICS. |
|
IDEFICS is an open-access vision and language model based on Flamingo, |
|
a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image |
|
and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, |
|
create stories grounded in multiple images, and so on. IDEFICS comes in two variants - 80 billion parameters |
|
and 9 billion parameters, both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed |
|
versions of the model adapted for conversational use cases. |
|
This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, |
|
being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether |
|
this approach suits your use case better than fine-tuning specialized models for each individual task. |
|
In this guide, you'll learn how to: |
|
- Load IDEFICS and load the quantized version of the model |
|
- Use IDEFICS for:
  - Image captioning
  - Prompted image captioning
  - Few-shot prompting
  - Visual question answering
  - Image classification
  - Image-guided text generation
|
- Run inference in batch mode |
|
- Run IDEFICS instruct for conversational use |
|
Before you begin, make sure you have all the necessary libraries installed. |
|
|
|
```bash
pip install -q bitsandbytes sentencepiece accelerate transformers
```
|
|
|
To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory. |
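If you are not sure which option your hardware can handle, a quick sanity check (a minimal sketch using standard PyTorch APIs) is to print the total memory of the GPU you plan to use:

```py
import torch

# total memory of the first visible GPU, in GiB
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0: {total_gb:.1f} GB")
else:
    print("No CUDA device available")
```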
|
|
|
## Loading the model
|
Let's start by loading the model's 9 billion parameter checkpoint:
|
|
|
```py
checkpoint = "HuggingFaceM4/idefics-9b"
```
|
|
|
Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. |
|
The IDEFICS processor wraps a [LlamaTokenizer] and IDEFICS image processor into a single processor to take care of |
|
preparing text and image inputs for the model. |
|
|
|
```py
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
```
|
|
|
Setting device_map to "auto" will automatically determine how to load and store the model weights in the most optimized |
|
manner given existing devices. |
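If you want to see where the weights ended up, the placement chosen by Accelerate is recorded on the model as hf_device_map, which is populated whenever device_map is used:

```py
# maps each top-level module to the device it was assigned to (GPU index, "cpu", or "disk")
print(model.hf_device_map)
```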
|
### Quantized model
|
If high-memory GPU availability is an issue, you can load the quantized version of the model. To load the model and the |
|
processor in 4-bit precision, pass a BitsAndBytesConfig to the from_pretrained method, and the model will be compressed
|
on the fly while loading. |
|
|
|
```py
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    device_map="auto"
)
```
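To check how much memory the quantized weights actually occupy, you can query the model's footprint; get_memory_footprint is a standard Transformers method, and the exact number will depend on your setup:

```py
# parameter and buffer memory in bytes; with 4-bit loading expect roughly a quarter
# of the bfloat16 footprint (plus some quantization overhead)
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GB")
```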
|
|
|
Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for. |
|
## Image captioning
|
Image captioning is the task of predicting a caption for a given image. A common application is to help visually impaired
people navigate different situations, for instance, by exploring image content online.
|
To illustrate the task, get an image to be captioned, e.g.: |
|
|
|
Photo by Hendo Wang. |
|
IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the
model, only the preprocessed input image. Without a text prompt, the model will start generating text from the
BOS (beginning-of-sequence) token, thus creating a caption.

As image input to the model, you can use either an image object (PIL.Image) or a URL from which the image can be retrieved; a short PIL.Image sketch follows the example below.
|
|
|
```py
prompt = [
    "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
]

inputs = processor(prompt, return_tensors="pt").to("cuda")
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

```text
A puppy in a flower bed
```
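If you prefer to pass a PIL.Image instead of a URL, a minimal sketch looks like this; it downloads the same photo with requests, an extra dependency not otherwise used in this guide:

```py
import requests
from PIL import Image

url = "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80"
image = Image.open(requests.get(url, stream=True).raw)

# a PIL image can appear in the prompt anywhere a URL would
inputs = processor([image], return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```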
|
|
|
It is a good idea to include the bad_words_ids in the call to generate to avoid errors arising when increasing
max_new_tokens: otherwise the model will try to generate a new `<image>` or `<fake_token_around_image>` token when
no image is being generated.

You can set it on-the-fly as in this guide, or store it in the GenerationConfig as described in the Text generation strategies guide.
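As a minimal sketch of the second option, you can attach the banned token ids to the model's generation config once, after which they no longer need to be passed to every generate call:

```py
# store the banned token ids on the model; subsequent generate calls pick them up automatically
model.generation_config.bad_words_ids = bad_words_ids

generated_ids = model.generate(**inputs, max_new_tokens=10)
```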
|
|
|
## Prompted image captioning
|
You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take |
|
another image to illustrate: |
|
|
|
Photo by Denys Nevozhai. |
|
Textual and image prompts can be passed to the model's processor as a single list to create appropriate inputs. |
|
|
|
```py
prompt = [
    "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
    "This is an image of ",
]

inputs = processor(prompt, return_tensors="pt").to("cuda")
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

```text
This is an image of the Eiffel Tower in Paris, France.
```
|
|
|
## Few-shot prompting
|
While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with |
|
other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning.
|
By providing examples in the prompt, you can steer the model to generate results that mimic the format of given examples. |
|
Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model |
|
that in addition to learning what the object in an image is, we would also like to get some interesting information about it. |
|
Then, let's see if we can get the same response format for an image of the Statue of Liberty:
|
|
|
Photo by Juan Mayobre. |
|
|
|
prompt = ["User:", |
|
"https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", |
|
"Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n", |
|
"User:", |
|
"https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80", |
|
"Describe this image.\nAssistant:" |
|
] |
|
inputs = processor(prompt, return_tensors="pt").to("cuda") |
|
bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids |
|
generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids) |
|
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) |
|
print(generated_text[0]) |
|
User: Describe this image. |
|
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. |
|
User: Describe this image. |
|
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall. |
|
|
|
Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, |
|
feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.). |
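If you build such prompts regularly, a small helper can assemble an n-shot prompt from example pairs. The function below is a hypothetical convenience written for this guide, not part of the library:

```py
def build_few_shot_prompt(examples, query_image_url, instruction="Describe this image."):
    """Assemble an IDEFICS prompt list from (image_url, description) example pairs."""
    prompt = []
    for image_url, description in examples:
        # each shot: a User turn with the image, followed by the expected Assistant answer
        prompt += ["User:", image_url, f"{instruction}\nAssistant: {description}\n"]
    # the final query has the same structure, but the Assistant answer is left for the model to complete
    prompt += ["User:", query_image_url, f"{instruction}\nAssistant:"]
    return prompt
```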
|
## Visual question answering
|
Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image |
|
captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer
|
service (questions about products based on images), and image retrieval. |
|
Let's get a new image for this task: |
|
|
|
Photo by Jarritos Mexican Soda. |
|
You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions: |
|
|
|
```py
prompt = [
    "Instruction: Provide an answer to the question. Use the image to answer.\n",
    "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
    "Question: Where are these people and what's the weather like? Answer:"
]

inputs = processor(prompt, return_tensors="pt").to("cuda")
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

```text
Instruction: Provide an answer to the question. Use the image to answer.
Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.
```
|
|
|
## Image classification
|
IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing |
|
labeled examples from those specific categories. Given a list of categories and using its image and text understanding |
|
capabilities, the model can infer which category the image likely belongs to. |
|
Say, we have this image of a vegetable stand: |
|
|
|
Photo by Peter Wendt. |
|
We can instruct the model to classify the image into one of the categories that we have: |
|
|
|
```py
categories = ['animals', 'vegetables', 'city landscape', 'cars', 'office']
prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
          "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
          "Category: "
]

inputs = processor(prompt, return_tensors="pt").to("cuda")
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

```text
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables
```
|
|
|
In the example above, we instruct the model to classify the image into a single category; however, you can also prompt the model to do rank classification, as sketched below.
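A minimal sketch of rank classification, assuming the same model, processor, and categories as above: score each candidate label by the loss the model assigns to the prompt with that label appended, then rank by that score (a crude whole-sequence approximation of the label likelihood):

```py
import torch

image_url = "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80"

scores = {}
for category in categories:
    prompt = [
        "Instruction: Classify the following image into a single category.\n",
        image_url,
        f"Category: {category}",
    ]
    inputs = processor(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        # the language-modeling loss is the average negative log-likelihood of the sequence
        outputs = model(**inputs, labels=inputs["input_ids"])
    scores[category] = outputs.loss.item()

# lower loss means the model finds the labeled sequence more likely
print(sorted(scores, key=scores.get))
```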
|
## Image-guided text generation
|
For more creative applications, you can use image-guided text generation to generate text based on an image. This can be |
|
useful for creating product descriptions, ads, scene descriptions, and so on.
|
Let's prompt IDEFICS to write a story based on a simple image of a red door: |
|
|
|
Photo by Craig Tidball. |
|
|
|
prompt = ["Instruction: Use the image to write a story. \n", |
|
"https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80", |
|
"Story: \n"] |
|
inputs = processor(prompt, return_tensors="pt").to("cuda") |
|
bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids |
|
generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids) |
|
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) |
|
print(generated_text[0]) |
|
Instruction: Use the image to write a story. |
|
Story: |
|
Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world. |
|
|
|
One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat. |
|
The little girl ran inside and told her mother about the man. |
|
Her mother said, “Don’t worry, honey. He’s just a friendly ghost.” |
|
The little girl wasn’t sure if she believed her mother, but she went outside anyway. |
|
When she got to the door, the man was gone. |
|
The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep. |
|
He was wearing a long black coat and a top hat. |
|
The little girl ran |
|
|
|
Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost. |
|
|
|
For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help |
|
you significantly improve the quality of the generated output. Check out Text generation strategies |
|
to learn more. |
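For example, a minimal sketch that swaps the beam search used above for nucleus sampling, reusing the inputs and bad_words_ids from the story prompt (do_sample, temperature, and top_p are standard generate parameters; the values here are only illustrative):

```py
generated_ids = model.generate(
    **inputs,
    bad_words_ids=bad_words_ids,
    max_new_tokens=200,
    do_sample=True,    # sample instead of beam search
    temperature=0.7,   # values below 1.0 sharpen the next-token distribution
    top_p=0.9,         # nucleus sampling: restrict sampling to the top probability mass
)
```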
|
|
|
## Running inference in batch mode
|
All of the earlier sections illustrated using IDEFICS for a single example. In a very similar fashion, you can run inference
|
for a batch of examples by passing a list of prompts: |
|
|
|
```py
prompts = [
    [
        "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
        "This is an image of ",
    ],
    [
        "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
        "This is an image of ",
    ],
    [
        "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
        "This is an image of ",
    ],
]

inputs = processor(prompts, return_tensors="pt").to("cuda")
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")
```

```text
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.
```
|
|
|
## IDEFICS instruct for conversational use
|
For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: |
|
HuggingFaceM4/idefics-80b-instruct and HuggingFaceM4/idefics-9b-instruct. |
|
These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction |
|
fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings. |
|
Usage and prompting for the conversational use case are very similar to those for the base models:
|
|
|
```py
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

prompts = [
    [
        "User: What is in this image?",
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "<end_of_utterance>",

        "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",

        "\nUser:",
        "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
        "And who is that?<end_of_utterance>",

        "\nAssistant:",
    ],
]

# --batched mode
inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
# --single sample mode
# inputs = processor(prompts[0], return_tensors="pt").to(device)

# Generation args
exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")
```
|
|
|
|