Here we'll explicitly send the model to a GPU, if available, which we didn't need to do earlier when training, as [Trainer] handles this automatically:
```python
from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```
The model takes an image and text as input, so let's use the exact same image/question pair from the first example in the VQA dataset:
```python
from PIL import Image

example = dataset[0]
image = Image.open(example['image_id'])
question = example['question']
```
To use BLIP-2 for the visual question answering task, the textual prompt has to follow a specific format: `Question: {} Answer:`.
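As a minimal sketch, inference with this prompt format could look like the following, reusing the `processor`, `model`, `device`, `image`, and `question` objects defined above (the `max_new_tokens=10` value is an arbitrary choice for short answers):

```python
# Format the question with the prompt template BLIP-2 expects
prompt = f"Question: {question} Answer:"

# Preprocess the image and prompt, then move the tensors to the model's device and dtype
inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

# Generate an answer and decode it back to text
generated_ids = model.generate(**inputs, max_new_tokens=10)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
```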