Here we'll explicitly send the model to a GPU, if available, which we didn't need to do earlier when training, as [Trainer] handles this automatically:
from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
The model takes an image and text as input, so let's use the exact same image/question pair from the first example in the VQA dataset:
example = dataset[0]
image = Image.open(example['image_id'])
question = example['question']
To use BLIP-2 for the visual question answering task, the textual prompt has to follow a specific format: `Question: {} Answer:`.
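As a minimal sketch of what that format looks like in practice (reusing the `processor`, `model`, `device`, `image`, and `question` objects defined above; the `max_new_tokens` value is an arbitrary choice for illustration), the prompt can be built and passed to the model roughly like this:

# Format the question according to the expected prompt template
prompt = f"Question: {question} Answer:"

# Preprocess the image/prompt pair and cast to float16 to match the model weights
inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

# Generate a short answer and decode it back to text
generated_ids = model.generate(**inputs, max_new_tokens=10)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)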