def collate_fn(batch): pixel_values = [item["pixel_values"] for item in batch] encoding = image_processor.pad(pixel_values, return_tensors="pt") labels = [item["labels"] for item in batch] batch = {} batch["pixel_values"] = encoding["pixel_values"] batch["pixel_mask"] = encoding["pixel_mask"] batch["labels"] = labels return batch Multimodal For tasks involving multimodal inputs, you'll need a processor to prepare your dataset for the model.