def collate_fn(batch):
     pixel_values = [item["pixel_values"] for item in batch]
     encoding = image_processor.pad(pixel_values, return_tensors="pt")
     labels = [item["labels"] for item in batch]
     batch = {}
     batch["pixel_values"] = encoding["pixel_values"]
     batch["pixel_mask"] = encoding["pixel_mask"]
     batch["labels"] = labels
     return batch

Multimodal
For tasks involving multimodal inputs, you'll need a processor to prepare your dataset for the model.