The only preprocessing you need to do is take the argmax of the predicted logits:

    import evaluate
    import numpy as np

    metric = evaluate.load("accuracy")


    def compute_metrics(eval_pred):
        # eval_pred.predictions holds the logits; pick the highest-scoring class per example.
        predictions = np.argmax(eval_pred.predictions, axis=1)
        return metric.compute(predictions=predictions, references=eval_pred.label_ids)

A note on evaluation: In the VideoMAE paper, the authors use the following evaluation strategy.