For this task, load the word error rate (WER) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

```py
import evaluate

wer = evaluate.load("wer")
```

Then create a function that passes your predictions and labels to [`~evaluate.EvaluationModule.compute`] to calculate the WER:

```py
import numpy as np


def compute_metrics(pred):
    pred_logits = pred.predictions
    # Take the most likely token at each timestep (greedy decoding)
    pred_ids = np.argmax(pred_logits, axis=-1)

    # Replace -100 (ignored positions) with the pad token id so they can be decoded
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # Don't group repeated tokens in the labels; they aren't CTC output
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    # Use a distinct local name so the global `wer` metric isn't shadowed
    wer_score = wer.compute(predictions=pred_str, references=label_str)

    return {"wer": wer_score}
```

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
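If you'd like to sanity-check the metric itself before wiring it into training, you can call [`~evaluate.EvaluationModule.compute`] directly on a couple of strings. This is a minimal sketch; the transcripts below are made up purely for illustration:

```py
import evaluate

wer = evaluate.load("wer")

# Hypothetical transcripts for illustration only
predictions = ["hello world", "good morning"]
references = ["hello word", "good morning"]

# One substitution ("world" vs. "word") out of 4 reference words -> 0.25
print(wer.compute(predictions=predictions, references=references))
```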