Apply the preprocessing function over the entire dataset, and remove any columns you don't need:

```py
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)
```

The dataset now contains the token sequences, but some of these are longer than the maximum input length for the model.
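One common way to deal with overlong sequences is to concatenate all of them and then split the result into fixed-length chunks. The sketch below assumes a `block_size` of 128; the function name `group_texts` and the block size are illustrative choices, not prescribed values:

```py
block_size = 128

def group_texts(examples):
    # Concatenate every field (e.g. input_ids, attention_mask) across the batch.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    # Split the concatenated sequences into chunks of block_size.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
```

Because `group_texts` changes the number of examples, it must run with `batched=True` so `map` can return a different number of rows than it received.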