Take a look at the sequence length of these two audio samples: dataset[0]["audio"]["array"].shape (173398,) dataset[1]["audio"]["array"].shape (106496,) Create a function to preprocess the dataset so the audio samples are the same lengths.