```py
from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")
```
Split the dataset's `train` split into a train and test set with the [`~Dataset.train_test_split`] method:

```py
minds = minds.train_test_split(test_size=0.2)
```
Then take a look at the dataset:
```py
>>> minds
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 80
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 20
    })
})
```
While the dataset contains a lot of useful information, like `lang_id` and `english_transcription`, you'll focus on the `audio` and `transcription` columns in this guide.