Load the LJ Speech dataset (see the 🤗 Datasets tutorial for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR): | |
from datasets import load_dataset | |
lj_speech = load_dataset("lj_speech", split="train") | |
For ASR, you're mainly focused on audio and text so you can remove the other columns: | |
lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"]) | |
Now take a look at the audio and text columns: | |
lj_speech[0]["audio"] | |
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, , | |
7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32), | |
'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav', | |
'sampling_rate': 22050} | |
lj_speech[0]["text"] | |
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition' | |
Remember you should always resample your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model! |