Ahmadzei's picture
added 3 more tables for large emb model
5fa1a76
Load the LJ Speech dataset (see the 🤗 Datasets tutorial for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):
from datasets import load_dataset
lj_speech = load_dataset("lj_speech", split="train")
For ASR, you're mainly focused on audio and text so you can remove the other columns:
lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
Now take a look at the audio and text columns:
lj_speech[0]["audio"]
{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ,
7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
'sampling_rate': 22050}
lj_speech[0]["text"]
'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
Remember you should always resample your audio dataset's sampling rate to match the sampling rate of the dataset used to pretrain a model!