Ahmadzei's picture
added 3 more tables for large emb model
5fa1a76
Preprocess
The next step is to load a Wav2Vec2 processor to process the audio signal:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
The MInDS-14 dataset has a sampling rate of 8000kHz (you can find this information in its dataset card), which means you'll need to resample the dataset to 16000kHz to use the pretrained Wav2Vec2 model:
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]
{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ,
2.78103951e-04, 2.38446111e-04, 1.18740834e-04], dtype=float32),
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
'sampling_rate': 16000},
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
As you can see in the transcription above, the text contains a mix of upper and lowercase characters.