File size: 9,542 Bytes
57bdca5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209


Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate
We encourage you to login to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to login:

from huggingface_hub import notebook_login
notebook_login()

Load MInDS-14 dataset
Start by loading the MInDS-14 dataset from the 🤗 Datasets library:

from datasets import load_dataset, Audio
minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

Split the dataset's train split into a smaller train and test set with the [~datasets.Dataset.train_test_split] method. This'll give you a chance to experiment and make sure everything works before spending more time on the full dataset.

minds = minds.train_test_split(test_size=0.2)

Then take a look at the dataset:

minds
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})

While the dataset contains a lot of useful information, like lang_id and english_transcription, you'll focus on the audio and intent_class in this guide. Remove the other columns with the [~datasets.Dataset.remove_columns] method:

minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])

Take a look at an example now:

minds["train"][0]
{'audio': {'array': array([ 0.        ,  0.        ,  0.        , , -0.00048828,
         -0.00024414, -0.00024414], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
  'sampling_rate': 8000},
 'intent_class': 2}

There are two fields:

audio: a 1-dimensional array of the speech signal that must be called to load and resample the audio file. 
intent_class: represents the class id of the speaker's intent. 

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

labels = minds["train"].features["intent_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
     label2id[label] = str(i)
     id2label[str(i)] = label

Now you can convert the label id to a label name:

id2label[str(2)]
'app_error'

Preprocess
The next step is to load a Wav2Vec2 feature extractor to process the audio signal:

from transformers import AutoFeatureExtractor
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

The MInDS-14 dataset has a sampling rate of 8000khz (you can find this information in it's dataset card), which means you'll need to resample the dataset to 16000kHz to use the pretrained Wav2Vec2 model:

minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]
{'audio': {'array': array([ 2.2098757e-05,  4.6582241e-05, -2.2803260e-05, ,
         -2.8419291e-04, -2.3305941e-04, -1.1425107e-04], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
  'sampling_rate': 16000},
 'intent_class': 2}

Now create a preprocessing function that:

Calls the audio column to load, and if necessary, resample the audio file.
Checks if the sampling rate of the audio file matches the sampling rate of the audio data a model was pretrained with. You can find this information in the Wav2Vec2 model card.
Set a maximum input length to batch longer inputs without truncating them.

def preprocess_function(examples):
     audio_arrays = [x["array"] for x in examples["audio"]]
     inputs = feature_extractor(
         audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
     )
     return inputs

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [~datasets.Dataset.map] function. You can speed up map by setting batched=True to process multiple elements of the dataset at once. Remove the columns you don't need, and rename intent_class to label because that's the name the model expects:

encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
encoded_minds = encoded_minds.rename_column("intent_class", "label")

Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load a evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

import evaluate
accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to [~evaluate.EvaluationModule.compute] to calculate the accuracy:

import numpy as np
def compute_metrics(eval_pred):
     predictions = np.argmax(eval_pred.predictions, axis=1)
     return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

Your compute_metrics function is ready to go now, and you'll return to it when you setup your training.
Train

If you aren't familiar with finetuning a model with the [Trainer], take a look at the basic tutorial here!

You're ready to start training your model now! Load Wav2Vec2 with [AutoModelForAudioClassification] along with the number of expected labels, and the label mappings:

from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer
num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
     "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
 )

At this point, only three steps remain:

Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer] will evaluate the accuracy and save the training checkpoint.
Pass the training arguments to [Trainer] along with the model, dataset, tokenizer, data collator, and compute_metrics function.
Call [~Trainer.train] to finetune your model.

training_args = TrainingArguments(
     output_dir="my_awesome_mind_model",
     evaluation_strategy="epoch",
     save_strategy="epoch",
     learning_rate=3e-5,
     per_device_train_batch_size=32,
     gradient_accumulation_steps=4,
     per_device_eval_batch_size=32,
     num_train_epochs=10,
     warmup_ratio=0.1,
     logging_steps=10,
     load_best_model_at_end=True,
     metric_for_best_model="accuracy",
     push_to_hub=True,
 )
trainer = Trainer(
     model=model,
     args=training_args,
     train_dataset=encoded_minds["train"],
     eval_dataset=encoded_minds["test"],
     tokenizer=feature_extractor,
     compute_metrics=compute_metrics,
 )
trainer.train()

Once training is completed, share your model to the Hub with the [~transformers.Trainer.push_to_hub] method so everyone can use your model:

trainer.push_to_hub()

For a more in-depth example of how to finetune a model for audio classification, take a look at the corresponding PyTorch notebook.

Inference
Great, now that you've finetuned a model, you can use it for inference!
Load an audio file you'd like to run inference on. Remember to resample the sampling rate of the audio file to match the sampling rate of the model if you need to!

from datasets import load_dataset, Audio
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate = dataset.features["audio"].sampling_rate
audio_file = dataset[0]["audio"]["path"]

The simplest way to try out your finetuned model for inference is to use it in a [pipeline]. Instantiate a pipeline for audio classification with your model, and pass your audio file to it:

from transformers import pipeline
classifier = pipeline("audio-classification", model="stevhliu/my_awesome_minds_model")
classifier(audio_file)
[
    {'score': 0.09766869246959686, 'label': 'cash_deposit'},
    {'score': 0.07998877018690109, 'label': 'app_error'},
    {'score': 0.0781070664525032, 'label': 'joint_account'},
    {'score': 0.07667109370231628, 'label': 'pay_bill'},
    {'score': 0.0755252093076706, 'label': 'balance'}
]

You can also manually replicate the results of the pipeline if you'd like:

Load a feature extractor to preprocess the audio file and return the input as PyTorch tensors:

from transformers import AutoFeatureExtractor
feature_extractor = AutoFeatureExtractor.from_pretrained("stevhliu/my_awesome_minds_model")
inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

Pass your inputs to the model and return the logits:

from transformers import AutoModelForAudioClassification
model = AutoModelForAudioClassification.from_pretrained("stevhliu/my_awesome_minds_model")
with torch.no_grad():
     logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a label:

import torch
predicted_class_ids = torch.argmax(logits).item()
predicted_label = model.config.id2label[predicted_class_ids]
predicted_label
'cash_deposit'