---
license: cc-by-nc-sa-4.0
metrics:
  - accuracy
  - f1
  - uar
pipeline_tag: audio-classification
tags:
  - audio
  - audio-classification
  - speech-emotion-recognition
  - autrainer
library_name: autrainer
model-index:
  - name: msp-podcast-emo-class-big4-w2v2-l-emo
    results:
      - task:
          type: audio-classification
          name: Speech Emotion Recognition
        metrics:
          - type: accuracy
            name: Accuracy
            value: 0.6166793457588436
          - type: f1
            name: F1
            value: 0.5716599171523286
          - type: uar
            name: Unweighted Average Recall
            value: 0.6499883154795764
base_model:
  - audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim
---

# Speech Emotion Recognition Model

A `Wav2Vec2-Large-Robust` model fine-tuned on the [MSP-Podcast](https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html)
(v1.11) dataset for classifying emotions into four categories: _Anger (A)_, _Happiness (H)_, _Neutral (N)_, and _Sadness (S)_.

## Installation

To use the model, install autrainer, e.g., via pip:

```bash
pip install autrainer
```
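
If you prefer an isolated setup, autrainer can also be installed into a virtual environment (a minimal sketch; the environment name `.venv` is arbitrary):

```bash
# Optional: create and activate a fresh virtual environment before installing
python -m venv .venv
source .venv/bin/activate
pip install autrainer
```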

## Usage

The model can be applied to all audio files in a folder (`<data-root>`); the predictions are stored in another folder (`<output-root>`):

```bash
autrainer inference hf:autrainer/msp-podcast-emo-class-big4-w2v2-l-emo <data-root> <output-root>
```
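
For example, assuming a folder of WAV files at `./speech-clips` (hypothetical paths used only for illustration), the per-file predictions would be written to `./predictions`:

```bash
# Hypothetical local paths; replace them with your own folders
autrainer inference hf:autrainer/msp-podcast-emo-class-big4-w2v2-l-emo ./speech-clips ./predictions
```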

## Training

### Pretraining

The model was originally trained on the MSP-Podcast (v1.7) dataset by [audEERING](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) to predict three emotional dimensions: _arousal_, _dominance_, and _valence_.

### Dataset

The model was further fine-tuned on the MSP-Podcast (v1.11) dataset, a large corpus of spontaneous emotional speech collected from a variety of podcast recordings.
The dataset contains natural emotional expressions covering a broad range of speakers, recording conditions, and conversation topics.

### Training Process

The model was fine-tuned for 5 epochs.
At the end of each epoch, it was evaluated on the validation set.
We release the state that achieved the best performance on this validation set.
All training hyperparameters can be found in the main configuration file (`conf/config.yaml`).
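
As a sketch, assuming `conf/config.yaml` is shipped in the model repository on the Hub, the file can be fetched for inspection with the Hugging Face CLI:

```bash
# Assumes conf/config.yaml is part of the Hub repository;
# requires the huggingface_hub CLI (e.g., pip install -U "huggingface_hub[cli]")
huggingface-cli download autrainer/msp-podcast-emo-class-big4-w2v2-l-emo conf/config.yaml --local-dir .
```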

### Evaluation

We evaluate the model on the `Test1` split of the MSP-Podcast dataset.
The model achieves an unweighted average recall (UAR, the mean of the per-class recalls over the four emotion classes) of 0.650 on the test set.

## Acknowledgements

Please acknowledge the work that produced the original model and the MSP-Podcast dataset.
We would also appreciate an acknowledgment of autrainer.