---
license: cc-by-nc-sa-4.0
metrics:
- accuracy
- f1
- uar
pipeline_tag: audio-classification
tags:
- audio
- audio-classification
- speech-emotion-recognition
- autrainer
library_name: autrainer
model-index:
- name: msp-podcast-emo-class-big4-w2v2-l-emo
results:
- task:
type: audio-classification
name: Speech Emotion Recognition
metrics:
- type: accuracy
name: Accuracy
value: 0.6166793457588436
- type: f1
name: F1
value: 0.5716599171523286
- type: uar
name: Unweighted Average Recall
value: 0.6499883154795764
base_model:
- audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim
---
# Speech Emotion Recognition Model
`Wav2Vec2-Large-Robust` model fine-tuned on the [MSP-Podcast](https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html)
(v1.11) dataset for classifying emotions into four categories: _Anger (A)_, _Happiness (H)_, _Neutral (N)_, and _Sadness (S)_.
## Installation
To use the model, install autrainer, e.g., via pip:
```bash
pip install autrainer
```
## Usage
The model can be applied to all audio files in a folder (`<data-root>`); the predictions are stored in another folder (`<output-root>`):
```bash
autrainer inference hf:autrainer/msp-podcast-emo-class-big4-w2v2-l-emo <data-root> <output-root>
```
## Training
### Pretraining
The model was originally trained on the MSP-Podcast (v1.7) dataset by [audEERING](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) to predict three emotional dimensions: _arousal_, _dominance_, and _valence_.
### Dataset
The model was further fine-tuned on the MSP-Podcast (v1.11) dataset, a large corpus of spontaneous emotional speech collected from various podcast recordings.
The dataset includes natural emotional expressions which cover a broad range of speakers, recording conditions, and conversation topics.
### Training Process
The model was fine-tuned for 5 epochs.
At the end of each epoch, it was evaluated on the validation set.
We release the checkpoint that achieved the best validation performance.
All training hyperparameters can be found in the main configuration file (`conf/config.yaml`).
### Evaluation
We evaluate the model on the `Test1` split of the MSP-Podcast dataset.
The model achieves an unweighted average recall (UAR) of 0.650 on this test set.
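UAR is the unweighted mean of the per-class recalls, which is equivalent to macro-averaged recall. A minimal sketch of computing it with scikit-learn, using hypothetical label lists in place of real predictions:

```python
from sklearn.metrics import recall_score

# Hypothetical ground-truth and predicted labels over the four classes
# (Anger, Happiness, Neutral, Sadness) -- for illustration only.
y_true = ["A", "H", "N", "N", "S", "A"]
y_pred = ["A", "H", "N", "S", "S", "H"]

# UAR = macro-averaged recall: average each class's recall with equal weight,
# so rare classes count as much as frequent ones.
uar = recall_score(y_true, y_pred, average="macro")
print(uar)  # 0.75 for these toy labels
```

Unlike plain accuracy, UAR is robust to class imbalance, which is why it is commonly reported for emotion recognition on spontaneous-speech corpora such as MSP-Podcast.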
## Acknowledgements
Please acknowledge the works that produced the original model and the MSP-Podcast dataset.
We would also appreciate an acknowledgment to autrainer.