DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

DinoSR is a self-supervised speech representation learning model that combines masked prediction with self-distillation and online clustering. It achieves state-of-the-art performance on several downstream speech processing benchmarks.

Table of Contents

  • Model Details
  • Usage
  • Citation
  • Additional Information

Model Details

Developers

  • Alexander H. Liu, Heng-Jui Chang (MIT CSAIL)
  • Michael Auli, Wei-Ning Hsu (Meta AI)
  • James Glass (MIT CSAIL)

Model Type

Self-supervised speech representation learning (Wav2Vec2 architecture variant, ~95.8M parameters, F32)

Key Features

  • Teacher-student self-distillation with an EMA-updated teacher
  • Online clustering of teacher representations into discrete codebooks
  • Contextualized span masking of the student's input
  • Masked prediction of the teacher's cluster assignments via a cross-entropy objective (a minimal training-step sketch follows this list)
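
The interplay of these components is easiest to see in code. The following is a minimal, illustrative PyTorch sketch of the training objective, not the authors' implementation: the linear "encoders", dimensions, decay rates, and all names here are stand-in assumptions (DinoSR itself uses Transformer encoders and codebooks over the top teacher layers).

import torch
import torch.nn as nn
import torch.nn.functional as F

D, K = 256, 64                      # embedding dim and codebook size (illustrative)
student = nn.Linear(80, D)          # stand-in encoders; DinoSR uses Transformers
teacher = nn.Linear(80, D)
teacher.load_state_dict(student.state_dict())
head = nn.Linear(D, K)              # student head predicting cluster logits
codebook = torch.randn(K, D)        # online-clustered codewords

def ema_update(teacher, student, decay=0.999):
    # Teacher weights track an exponential moving average of the student
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1 - decay)

def cluster_targets(codebook, feats, decay=0.9):
    # Nearest-codeword assignment, then EMA update of the used codewords
    targets = torch.cdist(feats, codebook).argmin(dim=1)  # [T]
    for k in targets.unique().tolist():
        mean = feats[targets == k].mean(dim=0)
        codebook[k] = decay * codebook[k] + (1 - decay) * mean
    return targets

def train_step(x, mask):
    # Teacher sees clean input; its frames are discretized by online clustering
    with torch.no_grad():
        targets = cluster_targets(codebook, teacher(x))
    # Student sees masked input (zeroing is a crude stand-in for span masking)
    logits = head(student(x.masked_fill(mask.unsqueeze(-1), 0.0)))
    # Cross-entropy: predict the teacher's cluster index at masked frames
    loss = F.cross_entropy(logits[mask], targets[mask])
    ema_update(teacher, student)    # in practice, done after the optimizer step
    return loss

x = torch.randn(100, 80)            # 100 frames of illustrative input features
mask = torch.rand(100) < 0.5
train_step(x, mask).backward()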

Usage

Feature Extraction

from transformers import Wav2Vec2ForPreTraining, Wav2Vec2FeatureExtractor
import torch
import librosa

# Load model components
model = Wav2Vec2ForPreTraining.from_pretrained("MohammadJRanjbar/DinoSR")
processor = Wav2Vec2FeatureExtractor.from_pretrained("MohammadJRanjbar/DinoSR")

# Process audio
audio, sr = librosa.load("speech.wav", sr=16000)
inputs = processor(audio, return_tensors="pt", sampling_rate=16000)

# Extract representations
with torch.no_grad():
    outputs = model(**inputs)
    
speech_features = outputs.projected_states  # [batch_size, seq_len, 256]
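
The projected_states tensor is the output of the pre-training head. For downstream probing, intermediate Transformer layers often transfer better; a minimal variant of the call above requests them via a standard transformers keyword argument:

# Request per-layer hidden states alongside the pre-training outputs
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

hidden_states = outputs.hidden_states  # tuple of [batch_size, seq_len, hidden_size], one per layer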

Fine-tuning for ASR

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "MohammadJRanjbar/DinoSR",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    layerdrop=0.1,
    ctc_loss_reduction="mean"
)

# Freeze feature encoder
model.freeze_feature_encoder()
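
Before training, pair the model with a processor that includes a CTC tokenizer for the target vocabulary. The step below is a minimal sketch; the processor path and transcript are placeholders, not part of this repository:

from transformers import Wav2Vec2Processor

# Hypothetical processor combining the feature extractor with a CTC tokenizer
processor = Wav2Vec2Processor.from_pretrained("path/to/your-processor")

batch = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor(text="your transcript here", return_tensors="pt").input_ids

# Wav2Vec2ForCTC computes the CTC loss internally when labels are given
outputs = model(input_values=batch.input_values, labels=labels)
outputs.loss.backward()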

Citation

@article{liu2023dinosr,
  title={DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning},
  author={Liu, Alexander H and Chang, Heng-Jui and Auli, Michael and Hsu, Wei-Ning and Glass, James},
  journal={arXiv preprint arXiv:2305.10005},
  year={2023}
}

Additional Information

Resources

  • Paper: https://arxiv.org/abs/2305.10005

Contact

For questions and feedback, please open a discussion in the Community tab of this repository.

This model card was generated using best practices from Model Card Creator
