The following code example runs a forward pass using the MMS-TTS English checkpoint:
```python
import torch
from transformers import VitsTokenizer, VitsModel, set_seed

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
model = VitsModel.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")

# VITS uses a stochastic duration predictor, so fix the seed to get reproducible output
set_seed(555)

with torch.no_grad():
    outputs = model(**inputs)

waveform = outputs.waveform[0]
```
The resulting waveform can be saved as a `.wav` file:
```python
import scipy.io.wavfile

# scipy expects a NumPy array, so convert the PyTorch tensor first
scipy.io.wavfile.write("synthesized_speech.wav", rate=model.config.sampling_rate, data=waveform.numpy())
```
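When given float data, `scipy.io.wavfile.write` stores it as a floating-point WAV file, and some audio tools only expect integer PCM. The sketch below is an illustrative alternative, not part of the Transformers API: it assumes the model's samples lie roughly in [-1.0, 1.0] and uses only the standard-library `wave` and `struct` modules to write 16-bit mono PCM. The helper name `write_pcm16_wav` is hypothetical.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sampling_rate):
    """Hypothetical helper: write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM WAV."""
    # Clip each sample to [-1, 1] so scaling cannot overflow the int16 range
    ints = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in samples]
    # WAV stores PCM samples as little-endian signed 16-bit integers
    frames = struct.pack(f"<{len(ints)}h", *ints)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(frames)
```

With the objects from the forward-pass example, a call might look like `write_pcm16_wav("synthesized_speech.wav", waveform.tolist(), model.config.sampling_rate)`. Clipping trades a little distortion on out-of-range peaks for a file that never wraps around.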
Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio

Audio(waveform, rate=model.config.sampling_rate)
```
For certain languages with non-Roman alphabets, such as Arabic, Mandarin, or Hindi, the [uroman](https://github.com/isi-nlp/uroman) Perl package is required to convert the text inputs to the Roman alphabet before tokenization.
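One way to apply this pre-processing from Python is to pipe the text through `uroman.pl`, which reads from standard input and writes the romanized text to standard output. The sketch below assumes a local uroman checkout with the script at `<uroman_path>/bin/uroman.pl` and Perl on the `PATH`; the helper names `build_uroman_command` and `romanize` are hypothetical, not part of Transformers.

```python
import os
import subprocess


def build_uroman_command(uroman_path):
    """Hypothetical helper: command that runs uroman.pl from a local checkout (assumed layout)."""
    return ["perl", os.path.join(uroman_path, "bin", "uroman.pl")]


def romanize(text, command):
    """Hypothetical helper: pipe `text` through `command` and return its output."""
    result = subprocess.run(command, input=text, capture_output=True, encoding="utf-8")
    if result.returncode != 0:
        raise RuntimeError(f"romanizer failed ({result.returncode}): {result.stderr}")
    # uroman.pl ends its output with a newline; strip it before tokenizing
    return result.stdout.rstrip("\n")
```

`VitsTokenizer` exposes an `is_uroman` attribute indicating whether a checkpoint expects romanized input, so a typical pattern is `if tokenizer.is_uroman: text = romanize(text, build_uroman_command("/path/to/uroman"))` before calling the tokenizer. Taking the command as a parameter keeps the subprocess logic independent of where uroman is installed.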