The following code example runs a forward pass using the MMS-TTS English checkpoint:

```python
import torch
from transformers import VitsTokenizer, VitsModel, set_seed

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
model = VitsModel.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")

set_seed(555)  # make deterministic

with torch.no_grad():
    outputs = model(**inputs)

waveform = outputs.waveform[0]
```

The resulting waveform can be saved as a `.wav` file:

```python
import scipy.io.wavfile

# scipy expects a NumPy array, so convert the torch tensor first
scipy.io.wavfile.write("synthesized_speech.wav", rate=model.config.sampling_rate, data=waveform.numpy())
```

Or displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

Audio(waveform.numpy(), rate=model.config.sampling_rate)
```

For certain languages with non-Roman alphabets, such as Arabic, Mandarin or Hindi, the `uroman` Perl package is required to convert the text inputs to the Roman alphabet before they are passed to the tokenizer.
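The romanization step can be sketched as a small wrapper that pipes text through the `uroman` Perl script. This is a minimal sketch, not a definitive implementation: it assumes `perl` is installed, that a local checkout of `uroman` exists, and that the script lives at `bin/uroman.pl` inside that checkout. The helper names `uroman_command` and `uromanize`, and the injectable `run` parameter, are hypothetical and exist only for illustration.

```python
import os
import subprocess


def uroman_command(uroman_path):
    """Build the argv list for the uroman script.

    Assumption: the script sits at <uroman_path>/bin/uroman.pl.
    """
    return ["perl", os.path.join(uroman_path, "bin", "uroman.pl")]


def uromanize(text, uroman_path, run=subprocess.run):
    """Romanize `text` by sending it to uroman on stdin and reading stdout.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without a real uroman installation.
    """
    result = run(
        uroman_command(uroman_path),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"uroman failed ({result.returncode}): {result.stderr.decode('utf-8')}"
        )
    # Strip the trailing newline that the script appends to its output.
    return result.stdout.decode("utf-8").rstrip("\n")
```

The romanized string returned by `uromanize` would then be passed to the tokenizer in place of the raw text, for example `tokenizer(text=uromanize(raw_text, uroman_path), return_tensors="pt")`, where `raw_text` and `uroman_path` are placeholders for your own input and checkout location.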