Let's get it from an example in the test dataset:

```python
example = dataset["test"][304]
speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)
```

Now you can pass the text and speaker embeddings to the pipeline, and it will take care of the rest:

```python
forward_params = {"speaker_embeddings": speaker_embeddings}
output = pipe(text, forward_params=forward_params)
output
```

```
{'audio': array([-6.82714235e-05, -4.26525949e-04,  1.06134125e-04, ...,
        -1.22392643e-03, -7.76011671e-04,  3.29112721e-04], dtype=float32),
 'sampling_rate': 16000}
```

You can then listen to the result:

```python
from IPython.display import Audio

Audio(output['audio'], rate=output['sampling_rate'])
```

## Run inference manually

You can achieve the same inference results without using the pipeline; however, more steps will be required.
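As a rough preview of those steps, here is a minimal sketch using the SpeechT5 classes from 🤗 Transformers. It assumes the pipeline above was built from the `microsoft/speecht5_tts` checkpoint and pairs it with the `microsoft/speecht5_hifigan` vocoder; swap in your own checkpoints if they differ.

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# Assumed checkpoints; replace with the ones you actually use
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Tokenize the text, then generate speech conditioned on the speaker embeddings,
# letting the HiFi-GAN vocoder convert the spectrogram into a waveform
inputs = processor(text=text, return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

Audio(speech.numpy(), rate=16000)
```

The sections that follow walk through these steps in more detail.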