Ahmadzei's picture
added 3 more tables for large emb model
5fa1a76
Each speech candidate is passed through the speech encoder ([ClvpEncoder]) which converts them into a vector representation, and the text encoder ([ClvpEncoder]) converts the text tokens into the same latent space.