To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes speech/text states with latent units, which serve as the interface between the encoder and the decoder.
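The mixing step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the nearest-neighbour (squared L2) codebook lookup, the per-timestep random mask, and the names `quantize`, `mix_with_latent_units`, and `mix_prob` are all assumptions made for illustration.

```python
import numpy as np


def quantize(states, codebook):
    """Map each state to its nearest codebook entry (assumed squared-L2 lookup).

    states:   array of shape (T, D), encoder outputs for speech or text
    codebook: array of shape (K, D), shared latent units
    """
    # Pairwise squared distances between states and codebook entries: (T, K)
    dists = ((states[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices


def mix_with_latent_units(states, codebook, mix_prob, rng):
    """Randomly replace a subset of timesteps with their quantized latent units.

    mix_prob is a hypothetical parameter: the probability that any single
    timestep is swapped for its latent unit.
    """
    quantized, indices = quantize(states, codebook)
    # Boolean mask over timesteps; True means "use the latent unit here"
    mask = rng.random(states.shape[0]) < mix_prob
    mixed = np.where(mask[:, None], quantized, states)
    return mixed, mask, indices


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = rng.normal(size=(6, 4))    # six timesteps, four dimensions
    codebook = rng.normal(size=(3, 4))  # three latent units
    mixed, mask, indices = mix_with_latent_units(states, codebook, 0.5, rng)
    print("replaced timesteps:", np.flatnonzero(mask))
    print("chosen latent units:", indices)
```

Because the same codebook would be applied to both speech and text encoder states, the decoder in this sketch sees a sequence in which some positions are modality-specific states and others are shared latent units, which is one way to read the "interface" role the sentence describes.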