Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders.
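To illustrate what a contrastive language-audio objective typically looks like, the following is a minimal sketch under stated assumptions. It takes the outputs of some audio encoder and some text encoder as already-computed embedding matrices and applies a standard symmetric contrastive loss, in which paired audio and text embeddings sit on the diagonal of the similarity matrix. The function names, the temperature value of 0.07, and the exact loss form are assumptions made for illustration, not details confirmed by the text above.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Hypothetical helper: symmetric contrastive loss over a batch of N pairs.

    audio_emb, text_emb: arrays of shape (N, D); row i of each is a matched pair.
    temperature: assumed value, chosen only for illustration.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature           # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])         # matched pairs lie on the diagonal
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> text
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)
```

Because the loss only sees embeddings, swapping in a different audio encoder or text encoder changes how `audio_emb` and `text_emb` are produced but leaves this objective untouched, which is one way such encoder comparisons can be set up.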