Our analysis shows that larger
output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage
Transformer representations to be more general and more transferable to other tasks and languages.
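
To make the idea concrete, here is a minimal sketch (not the paper's implementation) of decoupling the output embedding from the input embedding so the output side can be made larger. All module names, layer counts, and sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DecoupledEmbeddingLM(nn.Module):
    """Toy masked-LM head where the output embedding is untied from the
    input embedding, so its dimensionality can be chosen independently."""

    def __init__(self, vocab_size=30522, hidden_size=256, output_embedding_size=512):
        super().__init__()
        # Input embedding stays at the Transformer's hidden size.
        self.input_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Project final hidden states up into a larger output-embedding space
        # before the vocabulary projection. Because the output embedding is
        # not tied to the input embedding, it can be made larger without
        # growing the encoder itself.
        self.up_projection = nn.Linear(hidden_size, output_embedding_size)
        self.output_embeddings = nn.Linear(output_embedding_size, vocab_size)

    def forward(self, input_ids):
        hidden = self.encoder(self.input_embeddings(input_ids))
        return self.output_embeddings(self.up_projection(hidden))

# Usage example: logits over the vocabulary for a batch of token ids.
model = DecoupledEmbeddingLM()
logits = model(torch.randint(0, 30522, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 30522])
```

One practical consequence of this decoupling: since the output embedding is only used during pre-training, it can be discarded at fine-tuning time, so the larger output side adds no cost to the downstream model.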