Our analysis shows that larger
output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage
Transformer representations to be more general and more transferable to other tasks and languages.
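To make the architectural idea concrete, here is a minimal sketch (not the authors' code) of a language-modeling head with a decoupled output embedding that is larger than the hidden size: the final hidden states are projected up to the output-embedding dimension before computing vocabulary logits. All class names, dimensions, and initialization choices below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoupledLMHead(nn.Module):
    """Hypothetical LM head with an output embedding larger than the hidden size."""

    def __init__(self, vocab_size=32000, hidden=768, out_emb=1536):
        super().__init__()
        # Project the final hidden state up to the (larger) output-embedding size.
        self.proj = nn.Linear(hidden, out_emb)
        # Output embedding matrix, decoupled from the input embedding and
        # deliberately wider than the Transformer's hidden dimension.
        self.out_embed = nn.Parameter(torch.randn(vocab_size, out_emb) * 0.02)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq, hidden) -> logits: (batch, seq, vocab)
        return self.proj(hidden_states) @ self.out_embed.T

head = DecoupledLMHead()
logits = head(torch.randn(2, 8, 768))
print(logits.shape)  # torch.Size([2, 8, 32000])
```

In this sketch, the extra capacity for fitting the pre-training objective lives in the projection and the wide output matrix, both of which are discarded at fine-tuning time, which is one intuition for why the retained Transformer layers stay more general.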