Our analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages.
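To make the setup concrete, below is a minimal, hypothetical PyTorch sketch of a language-modeling head whose output embedding dimension is decoupled from (and larger than) the Transformer's hidden size; the class and parameter names (`DecoupledLMHead`, `output_embed_size`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DecoupledLMHead(nn.Module):
    """Hypothetical sketch: an output embedding whose dimension is
    decoupled from, and larger than, the encoder's hidden size."""

    def __init__(self, hidden_size: int, output_embed_size: int, vocab_size: int):
        super().__init__()
        # Project final hidden states up into the larger output-embedding space.
        self.up_proj = nn.Linear(hidden_size, output_embed_size)
        # Output embedding matrix, independent of the input embeddings.
        self.output_embeddings = nn.Linear(output_embed_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len, vocab_size)
        return self.output_embeddings(self.up_proj(hidden_states))

# Usage: an output embedding (1024) larger than the hidden size (768).
head = DecoupledLMHead(hidden_size=768, output_embed_size=1024, vocab_size=30000)
logits = head(torch.randn(2, 16, 768))  # -> shape (2, 16, 30000)
```

Because the output embedding is separate from the input embedding, it can be discarded after pre-training, so the larger dimension adds no parameters at fine-tuning or inference time.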