We also show that
allocating additional capacity to the output embedding provides benefits to the model that persist through the
fine-tuning stage even though the output embedding is discarded after pre-training.
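As a rough illustration of the parameter accounting behind this claim, the sketch below counts embedding weights when the input and output embeddings are separate matrices of different widths. It assumes the two embeddings are untied. The function names and all sizes are hypothetical, chosen only for illustration, and are not taken from the model described here.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """Number of weights in a vocab_size x dim embedding matrix."""
    return vocab_size * dim


def decoupled_embedding_budget(vocab_size: int, input_dim: int, output_dim: int) -> dict:
    """Parameter counts for untied input and output embeddings.

    Assumption: the output embedding is used only during pre-training
    and is dropped before fine-tuning, so it adds no fine-tuning cost.
    """
    input_params = embedding_params(vocab_size, input_dim)
    output_params = embedding_params(vocab_size, output_dim)
    return {
        "input": input_params,
        "output": output_params,
        "pretraining_total": input_params + output_params,
        "fine_tuning_total": input_params,  # output embedding discarded
    }


# Hypothetical sizes: a narrow input embedding and a wider output embedding.
budget = decoupled_embedding_budget(vocab_size=1000, input_dim=16, output_dim=64)
print(budget)
```

Under these assumed sizes, widening the output embedding raises the pre-training parameter count but leaves the fine-tuning count unchanged, since only the input embedding is kept.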