We also show that
allocating additional capacity to the output embedding provides benefits to the model that persist through the
fine-tuning stage, even though the output embedding is discarded after pre-training.