An activation layer was not added, or the residual connection was forgotten.
The word embedding matrix was not tied.
The wrong positional embeddings are used because the original implementation uses an offset.
Dropout is applied during the forward pass.
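
Several of these mistakes can be checked mechanically. The following is a minimal sketch in PyTorch, not the implementation of any particular model: `TinyLM`, its layer names, and the `pos_offset=2` value are hypothetical and chosen only for illustration. It shows one way to tie the output projection to the word embedding matrix, apply a position offset, and confirm that dropout is inactive once the model is in evaluation mode.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Hypothetical toy model, used only to illustrate the checks."""

    def __init__(self, vocab_size=50, hidden=8, max_pos=16, pos_offset=2):
        super().__init__()
        self.pos_offset = pos_offset  # assumed offset; depends on the original code
        self.embed = nn.Embedding(vocab_size, hidden)
        # Reserve extra rows so that position ids can start at pos_offset.
        self.pos_embed = nn.Embedding(max_pos + pos_offset, hidden)
        self.dropout = nn.Dropout(p=0.5)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)
        # Tie the output projection to the word embedding matrix.
        self.lm_head.weight = self.embed.weight

    def forward(self, input_ids):
        seq_len = input_ids.shape[1]
        positions = torch.arange(seq_len, device=input_ids.device) + self.pos_offset
        hidden = self.embed(input_ids) + self.pos_embed(positions)
        hidden = self.dropout(hidden)  # a no-op when self.training is False
        return self.lm_head(hidden)


torch.manual_seed(0)
model = TinyLM()
input_ids = torch.tensor([[1, 2, 3, 4]])

# Check 1: tied weights share the same underlying storage.
weights_tied = model.lm_head.weight.data_ptr() == model.embed.weight.data_ptr()

# Check 2: in eval mode dropout is disabled, so two passes agree exactly.
model.eval()
with torch.no_grad():
    eval_a = model(input_ids)
    eval_b = model(input_ids)
eval_deterministic = torch.equal(eval_a, eval_b)

# Check 3: in train mode dropout is active, so two passes differ.
model.train()
with torch.no_grad():
    train_a = model(input_ids)
    train_b = model(input_ids)
train_differs = not torch.equal(train_a, train_b)

print(weights_tied, eval_deterministic, train_differs)
```

If the outputs of a ported model change between two identical forward passes, a check like the second one is a quick way to tell whether a dropout layer is still active. Calling `model.eval()` before comparing against the original implementation rules that cause out.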