an activation layer was not added, or the residual connection was forgotten | |
The word embedding matrix was not tied | |
The wrong positional embeddings are used because the original implementation uses an offset | |
Dropout is applied during the forward pass. |