There are fewer activation and normalization layers, the activation function is GELU instead of ReLU, and LayerNorm replaces BatchNorm.
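To make the contrast concrete, here is a minimal sketch of a residual block in this style: a single GELU activation, a single channel-wise LayerNorm, and otherwise plain convolution and linear layers. The block structure, names (e.g. `SimplifiedBlock`), and dimensions are illustrative assumptions, not the model's actual implementation.

```python
import torch
from torch import nn


class SimplifiedBlock(nn.Module):
    """Toy residual block: one GELU and one LayerNorm per block,
    instead of the repeated ReLU + BatchNorm pattern of a classic ResNet block.
    Illustrative only -- not the model's real implementation."""

    def __init__(self, dim: int):
        super().__init__()
        self.depthwise = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)        # LayerNorm instead of BatchNorm
        self.pointwise_1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()                 # single GELU instead of several ReLUs
        self.pointwise_2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.depthwise(x)
        x = x.permute(0, 2, 3, 1)            # NCHW -> NHWC so LayerNorm acts on channels
        x = self.norm(x)
        x = self.pointwise_1(x)
        x = self.act(x)
        x = self.pointwise_2(x)
        x = x.permute(0, 3, 1, 2)            # back to NCHW
        return residual + x


x = torch.randn(1, 96, 56, 56)
print(SimplifiedBlock(96)(x).shape)          # torch.Size([1, 96, 56, 56])
```

Note that the entire block contains exactly one activation and one normalization layer, which is the design point the paragraph above describes.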