## Feed Forward Chunking
In each residual attention block in transformers, the self-attention layer is usually followed by two feed forward layers.
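Because the feed forward layers act on each sequence position independently, their output can be computed chunk by chunk along the sequence dimension and concatenated, trading compute scheduling for a lower peak memory footprint of the intermediate activations. Below is a minimal sketch of this idea in plain PyTorch; the class and parameter names (`ChunkedFeedForward`, `chunk_size`, `intermediate_size`) are illustrative assumptions, not the library's own implementation (Transformers models expose the feature through a feed-forward chunk size setting in the model config, where supported).

```python
import torch
import torch.nn as nn


class ChunkedFeedForward(nn.Module):
    """Feed forward sublayer that can process the sequence in chunks.

    Sketch only: names and structure are assumptions for illustration.
    """

    def __init__(self, hidden_size: int, intermediate_size: int, chunk_size: int = 0):
        super().__init__()
        self.dense_in = nn.Linear(hidden_size, intermediate_size)   # first feed forward layer
        self.dense_out = nn.Linear(intermediate_size, hidden_size)  # second feed forward layer
        self.act = nn.GELU()
        self.chunk_size = chunk_size  # 0 disables chunking

    def _ffn(self, x: torch.Tensor) -> torch.Tensor:
        return self.dense_out(self.act(self.dense_in(x)))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        if self.chunk_size == 0:
            return self._ffn(hidden_states)
        # The FFN is applied per position, so splitting along the sequence
        # dimension gives mathematically identical results while keeping the
        # large intermediate activation (seq_chunk x intermediate_size) small.
        chunks = hidden_states.split(self.chunk_size, dim=1)
        return torch.cat([self._ffn(chunk) for chunk in chunks], dim=1)


# Chunked and unchunked computation produce the same output (up to float tolerance).
chunked = ChunkedFeedForward(hidden_size=64, intermediate_size=256, chunk_size=8)
unchunked = ChunkedFeedForward(hidden_size=64, intermediate_size=256, chunk_size=0)
unchunked.load_state_dict(chunked.state_dict())
x = torch.randn(2, 32, 64)
assert torch.allclose(chunked(x), unchunked(x), atol=1e-6)
```

The memory saving matters most when the intermediate size is much larger than the hidden size, since the per-chunk intermediate tensor replaces one spanning the whole sequence; a smaller chunk size lowers peak memory further at the cost of more, smaller matrix multiplications.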