The architecture is similar to that of GPT-2, except that GPT Neo uses local attention in every other layer, with a window size of 256 tokens.
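This alternating global/local attention pattern can be expressed directly when building the model. Below is a minimal sketch using the Hugging Face Transformers `GPTNeoConfig`, whose `attention_types` and `window_size` parameters control the layer pattern and the local attention window; the values shown mirror the defaults described above.

```python
from transformers import GPTNeoConfig, GPTNeoModel

# Alternate global and local attention, repeating the pair for 12 blocks
# (24 layers total), with a 256-token local attention window.
config = GPTNeoConfig(
    attention_types=[[["global", "local"], 12]],
    window_size=256,
)

model = GPTNeoModel(config)
```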