The architecture is similar to GPT2 except that GPT Neo uses local attention in every other layer with a window size of 256 tokens.
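This alternating local/global attention pattern and the 256-token window are exposed through the model configuration. The snippet below is a minimal sketch that inspects these fields on a default `GPTNeoConfig`; the printed values are assumptions based on that default configuration and may differ for other checkpoints.

```python
# A minimal sketch: inspect how the alternating local/global attention
# pattern and the 256-token window appear in a GPT Neo configuration.
# Assumes the Hugging Face `transformers` library is installed; default
# GPTNeoConfig values are used purely for illustration.
from transformers import GPTNeoConfig

config = GPTNeoConfig()

# attention_types compactly describes the per-layer pattern,
# e.g. [[["global", "local"], 12]] expands to 24 alternating layers.
print(config.attention_types)

# attention_layers is the expanded per-layer list.
print(config.attention_layers)  # ['global', 'local', 'global', 'local', ...]

# window_size is the local-attention span in tokens.
print(config.window_size)  # 256
```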