Within each decoder block, GPT-2 uses a masked self-attention layer which means GPT-2 can't attend to future tokens.