Enable chunking with MegaConfig.use_chunking and set the chunk size with MegaConfig.chunk_size.
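As a rough illustration of what a fixed chunk size implies for inputs, the sketch below pads a token sequence to a multiple of the chunk size and splits it into chunks. It assumes that chunked processing expects sequence lengths divisible by the chunk size. The helper names `pad_to_chunk_multiple` and `split_into_chunks` and the pad id `0` are hypothetical, not part of the library, and this is not how the model chunks internally.

```python
def pad_to_chunk_multiple(token_ids, chunk_size, pad_token_id=0):
    """Right-pad token_ids so that its length is a multiple of chunk_size.

    Hypothetical helper for illustration only; pad_token_id=0 is an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    remainder = len(token_ids) % chunk_size
    if remainder == 0:
        return list(token_ids)
    return list(token_ids) + [pad_token_id] * (chunk_size - remainder)


def split_into_chunks(token_ids, chunk_size):
    """Split an already padded sequence into consecutive chunks of chunk_size."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


# Ten tokens with a chunk size of 4 are padded to length 12, then split into 3 chunks.
padded = pad_to_chunk_multiple(list(range(1, 11)), chunk_size=4)
chunks = split_into_chunks(padded, chunk_size=4)
```

In practice, the tokenizer's `pad_to_multiple_of` argument may serve the same purpose, with the value matched to MegaConfig.chunk_size.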
Implementation Notes
The original implementation of MEGA had inconsistent expectations of the attention masks used for padding and causal self-attention, depending on whether softmax attention or the Laplace/squared ReLU method was used.