Utilize chunking with MegaConfig.use_chunking and control chunk size with MegaConfig.chunk_size Implementation Notes The original implementation of MEGA had an inconsistent expectation of attention masks for padding and causal self-attention between the softmax attention and Laplace/squared ReLU method.