The abstract from the paper is the following:
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
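To make the quadratic scaling concrete, here is a minimal NumPy sketch of full scaled dot-product self-attention (not the paper's method, and the sequence length and hidden size are arbitrary illustrative values): the score matrix has shape `(n, n)`, so its memory and compute grow as the square of the sequence length.

```python
# Minimal sketch: full self-attention materializes an (n, n) score matrix,
# which is why cost grows quadratically with the sequence length n.
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over a sequence x of shape (n, d)."""
    n, d = x.shape
    # For simplicity, queries, keys, and values are the inputs themselves.
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(d)  # shape (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v  # shape (n, d)

x = np.random.randn(4096, 64)  # hypothetical long sequence
out = self_attention(x)
print(out.shape)  # (4096, 64)
print(4096 * 4096 * 8 / 1e9, "GB for a single float64 score matrix")  # ~0.13 GB
```

Doubling the sequence length quadruples the size of the score matrix, which is the limitation the abstract refers to.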