The abstract from the paper is the following: Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
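As a rough illustration of the quadratic scaling claim (a minimal sketch, not code from the paper): standard self-attention materializes a score matrix of shape `(seq_len, seq_len)`, so compute and memory grow with the square of the sequence length.

```python
import torch

def attention_scores(q, k):
    # q, k: (seq_len, d). The score matrix is (seq_len, seq_len),
    # so cost scales quadratically with seq_len.
    return torch.softmax(q @ k.T / (k.shape[-1] ** 0.5), dim=-1)

for seq_len in (512, 1024, 2048):
    q = torch.randn(seq_len, 64)
    k = torch.randn(seq_len, 64)
    scores = attention_scores(q, k)
    # Doubling seq_len quadruples the number of score entries.
    print(seq_len, scores.shape, scores.numel())
```

Doubling the sequence length quadruples the size of the attention matrix, which is the bottleneck long-sequence models aim to remove.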