Although every attention head queries the entire input sequence to produce its attention map from a global perspective, we observe that some heads only need to learn local dependencies, which implies computational redundancy.
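The observation above can be sketched with a minimal example, under the assumption of standard scaled dot-product attention: a "local" head is one whose attention weights are effectively confined to a band around the diagonal, so computing all n² scores for it is wasteful. The function names (`attention`, `local_mask`) and the window size are illustrative, not from the original work.

```python
import numpy as np

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; masked positions get -inf before softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def local_mask(seq_len, window):
    # True where |i - j| <= window: each query attends only to nearby keys,
    # i.e. the banded pattern a "local" head would converge to anyway.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

rng = np.random.default_rng(0)
seq_len, d = 8, 4
q, k, v = (rng.standard_normal((seq_len, d)) for _ in range(3))

full_out = attention(q, k, v)                           # global head: n^2 scores
local_out = attention(q, k, v, local_mask(seq_len, 2))  # local head: banded scores
print(full_out.shape, local_out.shape)
```

If a head's learned attention mass already concentrates inside such a band, restricting it to the banded computation changes the output little while cutting the score computation from O(n²) toward O(n·w), which is the redundancy the text refers to.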