Although all of its attention heads query the whole input sequence to generate attention maps from a global perspective, we observe that some heads only need to learn local dependencies, which implies computational redundancy.
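The redundancy can be made concrete with a small sketch: if a head only needs local dependencies, each query position could attend to a fixed window of neighboring keys rather than the full sequence. The function and window size below are hypothetical illustrations, not the paper's method; they simply count how many query-key pairs a windowed head would actually use compared with full attention.

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where query i may attend to key j iff |i - j| <= window.

    Hypothetical sketch of a locally restricted attention head; full
    attention corresponds to an all-True mask.
    """
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=16, window=2)
full_pairs = mask.size          # query-key pairs scored by full attention
local_pairs = int(mask.sum())   # pairs a local head actually needs
print(full_pairs, local_pairs)  # 256 vs 74: most pairs are unused by a local head
```

For a head whose useful dependencies all fall inside the window, the remaining query-key scores are computed but carry no needed information, which is the redundancy the observation points to.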