Although all of its attention heads query the entire input sequence to build the attention map from a global perspective, we observe that some heads only need to learn local dependencies, which implies computational redundancy.
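To make the observation concrete, here is a minimal sketch (a hypothetical illustration, not code from this repository) contrasting a head that attends globally with one restricted to a local window through a banded mask. For a head whose useful dependencies are local, the masked variant captures the same information, while the full sequence-length-squared score matrix it would otherwise compute is largely redundant.

```python
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask letting each position attend only to tokens within
    `window` positions of itself (a banded, local pattern)."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def attention(q, k, v, mask=None):
    """Standard scaled dot-product attention; positions where `mask`
    is False are excluded from the softmax."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

seq_len, d_head = 8, 16
q = k = v = torch.randn(seq_len, d_head)

global_out = attention(q, k, v)                                    # head attending over the whole sequence
local_out = attention(q, k, v, local_attention_mask(seq_len, 2))   # head limited to a local window of ±2 tokens
```

If a head's attention weights already concentrate near the diagonal, `global_out` and `local_out` differ little for that head, which is exactly the redundancy described above.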