Attention is computed only within a local window, and the window partition is shifted between consecutive attention layers so that tokens near a window boundary can exchange information with their neighbors in the next layer, giving the model cross-window connections without the cost of global attention.
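As a rough illustration, here is a minimal 1D sketch in PyTorch. The names (`window_attention`, `window_size`, `shift`) are illustrative, not from any particular library, and the example omits pieces a full implementation such as Swin Transformer would include (learned Q/K/V projections, relative position bias, and a mask for tokens wrapped around by the shift); it only shows the core idea of attending within windows and shifting the partition between layers.

```python
import torch
import torch.nn.functional as F

def window_attention(x, window_size, shift=0):
    """Self-attention restricted to non-overlapping local windows.

    x: (batch, seq_len, dim) token embeddings; seq_len is assumed to be a
       multiple of window_size. A non-zero `shift` rolls the sequence
       before partitioning so successive layers see different windows.
    """
    B, L, D = x.shape
    if shift:
        # Shift the sequence so window boundaries move between layers.
        x = torch.roll(x, shifts=-shift, dims=1)

    # Partition into (batch * num_windows, window_size, dim).
    windows = x.view(B, L // window_size, window_size, D)
    windows = windows.reshape(-1, window_size, D)

    # Plain scaled dot-product attention inside each window
    # (here the tokens themselves act as queries, keys, and values).
    scores = windows @ windows.transpose(1, 2) / D ** 0.5
    attn = F.softmax(scores, dim=-1)
    out = attn @ windows

    # Undo the partition and the shift.
    out = out.reshape(B, L, D)
    if shift:
        out = torch.roll(out, shifts=shift, dims=1)
    return out


# Two consecutive layers: regular windows, then shifted windows, so
# tokens near a window boundary can attend across it in the second layer.
x = torch.randn(2, 16, 32)                         # batch=2, 16 tokens, dim=32
y = window_attention(x, window_size=4, shift=0)    # layer 1: aligned windows
z = window_attention(y, window_size=4, shift=2)    # layer 2: shift by window_size // 2
```

Because each token attends to only `window_size` others, the cost per layer scales linearly in sequence length rather than quadratically, and the alternating shift is what restores communication between windows over depth.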