The attention mask is a binary tensor indicating which positions hold real tokens and which hold padding, so that the model does not attend to the padded positions. Conventionally, a value of 1 marks a token the model should attend to and a value of 0 marks a padded position.
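A minimal sketch, assuming the Hugging Face Transformers tokenizer API (the `bert-base-uncased` checkpoint is just an illustrative choice): when a batch contains sequences of different lengths, the tokenizer pads the shorter ones and returns an `attention_mask` alongside `input_ids`.

```python
from transformers import AutoTokenizer

# Any tokenizer behaves the same way; bert-base-uncased is only an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Two sequences of different lengths force padding of the shorter one.
batch = tokenizer(
    ["A short sentence.", "A noticeably longer sentence that sets the batch length."],
    padding=True,          # pad the shorter sequence up to the longest one
    return_tensors="pt",   # return PyTorch tensors
)

print(batch["input_ids"])       # token ids, with pad ids filling the short row
print(batch["attention_mask"])  # 1 for real tokens, 0 for padded positions
```

Passing this `attention_mask` to the model along with `input_ids` is what tells its attention layers to ignore the padded positions.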