This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask": thon padded_sequences["attention_mask"] [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]] autoencoding models See encoder models and masked language modeling autoregressive models See causal language modeling and decoder models B backbone The backbone is the network (embeddings and layers) that outputs the raw hidden states or features.