|
", |
|
] |
|
encoded_inputs = tokenizer(batch_sentences) |
|
print(encoded_inputs) |
|
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1]]}
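Notice that the three lists in input_ids have different lengths (8, 15, and 7 tokens). As a quick sanity check, not part of the original example, you could print those lengths directly:

# Each encoded sentence has a different number of tokens,
# so the lists can't be stacked into a single tensor as-is.
print([len(ids) for ids in encoded_inputs["input_ids"]])
# [8, 15, 7]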
|
|
|
Pad
|
Sentences aren't always the same length, which can be an issue because tensors, the model inputs, need to have a uniform shape.
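As a minimal sketch of the idea (reusing the tokenizer and batch_sentences from above; padded_inputs is just an illustrative variable name), passing padding=True asks the tokenizer to pad every sequence to the length of the longest one in the batch:

# Pad shorter sentences with the tokenizer's padding token so every
# sequence in the batch ends up with the same length (15 here).
padded_inputs = tokenizer(batch_sentences, padding=True)
print([len(ids) for ids in padded_inputs["input_ids"]])
# [15, 15, 15]

The returned attention_mask then marks the added padding positions with 0 so the model can ignore them.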