Here is how you can create a function that realigns the tokens and labels and truncates sequences so they are no longer than DistilBERT's maximum input length:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
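        # word_ids has one entry per token: None for special tokens such as
        # [CLS] and [SEP], otherwise the index of the word the token came from,
        # so a word split into several subword tokens repeats the same index.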