Use the end-of-sequence token as the padding token and specify `mlm_probability` to randomly mask tokens each time you iterate over the data. In PyTorch:
```py
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```
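To see what the collator produces, you can call it directly on a couple of tokenized examples. This is a minimal sketch for inspection only; the sentences are illustrative and it assumes the `tokenizer` and `data_collator` defined above:

```py
# Sketch: collate two tokenized sentences and inspect the result.
examples = [
    tokenizer("The quick brown fox jumps over the lazy dog."),
    tokenizer("Masked language modeling predicts hidden tokens."),
]
batch = data_collator(examples)

# Roughly 15% of the tokens are selected for masking. `labels` holds the original
# ids at the selected positions and -100 everywhere else, so the loss ignores
# unmasked tokens.
print(batch["input_ids"])
print(batch["labels"])
```

Because the masking is sampled on every call, running the snippet twice produces different masked positions.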
In TensorFlow, create the same data collator but set `return_tensors="tf"` so it returns TensorFlow tensors:
```py
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
```
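The TensorFlow collator is typically passed as the `collate_fn` when converting the tokenized dataset into a `tf.data.Dataset`. The snippet below is a sketch; `model` (a TensorFlow masked language model) and the tokenized `lm_dataset` are assumed to come from elsewhere in this guide, and the batch size is arbitrary:

```py
# Sketch: build a tf.data.Dataset that pads and masks batches on the fly.
tf_train_set = model.prepare_tf_dataset(
    lm_dataset["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
```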
Train
If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial here!
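As a rough sketch of what that workflow looks like with the collator defined above: load a masked language model, define the training arguments, and pass everything to [`Trainer`]. The checkpoint name, output directory, hyperparameters, and `lm_dataset` splits below are placeholders, not a prescribed recipe:

```py
from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

# Placeholder checkpoint; use the model chosen earlier in this guide.
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")

training_args = TrainingArguments(
    output_dir="my_mlm_model",          # placeholder output directory
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],  # assumes a tokenized dataset split
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,        # dynamically pads and masks each batch
)

trainer.train()
```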