File size: 420 Bytes
5fa1a76 |
1 2 3 4 5 6 7 8 9 10 11 12 |
result = { k: [t[i : i + block_size] for i in range(0, total_length, block_size)] for k, t in concatenated_examples.items() } result["labels"] = result["input_ids"].copy() return result Apply the group_texts function over the entire dataset: lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4) Now create a batch of examples using [DataCollatorForLanguageModeling]. |