To process your dataset in one step, use the 🤗 Datasets map method to apply a preprocessing function over the entire dataset:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Tokenize the "text" column, padding and truncating every example to the model's maximum length
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
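Note that dataset is assumed to have been loaded earlier with 🤗 Datasets. A minimal sketch of that step, using an illustrative dataset name from the Hub (any dataset with a "text" column works the same way):
from datasets import load_dataset

# Example only: swap in the dataset you actually want to fine-tune on
dataset = load_dataset("yelp_review_full")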
If you like, you can create a smaller subset of the full dataset to fine-tune on, which reduces the time training takes:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
Train
At this point, you should follow the section corresponding to the framework you want to use.
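For example, if you go the PyTorch Trainer route, a minimal sketch might look like the following. It reuses the checkpoint and subsets from above; num_labels=5 is an assumption for a 5-class dataset such as the Yelp reviews example, so adjust it to your data:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# num_labels must match the number of classes in your dataset (5 is assumed here)
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=5
)

training_args = TrainingArguments(output_dir="test_trainer")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)

trainer.train()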