def extract_all_chars(batch):
    # Join all normalized transcriptions into one string and collect its unique characters.
    all_text = " ".join(batch["normalized_text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}


vocabs = dataset.map(
    extract_all_chars,
    batched=True,
    batch_size=-1,  # process the whole dataset as a single batch
    keep_in_memory=True,
    remove_columns=dataset.column_names,
)

dataset_vocab = set(vocabs["vocab"][0])
tokenizer_vocab = {k for k, _ in tokenizer.get_vocab().items()}
Now you have two sets of characters: one with the vocabulary from the dataset and one with the vocabulary from the tokenizer.
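To check whether the tokenizer can represent every character that appears in the dataset, one option is to take the set difference between the two vocabularies. The sketch below assumes the `dataset_vocab` and `tokenizer_vocab` sets computed above; any characters it prints would need to be mapped to supported characters or otherwise handled before training.

```python
# Characters that occur in the dataset but are missing from the tokenizer's vocabulary.
unsupported_chars = dataset_vocab - tokenizer_vocab
print(sorted(unsupported_chars))
```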