---
library_name: tokenizers
license: cc-by-sa-3.0
datasets:
- wikitext
language:
- en
tags:
- tokenizer
- wordlevel
- tokenizers
- wikitext
inference: false
---

# WikiText-WordLevel

This is a simple word-level tokenizer created using the [Tokenizers](https://github.com/huggingface/tokenizers) library. It was trained for educational purposes on the combined train, validation, and test splits of the [WikiText-103](https://huggingface.co/datasets/wikitext) corpus.

- Tokenizer Type: Word-Level
- Vocabulary Size: 75K
- Special Tokens: `<s>` (start of sequence), `</s>` (end of sequence), `<unk>` (unknown token)
- Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
- Pre-tokenization: Whitespace
- Code: [wikitext-wordlevel.py](wikitext-wordlevel.py) (a rough training sketch is shown below)
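
For reference, a configuration like the one above can be reproduced along these lines with the Tokenizers API. This is only a minimal sketch, not a copy of the linked script: the dataset configuration name (`wikitext-103-raw-v1`) and the way the splits are iterated are assumptions.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Word-level model that maps out-of-vocabulary words to <unk>
tokenizer = Tokenizer(models.WordLevel(unk_token='<unk>'))

# NFC normalization, whitespace stripping, and lowercasing, as listed above
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),
    normalizers.Strip(),
    normalizers.Lowercase(),
])

# Split on whitespace and punctuation, so "I'll" becomes i / ' / ll
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Cap the vocabulary at 75K and reserve the special tokens
trainer = trainers.WordLevelTrainer(
    vocab_size=75_000,
    special_tokens=['<s>', '</s>', '<unk>'],
)

# Train on the combined train, validation, and test splits (assumed config name)
dataset = load_dataset('wikitext', 'wikitext-103-raw-v1')
lines = (line for split in ('train', 'validation', 'test')
         for line in dataset[split]['text'])
tokenizer.train_from_iterator(lines, trainer=trainer)

tokenizer.save('tokenizer.json')
```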

The tokenizer can be used as simply as follows:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')

tokenizer.encode("I'll see you soon").ids     # => [68, 14, 2746, 577, 184, 595]
tokenizer.encode("I'll see you soon").tokens  # => ['i', "'", 'll', 'see', 'you', 'soon']
tokenizer.decode([68, 14, 2746, 577, 184, 595])  # => "i ' ll see you soon"
```
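
The same `Tokenizer` object also exposes vocabulary lookups, which can be handy when wiring the tokenizer into a model. A short sketch; the id of `<unk>` depends on the trained vocabulary and is not documented above:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')

print(tokenizer.get_vocab_size())      # total vocabulary size, about 75K per the card above
print(tokenizer.token_to_id('<unk>'))  # id of the unknown token
print(tokenizer.id_to_token(68))       # => 'i', per the encoding example above
```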