---
library_name: tokenizers
license: cc-by-sa-3.0
datasets:
- wikitext
language:
- en
tags:
- tokenizer
- wordlevel
- tokenizers
- wikitext
inference: false
---

# WikiText-WordLevel

This is a simple word-level tokenizer created using the [Tokenizers](https://github.com/huggingface/tokenizers) library. It was trained for educational purposes on the combined train, validation, and test splits of the [WikiText-103](https://huggingface.co/datasets/wikitext) corpus.

- Tokenizer Type: Word-Level
- Vocabulary Size: 75K
- Special Tokens: `<s>` (start of sequence), `</s>` (end of sequence), `<unk>` (unknown token)
- Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
- Pre-tokenization: Whitespace
- Code: [wikitext-wordlevel.py](wikitext-wordlevel.py) (a rough training sketch is shown below)
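
For reference, a configuration like the one above can be reproduced along these lines with the Tokenizers API. This is only a minimal sketch, not a copy of the linked script: the dataset configuration name (`wikitext-103-raw-v1`) and the way the splits are iterated are assumptions.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Word-level model that maps out-of-vocabulary words to <unk>
tokenizer = Tokenizer(models.WordLevel(unk_token='<unk>'))

# NFC normalization, whitespace stripping, and lowercasing, as listed above
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),
    normalizers.Strip(),
    normalizers.Lowercase(),
])

# Split on whitespace and punctuation, so "I'll" becomes i / ' / ll
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Cap the vocabulary at 75K and reserve the special tokens
trainer = trainers.WordLevelTrainer(
    vocab_size=75_000,
    special_tokens=['<s>', '</s>', '<unk>'],
)

# Train on the combined train, validation, and test splits (assumed config name)
dataset = load_dataset('wikitext', 'wikitext-103-raw-v1')
lines = (line for split in ('train', 'validation', 'test')
         for line in dataset[split]['text'])
tokenizer.train_from_iterator(lines, trainer=trainer)

tokenizer.save('tokenizer.json')
```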

The tokenizer can be used as simply as follows:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')

tokenizer.encode("I'll see you soon").ids     # => [68, 14, 2746, 577, 184, 595]
tokenizer.encode("I'll see you soon").tokens  # => ['i', "'", 'll', 'see', 'you', 'soon']
tokenizer.decode([68, 14, 2746, 577, 184, 595])  # => "i ' ll see you soon"
```
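
The same `Tokenizer` object also exposes vocabulary lookups, which can be handy when wiring the tokenizer into a model. A short sketch; the id of `<unk>` depends on the trained vocabulary and is not documented above:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')

print(tokenizer.get_vocab_size())      # total vocabulary size, about 75K per the card above
print(tokenizer.token_to_id('<unk>'))  # id of the unknown token
print(tokenizer.id_to_token(68))       # => 'i', per the encoding example above
```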