---
library_name: tokenizers
license: cc-by-sa-3.0
datasets:
- wikitext
language:
- en
tags:
- tokenizer
- wordlevel
- tokenizers
- wikitext
inference: false
---

# WikiText-WordLevel

This is a simple word-level tokenizer created using the [Tokenizers](https://github.com/huggingface/tokenizers) library. It was trained for educational purposes on the combined train, validation, and test splits of the [WikiText-103](https://huggingface.co/datasets/wikitext) corpus.

- Tokenizer Type: Word-Level
- Vocabulary Size: 75K
- Special Tokens: `<s>` (start of sequence), `</s>` (end of sequence), `<unk>` (unknown token)
- Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
- Pre-tokenization: Whitespace
- Code: [wikitext-wordlevel.py](wikitext-wordlevel.py)
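
For reference, a tokenizer with this configuration could be built and trained roughly as follows. This is a minimal sketch, not the actual training script (see [wikitext-wordlevel.py](wikitext-wordlevel.py) for that); in particular, the input file names are placeholders.

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer

# Word-level model with an explicit unknown token
tokenizer = Tokenizer(WordLevel(unk_token='<unk>'))

# NFC normalization, then strip surrounding whitespace, then lowercase
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),
    normalizers.Strip(),
    normalizers.Lowercase(),
])

# Split on whitespace and punctuation
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = WordLevelTrainer(
    vocab_size=75_000,
    special_tokens=['<s>', '</s>', '<unk>'],
)

# Placeholder file names for the combined WikiText-103 splits
tokenizer.train(
    files=['wiki.train.tokens', 'wiki.valid.tokens', 'wiki.test.tokens'],
    trainer=trainer,
)

tokenizer.save('tokenizer.json')
```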

The tokenizer can be used as simply as follows:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')

tokenizer.encode("I'll see you soon").ids  # => [68, 14, 2746, 577, 184, 595]

tokenizer.encode("I'll see you soon").tokens  # => ['i', "'", 'll', 'see', 'you', 'soon']

tokenizer.decode([68, 14, 2746, 577, 184, 595])  # => "i ' ll see you soon"
```
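
Note that the round trip is lossy: lowercasing happens during normalization, the pre-tokenizer splits off punctuation, and decoding simply joins tokens with single spaces, so `"I'll see you soon"` comes back as `"i ' ll see you soon"`.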