RoBERTa doesn't use token_type_ids, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or </s>).
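As a quick sketch (using the transformers library, with the roberta-base checkpoint chosen purely for illustration), passing two segments to the tokenizer inserts the separator for you, and you can also build the input by hand with tokenizer.sep_token:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Passing two segments: the separator is inserted automatically between them.
encoded = tokenizer("How old are you?", "I'm 6 years old.")
print(tokenizer.decode(encoded["input_ids"]))
# e.g. "<s>How old are you?</s></s>I'm 6 years old.</s>"

# Building the same kind of input by hand with the separation token.
manual = "How old are you?" + tokenizer.sep_token + "I'm 6 years old."
encoded_manual = tokenizer(manual)
```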

RoBERTa is the same as BERT but with better pretraining tricks:

dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all (see the sketch after this list)
no NSP (next sentence prediction) loss; instead of putting just two sentences together, chunks of contiguous text are packed together to reach 512 tokens (so the sentences are in an order that may span several documents)
train with larger batches
use BPE with bytes as a subunit rather than characters, so any unicode character can be encoded without out-of-vocabulary tokens
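As a sketch of the dynamic-masking point above: in transformers, DataCollatorForLanguageModeling re-samples the masked positions every time a batch is built, so the same example ends up masked differently at each epoch. The roberta-base checkpoint and the 15% masking probability here are illustrative assumptions, not something prescribed by this page:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

# Each call re-samples which tokens are replaced by <mask>,
# so the masks differ from epoch to epoch (dynamic masking).
batch_epoch_1 = collator(examples)
batch_epoch_2 = collator(examples)
```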
CamemBERT is a wrapper around RoBERTa.
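A minimal usage sketch, assuming the standard camembert-base checkpoint (shown only as an example):

```python
from transformers import CamembertTokenizer, CamembertModel

# Same RoBERTa architecture under the hood; only the French checkpoint
# and the SentencePiece tokenizer differ.
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")

inputs = tokenizer("J'aime le camembert !", return_tensors="pt")
outputs = model(**inputs)
```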