Commit e2f38ec
Parent(s): 82fd9d5
Update README.md
README.md CHANGED
@@ -20,10 +20,22 @@ widget:
 
 # RoBERTa base trained with data from National Library of Spain (BNE)
 
-##
-
+## Model Description
+RoBERTa-base-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa]() base model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the National Library of Spain from 2009 to 2019.
 
-##
+## Training corpora and preprocessing
+We cleaned 59TB of WARC files and deduplicated them at the computing-node level, which resulted in 2TB of clean Spanish corpus. We then performed a global deduplication, resulting in 570GB of text.
+
+Some statistics of the corpus:
+
+| Corpora | Number of documents | Number of tokens | Size (GB) |
+|---------|---------------------|------------------|-----------|
+| BNE     | 201,080,084         | 135,733,450,668  | 570       |
+
+## Tokenization and pre-training
+We trained a BBPE tokenizer with a vocabulary size of 50,262 tokens. We used 10,000 documents for validation and trained the model for 48 hours on 16 computing nodes with 4 NVIDIA V100 GPUs per node.
+
+## Evaluation and results
 For evaluation details visit our [GitHub repository](https://github.com/PlanTL-SANIDAD/lm-spanish).
 
 ## Citing

@@ -38,9 +50,4 @@ Check out our paper for all the details: https://arxiv.org/abs/2107.07253
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
-```
-
-## Corpora
-| Corpora | Number of documents | Size (GB) |
-|---------|---------------------|-----------|
-| BNE | 201,080,084 | 570GB |
+```
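The new "Model Description" section describes a standard masked language model, so the usual Hugging Face `transformers` fill-mask pipeline applies. A minimal sketch, assuming the model is published under an ID like `PlanTL-SANIDAD/roberta-base-bne` (the actual repository ID may differ):

```python
from transformers import pipeline

# RoBERTa-base-bne is a masked language model, so fill-mask is the
# natural task to exercise it with. The model ID here is an assumption.
fill_mask = pipeline("fill-mask", model="PlanTL-SANIDAD/roberta-base-bne")

# RoBERTa-style tokenizers use <mask> as the mask token.
for prediction in fill_mask("Madrid es la <mask> de España."):
    print(prediction["token_str"], round(prediction["score"], 4))
```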
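The "Training corpora and preprocessing" section describes a two-stage deduplication: first per computing node, then globally. As a generic illustration (not the authors' actual pipeline), document-level exact deduplication can be sketched with content hashing:

```python
import hashlib

def deduplicate(documents):
    """Drop exact duplicates, keyed by a hash of whitespace-normalized text."""
    seen = set()
    for doc in documents:
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = ["Hola mundo.", "Hola  mundo.", "Otro documento distinto."]
print(list(deduplicate(docs)))  # the whitespace-variant duplicate is dropped
```

At corpus scale the same idea is typically run distributed, with the per-node hash sets merged for the global pass.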
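For the "Tokenization and pre-training" section, a minimal sketch of training a byte-level BPE (BBPE) tokenizer with the Hugging Face `tokenizers` library; the toy in-memory corpus and the training options are assumptions, and only the vocabulary size of 50,262 comes from the card:

```python
from tokenizers import ByteLevelBPETokenizer

# Toy corpus; the real tokenizer was trained on the 570GB BNE corpus.
corpus = [
    "La Biblioteca Nacional de España custodia el patrimonio bibliográfico.",
    "El corpus limpio y deduplicado ocupa 570GB de texto.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=50262,  # target vocabulary size reported in the card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

print(tokenizer.encode(corpus[0]).tokens)
```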