Commit e2f38ec
Parent(s): 82fd9d5
Update README.md
README.md CHANGED
@@ -20,10 +20,22 @@ widget:
 
 # RoBERTa base trained with data from National Library of Spain (BNE)
 
-##
-
+## Model Description
+RoBERTa-base-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa]() base model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the National Library of Spain from 2009 to 2019.
 
-##
+## Training corpora and preprocessing
+We cleaned 59TB of WARC files and deduplicated them at the computing-node level, which resulted in 2TB of clean Spanish corpus. We then performed a global deduplication, resulting in 570GB of text.
+
+Some statistics of the corpus:
+
+| Corpora | Number of documents | Number of tokens | Size (GB) |
+|---------|---------------------|------------------|-----------|
+| BNE     | 201,080,084         | 135,733,450,668  | 570       |
+
+## Tokenization and pre-training
+We trained a BBPE tokenizer with a vocabulary size of 50,262 tokens. We used 10,000 documents for validation and trained the model for 48 hours on 16 computing nodes with 4 NVIDIA V100 GPUs per node.
+
+## Evaluation and results
 For evaluation details visit our [GitHub repository](https://github.com/PlanTL-SANIDAD/lm-spanish).
 
 ## Citing

@@ -38,9 +50,4 @@ Check out our paper for all the details: https://arxiv.org/abs/2107.07253
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
-```
-
-## Corpora
-| Corpora | Number of documents | Size (GB) |
-|---------|---------------------|-----------|
-| BNE | 201,080,084 | 570GB |
+```
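The new "Model Description" section describes a standard masked language model, so the usual Hugging Face `transformers` fill-mask pipeline applies. A minimal sketch, assuming the model is published under an ID like `PlanTL-SANIDAD/roberta-base-bne` (the actual repository ID may differ):

```python
from transformers import pipeline

# RoBERTa-base-bne is a masked language model, so fill-mask is the
# natural task to exercise it with. The model ID here is an assumption.
fill_mask = pipeline("fill-mask", model="PlanTL-SANIDAD/roberta-base-bne")

# RoBERTa-style tokenizers use <mask> as the mask token.
for prediction in fill_mask("Madrid es la <mask> de España."):
    print(prediction["token_str"], round(prediction["score"], 4))
```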
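The "Training corpora and preprocessing" section describes a two-stage deduplication: first per computing node, then globally. As a generic illustration (not the authors' actual pipeline), document-level exact deduplication can be sketched with content hashing:

```python
import hashlib

def deduplicate(documents):
    """Drop exact duplicates, keyed by a hash of whitespace-normalized text."""
    seen = set()
    for doc in documents:
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = ["Hola mundo.", "Hola  mundo.", "Otro documento distinto."]
print(list(deduplicate(docs)))  # the whitespace-variant duplicate is dropped
```

At corpus scale the same idea is typically run distributed, with the per-node hash sets merged for the global pass.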
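For the "Tokenization and pre-training" section, a minimal sketch of training a byte-level BPE (BBPE) tokenizer with the Hugging Face `tokenizers` library; the toy in-memory corpus and the training options are assumptions, and only the vocabulary size of 50,262 comes from the card:

```python
from tokenizers import ByteLevelBPETokenizer

# Toy corpus; the real tokenizer was trained on the 570GB BNE corpus.
corpus = [
    "La Biblioteca Nacional de España custodia el patrimonio bibliográfico.",
    "El corpus limpio y deduplicado ocupa 570GB de texto.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=50262,  # target vocabulary size reported in the card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

print(tokenizer.encode(corpus[0]).tokens)
```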