---
license: gpl-3.0
language:
  - nl
base_model:
  - CLTL/MedRoBERTa.nl
tags:
  - medical
  - healthcare
metrics:
  - perplexity
library_name: transformers
---

# CardioBERTa_base.nl

Continued, off-premises pre-training of MedRoBERTa.nl on approximately 50 GB of open Dutch and translated English medical corpora.
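
Being a RoBERTa-type masked language model, it can be used directly with the `transformers` fill-mask pipeline. A minimal sketch, assuming the weights are hosted on the Hugging Face Hub under this repository's id (`UMCU/CardioBERTa_base.nl`):

```python
from transformers import pipeline

# Hub id assumed from this repository; adjust if the weights live elsewhere.
fill_mask = pipeline("fill-mask", model="UMCU/CardioBERTa_base.nl")

# RoBERTa-style tokenizers use <mask> as the mask token.
for pred in fill_mask("De patiënt werd opgenomen met <mask> op de borst."):
    print(pred["token_str"], pred["score"])
```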

## Data statistics

Sources:

- Dutch: medical guidelines (FMS, NHG)
- Dutch: NtvG papers
- English: PubMed abstracts
- English: PMC abstracts, translated using DeepL
- English: Apollo guidelines, papers and books
- English: Meditron guidelines
- English: MIMIC-III
- English: MIMIC-CXR
- English: MIMIC-IV

All English sources not already translated with DeepL were translated using a combination of Gemini 1.5 Flash, GPT-4o mini, MarianMT, and NLLB-200.
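
As an illustration of the MarianMT leg of that translation pipeline, the sketch below translates an English sentence into Dutch with a publicly available checkpoint (`Helsinki-NLP/opus-mt-en-nl`); the exact checkpoints and prompts used to build this corpus are not specified here.

```python
from transformers import pipeline

# Public English->Dutch MarianMT checkpoint, used purely for illustration.
translate = pipeline("translation", model="Helsinki-NLP/opus-mt-en-nl")

result = translate("The patient was admitted with chest pain.")
print(result[0]["translation_text"])
```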

- Number of tokens: 15B
- Number of documents: 27M

## Training

- Effective batch size: 5120
- Learning rate: 2e-4
- Weight decay: 1e-3
- Learning rate schedule: linear, with 5,000 warmup steps
- Number of epochs: ~3
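
For a comparable run, these settings map onto Hugging Face `TrainingArguments` roughly as sketched below; the per-device batch size and gradient-accumulation split are assumptions, chosen only so that their product matches the effective batch size of 5120.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="cardioberta_base_nl",
    per_device_train_batch_size=64,   # assumption; not stated in the card
    gradient_accumulation_steps=80,   # 64 * 80 = 5120 effective (single device)
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,               # the card reports ~3 epochs
)
```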

- Train perplexity: 3.0
- Validation perplexity: 3.0
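
For reference, masked-language-model perplexity is conventionally reported as the exponential of the mean masked cross-entropy loss. A minimal evaluation sketch, again assuming the `UMCU/CardioBERTa_base.nl` hub id and the standard 15% random masking:

```python
import math

import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tok = AutoTokenizer.from_pretrained("UMCU/CardioBERTa_base.nl")
model = AutoModelForMaskedLM.from_pretrained("UMCU/CardioBERTa_base.nl").eval()
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)

# Random masking on a single short sentence; a real evaluation would
# average the loss over a held-out corpus.
batch = collator([tok("De patiënt is opgenomen met pijn op de borst.")])
with torch.no_grad():
    loss = model(**batch).loss

print(f"perplexity ≈ {math.exp(loss.item()):.2f}")
```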

## Acknowledgement

This work was done together with Amsterdam UMC in the context of the DataTools4Heart project.

We were happy to be able to use the Google TPU Research Cloud to train this model.