---
license: mit
datasets:
- ncbi/pubmed
language:
- en
tags:
- biomedical-text
- nlp
- biomedical-nlp
- discharge-notes
- healthcare
- pubmed
pipeline_tag: feature-extraction
base_model:
- answerdotai/ModernBERT-base
library_name: transformers
---
# Clinical ModernBERT
Clinical ModernBERT is a state-of-the-art encoder-based transformer tailored to biomedical and clinical text, supporting context lengths of up to **8,192 tokens**. Building on the innovations introduced by ModernBERT, it incorporates domain-specific vocabulary refinements and is designed to produce semantically rich representations that capture both the nuanced syntax of biomedical literature and the intricate semantics of clinical narratives.
## Usage
Pre-trained model weights and tokenizer artifacts are provided for easy integration into downstream biomedical NLP tasks:
```python
from transformers import AutoModel, AutoTokenizer
# Load the pre-trained encoder and its tokenizer from the Hugging Face Hub
model = AutoModel.from_pretrained('Simonlee711/Clinical_ModernBERT')
tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
```
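As a purely illustrative example (the mean-pooling strategy below is a common choice for feature extraction, not something prescribed by this card), the loaded encoder can be used to embed a clinical sentence:
```python
import torch

# Continuing from the snippet above: embed a sentence by mean-pooling the final
# hidden states over non-padding tokens (pooling choice is illustrative).
text = "Patient admitted with diabetic ketoacidosis; started on an insulin infusion."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    outputs = model(**inputs)
mask = inputs["attention_mask"].unsqueeze(-1).float()            # (1, seq_len, 1)
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, hidden_size)
```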
## Model Overview
Below is a table summarizing ModernBERT's key architectural components and their benefits:
| **Feature** | **Description** | **Benefit** |
|------------------------------|-----------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| Extended Context Length | Processes sequences up to **8,192 tokens**. | Captures long-range dependencies and full document contexts, essential for complex linguistic tasks. |
| GeGLU Activation | Uses the GeGLU activation, a gated variant of GeLU. | Enhances non-linear representation and model stability by allowing controlled information flow. |
| Rotary Positional Embeddings | Implements RoPE to encode relative positional information. | Provides robust handling of positional data, especially beneficial for extended contexts. |
| Flash Attention | Employs Flash Attention to compute self-attention blockwise. | Reduces memory overhead from quadratic to near-linear complexity, enabling efficient processing of long sequences. |
This model leverages a suite of modern architectural advances, including rotary positional embeddings (RoPE), Flash Attention for near-linear memory usage at extended context lengths, and GeGLU activation layers that enhance representational capacity through smooth gating. By initializing from a ModernBERT-base checkpoint and applying domain-specific pre-training on approximately 40 million PubMed abstracts combined with MIMIC-IV clinical notes, Clinical ModernBERT is optimized for tasks such as retrieval-augmented generation, fine-grained text classification, and domain-specific entity extraction.
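For intuition, here is a minimal, self-contained sketch of a GeGLU feed-forward block; the class name, layer names, and dimensions are illustrative and are not taken from the ModernBERT source:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """Sketch of a gated GeLU feed-forward block: one half of the projection gates the other."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 2 * d_ff)  # produces [value | gate]
        self.proj_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(value * F.gelu(gate))   # the gate controls information flow
```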
## Pre-training Optimizations
| **Parameter** | **Value** | **Description** |
|--------------------------|---------------------|---------------------------------------------------------------------|
| Total Tokens | 13,004,002,816 | Total number of tokens in the unified pre-training corpus |
| Pre-training Corpus | PubMed + MIMIC-IV + Medical Codes & Descriptions | Approximately 40M PubMed abstracts, MIMIC-IV clinical notes, and medical code and description pairs (e.g., ICD-9 code 250.00: Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled) |
| Training Steps | 150,000 | Total number of masked language modeling (MLM) training steps |
| Batch Size | 128 | Batch size used during training |
## Masked Language Modeling (MLM) Setup
Clinical ModernBERT is pre-trained using a multi-phase masked language modeling (MLM) strategy. A custom collator dynamically adjusts the masking probability—beginning at 30% and decreasing to 15% over the course of training—to emphasize medically relevant tokens (e.g., drug names, procedural codes). The MLM objective is defined as
$$
\mathcal{L}_{\text{MLM}} = - \sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid \tilde{x}),
$$
where the model predicts each masked token $x_i$ given its corrupted context $\tilde{x}$. The table below reports masked-token recovery accuracy at several top-$k$ cutoffs:
| **Metric** | **Top-1 Accuracy** | **Top-5 Accuracy** | **Top-10 Accuracy** | **Top-25 Accuracy** |
|------------------|--------------------|--------------------|---------------------|---------------------|
| **Value (%)** | 63.31 | 79.67 | 83.33 | 88.10 |
Higher top-$k$ cutoffs reflect broader lexical recall; across all cutoffs, the model consistently ranks clinically appropriate tokens among its top predictions.
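The exact collator is described in the paper and repository. As a rough sketch only, a decaying masking probability could be wired into the standard Hugging Face collator as shown below; the linear schedule and the `mlm_probability_at` helper are assumptions, and the emphasis on medically relevant tokens is not reproduced here:
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

def mlm_probability_at(step: int, total_steps: int = 150_000,
                       start_p: float = 0.30, end_p: float = 0.15) -> float:
    """Linearly decay the masking probability from 30% to 15% over training (assumed schedule)."""
    frac = min(step / total_steps, 1.0)
    return start_p + frac * (end_p - start_p)

tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
# Rebuild the collator periodically so later batches are masked at a lower rate.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=mlm_probability_at(step=50_000),  # ~0.25 one third of the way through
)
```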
## Intended Use
Clinical ModernBERT is ideally suited for tasks that demand an in-depth understanding of biomedical language. It is particularly valuable for clinical information retrieval, narrative classification, and structured medical coding. Researchers and practitioners may fine-tune this model for specialized downstream applications such as electronic health record analysis, clinical decision support systems, and evidence-based medical literature retrieval.
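For example, fine-tuning for a downstream classification task can start from the standard `transformers` sequence-classification head; the task and `num_labels` below are purely illustrative:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical downstream setup: binary classification over clinical notes.
tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
model = AutoModelForSequenceClassification.from_pretrained(
    'Simonlee711/Clinical_ModernBERT',
    num_labels=2,  # illustrative label count; set this to match your task
)
```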
## Citations and Pre-training Source Code
The pre-training source code is available at [Clinical ModernBERT GitHub](https://github.com/Simonlee711/Clinical_ModernBERT).
### Citing the Model
```
@misc{simon_lee_2025,
author = { Simon Lee },
title = { Clinical_ModernBERT (Revision 24e72d6) },
year = 2025,
url = { https://huggingface.co/Simonlee711/Clinical_ModernBERT },
doi = { 10.57967/hf/4999 },
publisher = { Hugging Face }
}
```
### Citing the Paper
```
@article{lee2025clinical,
title={Clinical ModernBERT: An efficient and long context encoder for biomedical text},
author={Lee, Simon A and Wu, Anthony and Chiang, Jeffrey N},
journal={arXiv preprint arXiv:2504.03964},
year={2025}
}
```
## Questions
For questions or feedback, email simonlee711@g.ucla.edu.