---
license: mit
datasets:
- ncbi/pubmed
language:
- en
tags:
- biomedical-text
- nlp
- biomedical-nlp
- discharge-notes
- healthcare
- pubmed
pipeline_tag: feature-extraction
base_model:
- answerdotai/ModernBERT-base
library_name: transformers
---
# Clinical ModernBERT
Clinical ModernBERT is a state-of-the-art encoder-based transformer tailored to biomedical and clinical text, supporting context lengths of up to **8,192 tokens**. Building on the innovations introduced by ModernBERT, it incorporates domain-specific vocabulary refinements and is designed to produce semantically rich representations that capture both the nuanced syntax of biomedical literature and the intricate semantics of clinical narratives.
## Usage
Pre-trained model weights and tokenizer artifacts are provided for easy integration into downstream biomedical NLP tasks:
```python
from transformers import AutoModel, AutoTokenizer
# Load the pre-trained encoder and its tokenizer from the Hugging Face Hub
model = AutoModel.from_pretrained('Simonlee711/Clinical_ModernBERT')
tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
```
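As a purely illustrative example (the mean-pooling strategy below is a common choice for feature extraction, not something prescribed by this card), the loaded encoder can be used to embed a clinical sentence:
```python
import torch

# Continuing from the snippet above: embed a sentence by mean-pooling the final
# hidden states over non-padding tokens (pooling choice is illustrative).
text = "Patient admitted with diabetic ketoacidosis; started on an insulin infusion."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    outputs = model(**inputs)
mask = inputs["attention_mask"].unsqueeze(-1).float()            # (1, seq_len, 1)
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, hidden_size)
```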
## Model Overview
Below is a table summarizing ModernBERT's key architectural components and their benefits:
| **Feature** | **Description** | **Benefit** |
|------------------------------|-----------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| Extended Context Length | Processes sequences up to **8,192 tokens**. | Captures long-range dependencies and full document contexts, essential for complex linguistic tasks. |
| GeGLU Activation | Uses the GeGLU activation, a gated variant of GeLU. | Enhances non-linear representation and model stability by allowing controlled information flow. |
| Rotary Positional Embeddings | Implements RoPE to encode relative positional information. | Provides robust handling of positional data, especially beneficial for extended contexts. |
| Flash Attention | Employs Flash Attention to compute self-attention blockwise. | Reduces memory overhead from quadratic to near-linear complexity, enabling efficient processing of long sequences. |
This model leverages a suite of modern architectural advances, including rotary positional embeddings (RoPE), Flash Attention for near-linear memory usage at extended context lengths, and GeGLU activation layers that enhance representational capacity through smooth gating. By initializing from a ModernBERT-base checkpoint and applying domain-specific pre-training on approximately 40 million PubMed abstracts combined with MIMIC-IV clinical notes, Clinical ModernBERT is optimized for tasks such as retrieval-augmented generation, fine-grained text classification, and domain-specific entity extraction.
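For intuition, here is a minimal, self-contained sketch of a GeGLU feed-forward block; the class name, layer names, and dimensions are illustrative and are not taken from the ModernBERT source:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """Sketch of a gated GeLU feed-forward block: one half of the projection gates the other."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 2 * d_ff)  # produces [value | gate]
        self.proj_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(value * F.gelu(gate))   # the gate controls information flow
```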
## Pre-training Optimizations
| **Parameter** | **Value** | **Description** |
|--------------------------|---------------------|---------------------------------------------------------------------|
| Total Tokens | 13,004,002,816 | Total number of tokens in the unified pre-training corpus |
| Pre-training Corpus | PubMed + MIMIC-IV + Medical Codes & Descriptions | Approximately 40M PubMed abstracts, MIMIC-IV clinical notes, and medical code and description pairs (e.g., ICD-9 code 250.00: Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled) |
| Training Steps | 150,000 | Total number of masked language modeling (MLM) training steps |
| Batch Size | 128 | Batch size used during training |
## Masked Language Modeling (MLM) Setup
Clinical ModernBERT is pre-trained using a multi-phase masked language modeling (MLM) strategy. A custom collator dynamically adjusts the masking probability—beginning at 30% and decreasing to 15% over the course of training—to emphasize medically relevant tokens (e.g., drug names, procedural codes). The MLM objective is defined as
$$
\mathcal{L}_{\text{MLM}} = - \sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid \tilde{x}),
$$
where the model predicts each masked token $x_i$ given its corrupted context $\tilde{x}$. The table below reports masked-token recovery accuracy at several top-$k$ cutoffs:
| **Metric** | **Top-1 Accuracy** | **Top-5 Accuracy** | **Top-10 Accuracy** | **Top-25 Accuracy** |
|------------------|--------------------|--------------------|---------------------|---------------------|
| **Value (%)** | 63.31 | 79.67 | 83.33 | 88.10 |
Higher top-$k$ cutoffs reflect broader lexical recall; across all cutoffs, the model consistently ranks clinically appropriate tokens among its top predictions.
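The exact collator is described in the paper and repository. As a rough sketch only, a decaying masking probability could be wired into the standard Hugging Face collator as shown below; the linear schedule and the `mlm_probability_at` helper are assumptions, and the emphasis on medically relevant tokens is not reproduced here:
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

def mlm_probability_at(step: int, total_steps: int = 150_000,
                       start_p: float = 0.30, end_p: float = 0.15) -> float:
    """Linearly decay the masking probability from 30% to 15% over training (assumed schedule)."""
    frac = min(step / total_steps, 1.0)
    return start_p + frac * (end_p - start_p)

tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
# Rebuild the collator periodically so later batches are masked at a lower rate.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=mlm_probability_at(step=50_000),  # ~0.25 one third of the way through
)
```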
## Intended Use
Clinical ModernBERT is ideally suited for tasks that demand an in-depth understanding of biomedical language. It is particularly valuable for clinical information retrieval, narrative classification, and structured medical coding. Researchers and practitioners may fine-tune this model for specialized downstream applications such as electronic health record analysis, clinical decision support systems, and evidence-based medical literature retrieval.
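For example, fine-tuning for a downstream classification task can start from the standard `transformers` sequence-classification head; the task and `num_labels` below are purely illustrative:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical downstream setup: binary classification over clinical notes.
tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
model = AutoModelForSequenceClassification.from_pretrained(
    'Simonlee711/Clinical_ModernBERT',
    num_labels=2,  # illustrative label count; set this to match your task
)
```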
## Citations and Pre-training Source Code
The pre-training source code is available at [Clinical ModernBERT GitHub](https://github.com/Simonlee711/Clinical_ModernBERT).
### Citing the Model
```
@misc{simon_lee_2025,
author = { Simon Lee },
title = { Clinical_ModernBERT (Revision 24e72d6) },
year = 2025,
url = { https://huggingface.co/Simonlee711/Clinical_ModernBERT },
doi = { 10.57967/hf/4999 },
publisher = { Hugging Face }
}
```
### Citing the Paper
```
@article{lee2025clinical,
title={Clinical ModernBERT: An efficient and long context encoder for biomedical text},
author={Lee, Simon A and Wu, Anthony and Chiang, Jeffrey N},
journal={arXiv preprint arXiv:2504.03964},
year={2025}
}
```
## Questions
For questions or feedback, email simonlee711@g.ucla.edu.