SODA-VEC Negative Sampling: Biomedical Sentence Embeddings
Model Overview
SODA-VEC Negative Sampling is a specialized sentence embedding model trained on 26.5M biomedical text pairs using the MultipleNegativesRankingLoss from sentence-transformers. This model is optimized for biomedical and life sciences applications, providing high-quality semantic representations for scientific literature.
Key Features
- Biomedical Specialization: Trained exclusively on PubMed abstracts and titles
- Large Scale: 26.5M training pairs from the complete PubMed baseline (July 2024)
- Modern Architecture: Based on ModernBERT-embed-base with 768-dimensional embeddings
- Negative Sampling: Uses the standard MultipleNegativesRankingLoss for robust contrastive learning
- Production Ready: Optimized training with FP16, gradient clipping, and cosine scheduling
Model Details
Base Model
- Architecture: ModernBERT-embed-base (nomic-ai/modernbert-embed-base)
- Embedding Dimension: 768
- Max Sequence Length: 768 tokens
- Parameters: ~149M
Training Configuration
- Loss Function: MultipleNegativesRankingLoss (sentence-transformers)
- Training Data: 26,473,900 biomedical text pairs
- Epochs: 3
- Effective Batch Size: 256 (32 per GPU × 4 GPUs × 2 gradient accumulation steps)
- Learning Rate: 1e-5 with cosine scheduling
- Optimization: AdamW with weight decay (0.01)
- Precision: FP16 for efficiency
- Hardware: 4x Tesla V100-DGXS-32GB
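Under these settings, the training loop corresponds roughly to the sentence-transformers v3 trainer API. The sketch below is a minimal illustration, not the actual training script: the two-example dataset stands in for the 26.5M title/abstract pairs, the output path is hypothetical, and the multi-GPU launch (e.g. via torchrun) is omitted.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Base model, as listed above
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# Tiny stand-in for the real title/abstract pair dataset
pairs = Dataset.from_dict({
    "anchor": ["CRISPR-Cas9 gene editing in human embryos",
               "mRNA vaccine efficacy against COVID-19 variants"],
    "positive": ["Abstract text describing a CRISPR-Cas9 embryo-editing study ...",
                 "Abstract text describing an mRNA vaccine efficacy study ..."],
})

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="soda-vec-negative-sampling",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,            # 32 x 4 GPUs x 2 = effective batch size 256
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    max_grad_norm=5.0,
    fp16=True,
    seed=42,
    eval_strategy="steps",                    # loss-only evaluation every 1000 steps
    eval_steps=1000,
    save_strategy="epoch",                    # checkpoint at the end of each epoch
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    eval_dataset=pairs,   # placeholder; the real run uses a held-out validation split
    loss=loss,
)
trainer.train()
```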
Dataset
Source Data
- Origin: Complete PubMed baseline (July 2024)
- Content: Scientific abstracts and titles from biomedical literature
- Quality: 99.7% retention after filtering (128-6,000 character abstracts)
- Splits: 99.6% train / 0.2% validation / 0.2% test
Data Processing
- Error pattern removal and quality filtering
- Balanced train/validation/test splits
- Character length filtering for optimal training
- Duplicate detection and removal
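The preprocessing code itself is not part of this repository; the sketch below only illustrates the steps listed above (character-length window, duplicate removal), with the error-pattern check reduced to a hypothetical placeholder.

```python
def filter_pairs(pairs):
    """Illustrative filtering of (title, abstract) pairs."""
    seen = set()
    kept = []
    for title, abstract in pairs:
        # Character-length filtering: keep abstracts of 128-6,000 characters
        if not 128 <= len(abstract) <= 6000:
            continue
        # Placeholder for error-pattern removal (the actual patterns are not published)
        if "[This corrects the article" in abstract:
            continue
        # Duplicate detection and removal
        key = (title.strip().lower(), abstract.strip().lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((title, abstract))
    return kept
```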
Performance & Use Cases
Intended Applications
- Literature Search: Semantic search across biomedical publications
- Research Discovery: Finding related papers and concepts
- Knowledge Mining: Extracting relationships from scientific text
- Document Classification: Categorizing biomedical documents
- Similarity Analysis: Comparing research abstracts and papers
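For the classification use case, the embeddings can be fed to a lightweight downstream classifier as fixed features. The sketch below is purely illustrative: the texts, labels, and the choice of scikit-learn's LogisticRegression are not part of this model's training or evaluation.

```python
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("EMBO/soda-vec-negative-sampling")

# Toy labelled abstracts (illustrative only)
texts = [
    "Randomized trial of statin therapy in cardiovascular patients",
    "CRISPR screening identifies regulators of T cell exhaustion",
    "Beta-blocker dosing strategies in chronic heart failure",
    "Single-cell RNA sequencing of tumor-infiltrating lymphocytes",
]
labels = ["clinical", "molecular", "clinical", "molecular"]

clf = LogisticRegression(max_iter=1000).fit(model.encode(texts), labels)
print(clf.predict(model.encode(["Gene expression profiling of immune cell subsets"])))
```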
Biomedical Domains
- Molecular Biology
- Clinical Medicine
- Pharmacology
- Genetics & Genomics
- Biochemistry
- Neuroscience
- Public Health
Usage
Installation
```bash
pip install sentence-transformers
```
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('EMBO/soda-vec-negative-sampling')

# Encode biomedical texts
texts = [
    "CRISPR-Cas9 gene editing in human embryos",
    "mRNA vaccine efficacy against COVID-19 variants",
    "Protein folding mechanisms in neurodegenerative diseases"
]
embeddings = model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 768)
```
Semantic Search
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Query and corpus
query = "Alzheimer's disease biomarkers"
corpus = [
    "Tau protein aggregation in neurodegeneration",
    "COVID-19 vaccine development strategies",
    "Beta-amyloid plaques in dementia patients"
]

# Encode (reuses the model loaded in Basic Usage above)
query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Find the most similar corpus entry
similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
best_match = np.argmax(similarities)
print(f"Best match: {corpus[best_match]} (similarity: {similarities[best_match]:.3f})")
```
Training Details
Loss Function
The model uses MultipleNegativesRankingLoss, which:
- Treats every other positive in the batch as a negative for a given anchor
- Optimizes for high similarity between related texts
- Provides robust contrastive learning without explicit negative sampling
- Well-established in sentence-transformers ecosystem
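Conceptually, for a batch of (anchor, positive) pairs the loss is a cross-entropy over a scaled similarity matrix in which each anchor's own positive is the correct class. The sketch below mirrors that idea (scale=20 is the sentence-transformers default); it is an illustration, not the library's internal implementation.

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    # Cosine similarity between every anchor and every positive in the batch
    scores = F.normalize(anchors, dim=1) @ F.normalize(positives, dim=1).T * scale
    # Each anchor's own positive sits on the diagonal, so its index is the target
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```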
Training Process
- Duration: ~4 days on 4x V100 GPUs
- Steps: 310,239 total training steps
- Evaluation: Every 1000 steps (310 evaluations, 1.8% overhead)
- Monitoring: Real-time TensorBoard logging
- Checkpointing: Model saved at end of each epoch
Optimization Features
- Gradient clipping (max_norm=5.0) for training stability
- Weight decay regularization for generalization
- Cosine learning rate scheduling
- Loss-only evaluation for efficiency
- Reproducible training (seed=42)
Technical Specifications
Hardware Requirements
- Training: 4x Tesla V100-DGXS-32GB (recommended)
- Inference: Any GPU with 4GB+ VRAM, or CPU
- Memory: ~2GB GPU memory for inference
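A brief sketch of device selection for inference (the device string is an example; pass "cuda" when a GPU is available):

```python
from sentence_transformers import SentenceTransformer

# Load on CPU explicitly; use device="cuda" for GPU inference
model = SentenceTransformer("EMBO/soda-vec-negative-sampling", device="cpu")
embeddings = model.encode(["Tau protein aggregation in neurodegeneration"], batch_size=32)
```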
Software Dependencies
- sentence-transformers >= 2.0.0
- transformers >= 4.20.0
- torch >= 1.12.0
- Python >= 3.8
Comparison with SODA-VEC (VICReg)
| Feature | SODA-VEC (VICReg) | SODA-VEC Negative Sampling |
|---|---|---|
| Loss Function | VICReg (custom biomedical) | MultipleNegativesRankingLoss |
| Optimization | Empirically tuned coefficients | Standard contrastive learning |
| Training Data | Same (26.5M pairs) | Same (26.5M pairs) |
| Use Case | Biomedical research focus | General semantic similarity |
| Framework | Custom implementation | sentence-transformers standard |
Limitations
- Domain Specificity: Optimized for biomedical text, may not generalize to other domains
- Language: English-only training data
- Recency: Training data cutoff at July 2024
- Bias: May reflect biases present in PubMed literature
Citation
If you use this model in your research, please cite:
```bibtex
@misc{soda-vec-negative-sampling-2024,
  title={SODA-VEC Negative Sampling: Biomedical Sentence Embeddings},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-negative-sampling},
  note={Trained on 26.5M PubMed text pairs using MultipleNegativesRankingLoss}
}
```
License
This model is released under the same license as the base ModernBERT model. Please refer to the original model card for licensing details.
Acknowledgments
- Base Model: nomic-ai/modernbert-embed-base
- Training Framework: sentence-transformers
- Data Source: PubMed/MEDLINE database
- Infrastructure: EMBO computational resources
Model Card Contact
For questions about this model, please contact EMBO or open an issue in the associated repository.
Last Updated: August 2024
Model Version: 1.0
Training Completion: In Progress (ETA: 4 days)