SODA-VEC Negative Sampling: Biomedical Sentence Embeddings

Model Overview

SODA-VEC Negative Sampling is a specialized sentence embedding model trained on 26.5M biomedical text pairs using the MultipleNegativesRankingLoss from sentence-transformers. This model is optimized for biomedical and life sciences applications, providing high-quality semantic representations for scientific literature.

Key Features

  • 🧬 Biomedical Specialization: Trained exclusively on PubMed abstracts and titles
  • 🔬 Large Scale: 26.5M training pairs from the complete PubMed baseline (July 2024)
  • ⚡ Modern Architecture: Based on ModernBERT-embed-base with 768-dimensional embeddings
  • 🎯 Negative Sampling: Uses the standard MultipleNegativesRankingLoss for robust contrastive learning
  • 📊 Production Ready: Optimized training with FP16, gradient clipping, and cosine scheduling

Model Details

Base Model

  • Architecture: ModernBERT-embed-base (nomic-ai/modernbert-embed-base)
  • Embedding Dimension: 768
  • Max Sequence Length: 768 tokens
  • Parameters: ~110M

Training Configuration

  • Loss Function: MultipleNegativesRankingLoss (sentence-transformers)
  • Training Data: 26,473,900 biomedical text pairs
  • Epochs: 3
  • Effective Batch Size: 256 (32 per GPU × 4 GPUs × 2 gradient accumulation steps)
  • Learning Rate: 1e-5 with cosine scheduling
  • Optimization: AdamW with weight decay (0.01)
  • Precision: FP16 for efficiency
  • Hardware: 4x Tesla V100-DGXS-32GB
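
The exact training script is not published in this card. As a rough sketch, the configuration above maps onto the classic sentence-transformers fit API along these lines; this is a single-GPU view in which `pairs` is a placeholder for the (title, abstract) tuples, and multi-GPU data parallelism plus gradient accumulation are omitted:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative sketch only: the actual training script is not part of this card.
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# `pairs` is a hypothetical list of (title, abstract) tuples.
train_examples = [InputExample(texts=[title, abstract]) for title, abstract in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    scheduler="warmupcosine",          # cosine learning-rate schedule
    optimizer_params={"lr": 1e-5},     # AdamW is the default optimizer
    weight_decay=0.01,
    max_grad_norm=5.0,                 # gradient clipping
    use_amp=True,                      # FP16 mixed precision
    output_path="soda-vec-negative-sampling",
)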

Dataset

Source Data

  • Origin: Complete PubMed baseline (July 2024)
  • Content: Scientific abstracts and titles from biomedical literature
  • Quality: 99.7% retention after filtering (128-6,000 character abstracts)
  • Splits: 99.6% train / 0.2% validation / 0.2% test

Data Processing

  • Error pattern removal and quality filtering
  • Balanced train/validation/test splits
  • Character length filtering for optimal training
  • Duplicate detection and removal
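
The full preprocessing pipeline (including the specific error patterns that were removed) is not published here. The sketch below only illustrates the character-length filtering and exact-duplicate removal steps; function names are illustrative, and the thresholds are taken from the figures above:

import hashlib

def keep_record(title: str, abstract: str) -> bool:
    """Apply character-length filtering (128-6,000 characters for abstracts)."""
    return 128 <= len(abstract) <= 6000 and len(title) > 0

def deduplicate(records):
    """Drop exact duplicates based on a hash of the (title, abstract) pair."""
    seen = set()
    for title, abstract in records:
        key = hashlib.md5((title + "\x00" + abstract).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield title, abstract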

Performance & Use Cases

Intended Applications

  • Literature Search: Semantic search across biomedical publications
  • Research Discovery: Finding related papers and concepts
  • Knowledge Mining: Extracting relationships from scientific text
  • Document Classification: Categorizing biomedical documents
  • Similarity Analysis: Comparing research abstracts and papers

Biomedical Domains

  • Molecular Biology
  • Clinical Medicine
  • Pharmacology
  • Genetics & Genomics
  • Biochemistry
  • Neuroscience
  • Public Health

Usage

Installation

pip install sentence-transformers

Basic Usage

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('EMBO/soda-vec-negative-sampling')

# Encode biomedical texts
texts = [
    "CRISPR-Cas9 gene editing in human embryos",
    "mRNA vaccine efficacy against COVID-19 variants",
    "Protein folding mechanisms in neurodegenerative diseases"
]

embeddings = model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 768)

Semantic Search

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Query and corpus
query = "Alzheimer's disease biomarkers"
corpus = [
    "Tau protein aggregation in neurodegeneration",
    "COVID-19 vaccine development strategies", 
    "Beta-amyloid plaques in dementia patients"
]

# Encode
query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Find most similar
similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
best_match = np.argmax(similarities)
print(f"Best match: {corpus[best_match]} (similarity: {similarities[best_match]:.3f})")

Training Details

Loss Function

The model uses MultipleNegativesRankingLoss, which:

  • Treats all other samples in a batch as negatives
  • Optimizes for high similarity between related texts
  • Provides robust contrastive learning without requiring explicitly mined hard negatives
  • Is well established in the sentence-transformers ecosystem
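
For intuition, the loss is conceptually equivalent to a cross-entropy over the in-batch similarity matrix: for anchor i, positive i is the correct class and every other positive in the batch acts as a negative. A minimal re-implementation (using the sentence-transformers defaults of cosine similarity and a scale factor of 20) looks roughly like this:

import torch
import torch.nn.functional as F

def mnrl(anchor_emb: torch.Tensor, positive_emb: torch.Tensor, scale: float = 20.0):
    """Conceptual re-implementation of MultipleNegativesRankingLoss."""
    # Cosine similarity matrix between all anchors and all positives
    anchor_emb = F.normalize(anchor_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)
    scores = anchor_emb @ positive_emb.T * scale            # (batch, batch)
    # For anchor i, the matching positive sits on the diagonal
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)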

Training Process

  • Duration: ~4 days on 4x V100 GPUs
  • Steps: 310,239 total training steps
  • Evaluation: Every 1000 steps (310 evaluations, 1.8% overhead)
  • Monitoring: Real-time TensorBoard logging
  • Checkpointing: Model saved at end of each epoch

Optimization Features

  • Gradient clipping (max_norm=5.0) for training stability
  • Weight decay regularization for generalization
  • Cosine learning rate scheduling
  • Loss-only evaluation for efficiency
  • Reproducible training (seed=42)
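
For readers reproducing these settings outside the sentence-transformers fit API, the options above correspond to standard PyTorch/transformers utilities. The snippet below is illustrative only; the stand-in model and the warmup step count are placeholders, not values from the actual run:

import torch
from transformers import get_cosine_schedule_with_warmup, set_seed

set_seed(42)                                   # reproducible training

model = torch.nn.Linear(768, 768)              # stand-in for the sentence encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=310_239
)

# One training step
loss = model(torch.randn(32, 768)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # gradient clipping
optimizer.step()
scheduler.step()
optimizer.zero_grad()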

Technical Specifications

Hardware Requirements

  • Training: 4x Tesla V100-DGXS-32GB (recommended)
  • Inference: Any GPU with 4GB+ VRAM, or CPU
  • Memory: ~2GB GPU memory for inference

Software Dependencies

  • sentence-transformers >= 2.0.0
  • transformers >= 4.20.0
  • torch >= 1.12.0
  • Python >= 3.8

Comparison with SODA-VEC (VICReg)

| Feature | SODA-VEC (VICReg) | SODA-VEC Negative Sampling |
|---|---|---|
| Loss Function | VICReg (custom biomedical) | MultipleNegativesRankingLoss |
| Optimization | Empirically tuned coefficients | Standard contrastive learning |
| Training Data | Same (26.5M pairs) | Same (26.5M pairs) |
| Use Case | Biomedical research focus | General semantic similarity |
| Framework | Custom implementation | sentence-transformers standard |

Limitations

  • Domain Specificity: Optimized for biomedical text, may not generalize to other domains
  • Language: English-only training data
  • Recency: Training data cutoff at July 2024
  • Bias: May reflect biases present in PubMed literature

Citation

If you use this model in your research, please cite:

@misc{soda-vec-negative-sampling-2024,
  title={SODA-VEC Negative Sampling: Biomedical Sentence Embeddings},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-negative-sampling},
  note={Trained on 26.5M PubMed text pairs using MultipleNegativesRankingLoss}
}

License

This model is released under the same license as the base ModernBERT model. Please refer to the original model card for licensing details.

Acknowledgments

  • Base Model: nomic-ai/modernbert-embed-base
  • Training Framework: sentence-transformers
  • Data Source: PubMed/MEDLINE database
  • Infrastructure: EMBO computational resources

Model Card Contact

For questions about this model, please contact EMBO or open an issue in the associated repository.


Last Updated: August 2024
Model Version: 1.0
Training Completion: In Progress (ETA: 4 days)
