SemCSE Model Card

The SemCSE model is an embedding model for scientific abstracts and, more generally, sentences; it can be used for clustering, retrieval, and many other embedding-based applications. The novelty of our approach lies in its focus on embeddings that accurately reflect the semantics of a paper, a property that many existing approaches, trained on citation information, lack.

The novel, semantically oriented training procedure leads to state-of-the-art results on our novel semantic embedding benchmark (please see our paper for details), as well as to state-of-the-art results among models of its size on the established SciRepEval benchmark.

Note that this model uses Euclidean distance to compute similarity in its embedding space. If you prefer cosine similarity, please use this version of the model instead.

Model Details

Model Description

  • Developed by: CLAUSE group at Bielefeld University
  • Model type: DeBERTa v2
  • Languages: Mostly English
  • Finetuned from model: KISTI-AI/Scideberta-full

How to Get Started with the Model

A minimal example of how to create embeddings with our model:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("CLAUSE-Bielefeld/SemCSE")
model = AutoModel.from_pretrained("CLAUSE-Bielefeld/SemCSE")

text = "Your text to be embedded."
batch = tokenizer([text], return_tensors="pt")
with torch.no_grad():
    # The embedding is the final hidden state of the [CLS] token.
    embedding = model(**batch)["last_hidden_state"][0, 0]
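Because the embedding space is trained with Euclidean distance, candidates should be ranked by distance rather than cosine similarity. A minimal sketch using placeholder vectors (hypothetical values standing in for real SemCSE embeddings):

```python
import torch

# Placeholder embeddings; in practice these come from the model as shown above.
query = torch.tensor([[0.0, 1.0, 0.0]])
candidates = torch.tensor([
    [0.1, 0.9, 0.0],   # close to the query
    [5.0, -3.0, 2.0],  # far from the query
])

# SemCSE uses Euclidean distance: a smaller distance means higher similarity.
distances = torch.cdist(query, candidates)[0]
ranking = torch.argsort(distances)  # indices sorted from most to least similar
```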

Training Details

This model was trained on a dataset of LLM-generated summaries of 350K scientific abstracts from various domains. We used a triplet loss that encourages summaries of the same abstract to be placed close together in the embedding space. The dataset and exact training procedure can be found in our GitHub repo.
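The triplet objective can be sketched as follows. This is a standard Euclidean triplet loss; the margin value is an illustrative assumption, and the exact formulation is given in the paper and GitHub repo:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull the positive (a summary of the same abstract) toward the anchor,
    # and push the negative (a summary of a different abstract) at least
    # `margin` further away in Euclidean distance.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```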

Evaluation

We introduce a novel semantic scientific embedding benchmark:

| Model | Params | Title-Abstract ↓ | Abstract-Segments ↓ | Query ↓ | Clustering ↑ | Perf. ↑ |
|---|---|---|---|---|---|---|
| SciBERT | 109M | 807.74 | 214.37 | 213.45 | 0.569 | 0.000 |
| SciDeBERTa | 183M | 1479.09 | 861.55 | 2465.26 | 0.460 | 0.000 |
| SPECTER | 109M | 10.25 | 12.23 | 2.18 | 0.692 | 0.119 |
| SciNCL | 109M | 5.68 | 7.35 | 2.29 | 0.702 | 0.357 |
| SPECTER2 (base) | 109M | 4.52 | 5.10 | 1.17 | 0.666 | 0.553 |
| SPECTER2 (proximity) | 110M | 5.34 | 5.80 | 1.46 | 0.666 | 0.395 |
| all-MiniLM-L6-v2 | 22M | <u>3.09</u> | 8.19 | 1.11 | <u>0.730</u> | 0.771 |
| Jina-v2 | 137M | 3.29 | 8.77 | 1.29 | 0.703 | 0.600 |
| Jina-v3 | 572M | 3.45 | 6.96 | **1.01** | 0.719 | 0.783 |
| RoBERTa SimCSE | 355M | 23.71 | 44.24 | 8.92 | 0.696 | 0.116 |
| NvEmbed-V2 | 7.9B | 3.38 | <u>3.84</u> | <u>1.02</u> | 0.721 | <u>0.866</u> |
| SemCSE (Ours) | 183M | **2.47** | **2.68** | 1.23 | **0.739** | **0.925** |

Notes:

  • Bold = Best result
  • Underlined = Second-best result
  • ↓ = Lower is better (ranking-based tasks)
  • ↑ = Higher is better (clustering and overall performance)

We also evaluate SemCSE on the SciRepEval benchmark:

| Model | Parameters | Classification ↑ | Regression ↑ | Proximity ↑ | Search ↑ | Average ↑ |
|---|---|---|---|---|---|---|
| SciBERT | 109M | 63.86 | 27.34 | 66.25 | 68.19 | 57.42 |
| SciDeBERTa | 183M | 60.99 | 27.00 | 62.74 | 67.83 | 55.18 |
| SPECTER | 109M | 67.73 | 25.37 | 80.05 | 74.89 | 64.28 |
| SciNCL | 109M | <u>68.04</u> | 25.22 | <u>81.18</u> | 77.32 | 65.08 |
| SPECTER2 base | 109M | 66.95 | <u>27.75</u> | 81.10 | 78.42 | 65.46 |
| SPECTER2 proximity | 110M | 66.37 | 26.85 | **81.41** | 77.75 | 65.15 |
| all-MiniLM-L6-v2 | 22M | 64.04 | 20.06 | 80.74 | 79.63 | 63.05 |
| jina-v2 | 137M | 63.99 | 23.76 | 80.11 | 80.40 | 63.69 |
| jina-v3 | 572M | 65.66 | 24.84 | 79.98 | <u>80.60</u> | 64.34 |
| RoBERTa SimCSE | 355M | 67.16 | 22.95 | 75.51 | 76.97 | 62.10 |
| NvEmbed-V2 | 7.9B | 65.62 | **29.94** | 81.16 | **82.84** | **66.19** |
| SemCSE (Ours) | 183M | **69.52** | 27.58 | 80.21 | 78.56 | <u>65.76</u> |

Notes:

  • Bold = Best result
  • Underlined = Second-best result
  • ↑ = Higher is better

Citation

BibTeX:

@misc{brinner2025semcsesemanticcontrastivesentence,
      title={SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts}, 
      author={Marc Brinner and Sina Zarriess},
      year={2025},
      eprint={2507.13105},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.13105}, 
}