# SemCSE Model Card
SemCSE is an embedding model for scientific abstracts and sentences in general, usable for clustering, retrieval, and many other embedding-based applications. The novelty of our approach is its focus on embeddings that accurately reflect a paper's semantics, a property that many existing approaches trained on citation information lack.
This semantically oriented training procedure leads to state-of-the-art results on our novel semantic embedding benchmark (please see our paper for details), as well as state-of-the-art results among models of its size on the established SciRepEval benchmark.
Note that this model uses Euclidean distance to compute similarity in its embedding space. If you prefer cosine similarity, please use this version of the model.
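The choice of metric matters because Euclidean distance and cosine similarity can rank the same candidates differently. A minimal sketch with toy 2-dimensional vectors (illustrative values, not actual model outputs):

```python
import math

def euclidean(u, v):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine(u, v):
    """Cosine similarity: larger means more similar."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

query = [1.0, 0.0]
a = [10.0, 0.0]  # same direction as the query, but far away
b = [0.0, 1.0]   # different direction, but geometrically closer

# Cosine ranks a highest (identical direction); Euclidean ranks b highest.
```

Because the two metrics disagree on such cases, embeddings tuned for one metric should be compared with that metric.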
## Model Details
### Model Description
- Developed by: CLAUSE group at Bielefeld University
- Model type: DeBERTa v2
- Languages: Mostly English
- Finetuned from model: KISTI-AI/Scideberta-full
### Model Sources
- Repository: github.com/inas-argumentation/SemCSE
- Paper: https://arxiv.org/abs/2507.13105
## How to Get Started with the Model
A minimal example of how to create embeddings with our model:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("CLAUSE-Bielefeld/SemCSE")
model = AutoModel.from_pretrained("CLAUSE-Bielefeld/SemCSE")

text = "Your text to be embedded."
batch = tokenizer([text], return_tensors="pt")

# The embedding is the final hidden state of the [CLS] token.
with torch.no_grad():
    embedding = model(**batch)["last_hidden_state"][0, 0]
```
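For retrieval, you would embed a corpus once and rank its entries by Euclidean distance to the query embedding (smaller distance means more similar). A minimal sketch with made-up 3-dimensional vectors standing in for model outputs (the keys and values are purely illustrative):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Hypothetical stand-ins for embeddings produced by the model above.
corpus = {
    "paper_a": [0.1, 0.9, 0.2],
    "paper_b": [0.8, 0.1, 0.5],
    "paper_c": [0.3, 0.7, 0.1],
}
query_embedding = [0.15, 0.85, 0.15]

# Rank corpus entries by distance to the query; the first entry is the best match.
ranked = sorted(corpus, key=lambda k: euclidean(query_embedding, corpus[k]))
```

With real SemCSE embeddings, the vectors would be 768-dimensional tensors taken from the `[CLS]` position as in the example above.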
## Training Details
This model was trained on a dataset of LLM-generated summaries for 350K scientific abstracts from various domains. We used a triplet loss to encourage summaries of the same abstract to be placed close together in the embedding space. The dataset and exact training procedure can be found in our GitHub repo.
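A generic margin-based triplet loss over Euclidean distances can be sketched as follows; the `margin` value and vectors are illustrative, and the exact formulation used for SemCSE is given in the paper and repo:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull the positive (e.g. a summary of the same abstract) closer to the
    # anchor than the negative (a summary of a different abstract),
    # by at least `margin`; zero loss once the constraint is satisfied.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor   = [0.0, 0.0]
positive = [0.1, 0.0]  # near the anchor
negative = [3.0, 0.0]  # far from the anchor

loss = triplet_loss(anchor, positive, negative)  # constraint satisfied, loss is 0
```

In training, gradients of this loss move summary embeddings of the same abstract together and push those of different abstracts apart.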
## Evaluation
We introduce a novel semantic scientific embedding benchmark:
Model | Params | Title-Abstract ↓ | Abstract-Segments ↓ | Query ↓ | Clustering ↑ | Perf. ↑ |
---|---|---|---|---|---|---|
SciBERT | 109M | 807.74 | 214.37 | 213.45 | 0.569 | 0.000 |
SciDeBERTa | 183M | 1479.09 | 861.55 | 2465.26 | 0.460 | 0.000 |
SPECTER | 109M | 10.25 | 12.23 | 2.18 | 0.692 | 0.119 |
SciNCL | 109M | 5.68 | 7.35 | 2.29 | 0.702 | 0.357 |
SPECTER2 (base) | 109M | 4.52 | 5.10 | 1.17 | 0.666 | 0.553 |
SPECTER2 (proximity) | 110M | 5.34 | 5.80 | 1.46 | 0.666 | 0.395 |
all-MiniLM-L6-v2 | 22M | <u>3.09</u> | 8.19 | 1.11 | <u>0.730</u> | 0.771 |
Jina-v2 | 137M | 3.29 | 8.77 | 1.29 | 0.703 | 0.600 |
Jina-v3 | 572M | 3.45 | 6.96 | **1.01** | 0.719 | 0.783 |
RoBERTa SimCSE | 355M | 23.71 | 44.24 | 8.92 | 0.696 | 0.116 |
NvEmbed-V2 | 7.9B | 3.38 | <u>3.84</u> | <u>1.02</u> | 0.721 | <u>0.866</u> |
SemCSE (Ours) | 183M | **2.47** | **2.68** | 1.23 | **0.739** | **0.925** |
Notes:
- **Bold** = Best result
- <u>Underlined</u> = Second-best result
- ↓ = Lower is better (ranking-based tasks)
- ↑ = Higher is better (clustering and overall performance)
We also evaluate SemCSE on the SciRepEval benchmark:
Model | Parameters | Classification ↑ | Regression ↑ | Proximity ↑ | Search ↑ | Average ↑ |
---|---|---|---|---|---|---|
SciBERT | 109M | 63.86 | 27.34 | 66.25 | 68.19 | 57.42 |
SciDeBERTa | 183M | 60.99 | 27.00 | 62.74 | 67.83 | 55.18 |
SPECTER | 109M | 67.73 | 25.37 | 80.05 | 74.89 | 64.28 |
SciNCL | 109M | <u>68.04</u> | 25.22 | <u>81.18</u> | 77.32 | 65.08 |
SPECTER2 base | 109M | 66.95 | <u>27.75</u> | 81.10 | 78.42 | 65.46 |
SPECTER2 proximity | 110M | 66.37 | 26.85 | **81.41** | 77.75 | 65.15 |
all-MiniLM-L6-v2 | 22M | 64.04 | 20.06 | 80.74 | 79.63 | 63.05 |
jina-v2 | 137M | 63.99 | 23.76 | 80.11 | 80.40 | 63.69 |
jina-v3 | 572M | 65.66 | 24.84 | 79.98 | <u>80.60</u> | 64.34 |
RoBERTa SimCSE | 355M | 67.16 | 22.95 | 75.51 | 76.97 | 62.10 |
NvEmbed-V2 | 7.9B | 65.62 | **29.94** | 81.16 | **82.84** | **66.19** |
SemCSE (Ours) | 183M | **69.52** | 27.58 | 80.21 | 78.56 | <u>65.76</u> |
Notes:
- **Bold** = Best result
- <u>Underlined</u> = Second-best result
- ↑ = Higher is better
## Citation
BibTeX:
```bibtex
@misc{brinner2025semcsesemanticcontrastivesentence,
      title={SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts},
      author={Marc Brinner and Sina Zarriess},
      year={2025},
      eprint={2507.13105},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.13105},
}
```