SemCSE Model Card

The SemCSE model is an embedding model for scientific abstracts and, more generally, sentences; it can be used for clustering, retrieval, and many other embedding-based applications. The novelty of our approach lies in its focus on embeddings that accurately reflect the semantics of a paper, a property that many existing approaches, trained on citation information, lack.

The novel, semantically oriented training procedure leads to state-of-the-art results on our novel semantic embedding benchmark (please see our paper for details), as well as to state-of-the-art results among models of its size on the established SciRepEval benchmark.

Note that this model uses Euclidean distance to compute similarity in its embedding space. If you prefer cosine similarity, please use this version of the model instead.

Model Details

Model Description

  • Developed by: CLAUSE group at Bielefeld University
  • Model type: DeBERTa v2
  • Languages: Mostly English
  • Finetuned from model: KISTI-AI/Scideberta-full

How to Get Started with the Model

A minimal example of how to create embeddings with our model:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("CLAUSE-Bielefeld/SemCSE")
model = AutoModel.from_pretrained("CLAUSE-Bielefeld/SemCSE")

text = "Your text to be embedded."
batch = tokenizer([text], return_tensors="pt")
with torch.no_grad():
    # The embedding is the final hidden state of the [CLS] token.
    embedding = model(**batch)["last_hidden_state"][0, 0]
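Because the embedding space is trained with Euclidean distance, candidates should be ranked by distance rather than cosine similarity. A minimal sketch using placeholder vectors (hypothetical values standing in for real SemCSE embeddings):

```python
import torch

# Placeholder embeddings; in practice these come from the model as shown above.
query = torch.tensor([[0.0, 1.0, 0.0]])
candidates = torch.tensor([
    [0.1, 0.9, 0.0],   # close to the query
    [5.0, -3.0, 2.0],  # far from the query
])

# SemCSE uses Euclidean distance: a smaller distance means higher similarity.
distances = torch.cdist(query, candidates)[0]
ranking = torch.argsort(distances)  # indices sorted from most to least similar
```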

Training Details

This model was trained on a dataset of LLM-generated summaries of 350K scientific abstracts from various domains. We used a triplet loss that encourages summaries of the same abstract to be placed close together in the embedding space. The dataset and exact training procedure can be found in our GitHub repo.
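The triplet objective can be sketched as follows. This is a standard Euclidean triplet loss; the margin value is an illustrative assumption, and the exact formulation is given in the paper and GitHub repo:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull the positive (a summary of the same abstract) toward the anchor,
    # and push the negative (a summary of a different abstract) at least
    # `margin` further away in Euclidean distance.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```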

Evaluation

We introduce a novel semantic scientific embedding benchmark:

| Model | Params | Title-Abstract ↓ | Abstract-Segments ↓ | Query ↓ | Clustering ↑ | Perf. ↑ |
|---|---|---|---|---|---|---|
| SciBERT | 109M | 807.74 | 214.37 | 213.45 | 0.569 | 0.000 |
| SciDeBERTa | 183M | 1479.09 | 861.55 | 2465.26 | 0.460 | 0.000 |
| SPECTER | 109M | 10.25 | 12.23 | 2.18 | 0.692 | 0.119 |
| SciNCL | 109M | 5.68 | 7.35 | 2.29 | 0.702 | 0.357 |
| SPECTER2 (base) | 109M | 4.52 | 5.10 | 1.17 | 0.666 | 0.553 |
| SPECTER2 (proximity) | 110M | 5.34 | 5.80 | 1.46 | 0.666 | 0.395 |
| all-MiniLM-L6-v2 | 22M | <u>3.09</u> | 8.19 | 1.11 | <u>0.730</u> | 0.771 |
| Jina-v2 | 137M | 3.29 | 8.77 | 1.29 | 0.703 | 0.600 |
| Jina-v3 | 572M | 3.45 | 6.96 | **1.01** | 0.719 | 0.783 |
| RoBERTa SimCSE | 355M | 23.71 | 44.24 | 8.92 | 0.696 | 0.116 |
| NvEmbed-V2 | 7.9B | 3.38 | <u>3.84</u> | <u>1.02</u> | 0.721 | <u>0.866</u> |
| SemCSE (Ours) | 183M | **2.47** | **2.68** | 1.23 | **0.739** | **0.925** |

Notes:

  • Bold = Best result
  • Underlined = Second-best result
  • ↓ = Lower is better (ranking-based tasks)
  • ↑ = Higher is better (clustering and overall performance)

We also evaluate SemCSE on the SciRepEval benchmark:

| Model | Parameters | Classification ↑ | Regression ↑ | Proximity ↑ | Search ↑ | Average ↑ |
|---|---|---|---|---|---|---|
| SciBERT | 109M | 63.86 | 27.34 | 66.25 | 68.19 | 57.42 |
| SciDeBERTa | 183M | 60.99 | 27.00 | 62.74 | 67.83 | 55.18 |
| SPECTER | 109M | 67.73 | 25.37 | 80.05 | 74.89 | 64.28 |
| SciNCL | 109M | <u>68.04</u> | 25.22 | <u>81.18</u> | 77.32 | 65.08 |
| SPECTER2 base | 109M | 66.95 | <u>27.75</u> | 81.10 | 78.42 | 65.46 |
| SPECTER2 proximity | 110M | 66.37 | 26.85 | **81.41** | 77.75 | 65.15 |
| all-MiniLM-L6-v2 | 22M | 64.04 | 20.06 | 80.74 | 79.63 | 63.05 |
| jina-v2 | 137M | 63.99 | 23.76 | 80.11 | 80.40 | 63.69 |
| jina-v3 | 572M | 65.66 | 24.84 | 79.98 | <u>80.60</u> | 64.34 |
| RoBERTa SimCSE | 355M | 67.16 | 22.95 | 75.51 | 76.97 | 62.10 |
| NvEmbed-V2 | 7.9B | 65.62 | **29.94** | 81.16 | **82.84** | **66.19** |
| SemCSE (Ours) | 183M | **69.52** | 27.58 | 80.21 | 78.56 | <u>65.76</u> |

Notes:

  • Bold = Best result
  • Underlined = Second-best result
  • ↑ = Higher is better

Citation

BibTeX:

@misc{brinner2025semcsesemanticcontrastivesentence,
      title={SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts}, 
      author={Marc Brinner and Sina Zarriess},
      year={2025},
      eprint={2507.13105},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.13105}, 
}