
T-VEC: A Telecom-Specific Text Embedding Model

Overview

T-VEC (Telecom Vectorization Model) is a domain-adapted text embedding model developed by NetoAI and fine-tuned from Alibaba-NLP/gte-Qwen2-1.5B-instruct. Using a deep triplet-loss fine-tuning approach, T-VEC learns semantic representations tailored to telecom use cases, outperforming its base model on telecom-specific benchmarks while retaining competitive performance on standard benchmarks.

Model Details

  • Model Name: T-VEC
  • Developer: NetoAI
  • Base Model: Alibaba-NLP/gte-Qwen2-1.5B-instruct
  • Parameters: 1.5 Billion
  • Embedding Dimension: 1536
  • Max Input Tokens: 32,000
  • Languages: Multilingual (optimized for English)
  • License: MIT
  • Tokenizer: Custom telecom-specific tokenizer (open-source)

Intended Uses

  • Semantic search over telecom documents (3GPP standards, vendor manuals)
  • Fault log analysis for root-cause detection
  • Telecom-specific chatbots and Q&A systems
  • Regulatory compliance analysis and semantic auditing

Training Details

  • Objective: Triplet loss using cosine similarity (a minimal sketch follows this list)
  • Dataset: 100k+ telecom triplets curated by domain experts over 3 months
  • Layer Modification: 338 transformer layers fine-tuned
  • Avg. L2 Norm Weight Change: 0.7735
  • Enhancements: Telecom-specific tokenizer and query-aware anchor strategies
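
For reference, the sketch below shows a cosine-similarity triplet objective of the kind described above, in PyTorch. The margin value and the random toy embeddings are illustrative assumptions, not values taken from the T-VEC training configuration.

import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    # Push the anchor closer to the positive than to the negative
    # by at least `margin` in cosine-similarity space.
    # The margin of 0.2 is an illustrative assumption.
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    return torch.relu(sim_neg - sim_pos + margin).mean()

# Toy embeddings standing in for encoded telecom triplets
# (anchor query, matching passage, unrelated passage).
anchor = torch.randn(4, 1536)
positive = torch.randn(4, 1536)
negative = torch.randn(4, 1536)
print(cosine_triplet_loss(anchor, positive, negative))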

Evaluation Results

| Benchmark                 | Metric               | Score  |
|---------------------------|----------------------|--------|
| Telecom Triplet Benchmark | Accuracy             | 0.9380 |
| MTEB Benchmark            | Accuracy             | 0.825  |
| STS Benchmark             | Spearman Correlation | 82.19  |
| AllNLI Triplet            | Accuracy             | 0.6150 |

T-VEC significantly outperforms both its base model and other strong general-purpose models on telecom-specific benchmarks, while still retaining competitive general performance.

| Model | ArguAna | SciDocsRR | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark |
|---|---|---|---|---|---|---|---|---|
| gte‑Qwen2‑1.5B‑instruct | 0.62335 | 0.81558 | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379 |
| T‑VEC | 0.61150 | 0.83970 | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050 |
| all‑MiniLM‑L6‑v2 | 0.50167 | 0.87119 | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032 |
| all‑mpnet‑base‑v2 | 0.46521 | 0.88654 | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422 |
| bge‑base‑en‑v1.5 | 0.63616 | 0.87494 | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418 |
| e5‑base‑v2 | 0.51604 | 0.82834 | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480 |
| jina‑embeddings‑v2‑base‑en | 0.44152 | 0.83106 | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842 |
| instructor‑xl | 0.54884 | 0.79538 | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048 |
| gte‑base | 0.57151 | 0.87083 | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738 |
| multilingual‑e5‑base | 0.47829 | 0.80392 | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201 |


Limitations

  • Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization
  • Large size may impact deployment on edge devices
  • May miss recent telecom developments outside the training set

Ethical Considerations

  • Use in critical telecom systems should be validated by domain experts
  • May reflect terminology biases from dominant vendors in the dataset
  • Open licensing (MIT) supports transparency and community contributions

Usage

Installation

pip install transformers torch

Load and Run

from transformers import AutoModel, AutoTokenizer
import torch

# Repository id as listed on the Hugging Face Hub
model = AutoModel.from_pretrained("NetoAISolutions/T-VEC")
tokenizer = AutoTokenizer.from_pretrained("NetoAISolutions/T-VEC")

texts = ["5G NR architecture", "LTE handover", "Core network functions"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=32000)

# Mean-pool over real tokens only, ignoring padding positions
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 1536)
mask = inputs["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity of the first text against the other two
cos_sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:], dim=1)
print(cos_sim)
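
Building on the model and tokenizer loaded above, the sketch below shows a minimal semantic-search loop for the telecom retrieval use case: embed a query and a few candidate passages, then rank the passages by cosine similarity. The embed helper, query, and passages are illustrative assumptions, not part of a released API.

# Rank candidate telecom passages against a query by cosine similarity.
# embed() reuses the model/tokenizer loaded above; names and texts are illustrative.
def embed(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=32000)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query = "Which procedure hands a UE over between gNBs?"
docs = [
    "Xn-based inter-gNB handover procedure in 5G NR",
    "PDU session establishment in the 5G core",
    "Alarm correlation rules for optical transport faults",
]

q_emb = embed([query])
d_emb = embed(docs)
scores = torch.nn.functional.cosine_similarity(q_emb, d_emb, dim=1)
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")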

Citation

@article{ethiraj2025tvec,
  title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning},
  author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya},
  journal={arXiv preprint},
  year={2025},
  url={https://arxiv.org/abs/2504.16460}
}

References

  • Ethiraj, V., Menon, S., Vijay, D. “T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning.” arXiv:2504.16460, 2025.
  • Schroff, F., Kalenichenko, D., Philbin, J. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR, 2015.
  • Hermans, A., Beyer, L., Leibe, B. “In Defense of the Triplet Loss for Person Re-Identification.” arXiv:1703.07737, 2017.
  • Reimers, N., Gurevych, I. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP, 2019.
  • Gao, T., Yao, X., Chen, D. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv:2104.08821, 2021.
  • Gururangan, S., et al. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks.” ACL, 2020.
  • Lee, J., Yoon, W., et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics, 2020.
  • Sahu, S. K., Maheshwari, A. “Automatic extraction of telecom network events from log messages.” IEEE ICC, 2018.
  • Wang, X., Li, Y., Han, J. “Log2Vec: A Deep Embedding Model for Network Log Analysis.” IEEE/IFIP DSN, 2021.

Contact

