T-VEC: A Telecom-Specific Text Embedding Model
Overview
T-VEC (Telecom Vectorization Model) is a domain-adapted text embedding model developed by NetoAI, fine-tuned from Alibaba-NLP/gte-Qwen2-1.5B-instruct with a deep triplet-loss objective. It learns semantic representations tailored to telecom use cases, achieving state-of-the-art results on telecom-specific benchmarks while remaining competitive on standard ones.
Model Details
- Model Name: T-VEC
- Developer: NetoAI
- Base Model: Alibaba-NLP/gte-Qwen2-1.5B-instruct
- Parameters: 1.5 Billion
- Embedding Dimension: 1536
- Max Input Tokens: 32,000
- Languages: Multilingual (optimized for English)
- License: MIT
- Tokenizer: Custom telecom-specific tokenizer (open-source)
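To see the telecom-specific tokenizer in action, one can tokenize a domain phrase and inspect the resulting pieces. This is a minimal sketch, assuming the repository ID NetoAISolutions/T-VEC used elsewhere on this page; the example phrase is illustrative.

```python
from transformers import AutoTokenizer

# Load the open-source telecom-specific tokenizer that ships with the model
# (repository ID assumed from this model page).
tokenizer = AutoTokenizer.from_pretrained("NetoAISolutions/T-VEC")

# Illustrative telecom phrase: inspect how domain terms are segmented.
print(tokenizer.tokenize("gNodeB handover over the Xn interface in 5G NR"))
```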
Intended Uses
- Semantic search over telecom documents (3GPP standards, vendor manuals)
- Fault log analysis for root-cause detection
- Telecom-specific chatbots and Q&A systems
- Regulatory compliance analysis and semantic auditing
Training Details
- Objective: Triplet loss using cosine similarity (a minimal sketch follows this list)
- Dataset: 100k+ telecom triplets curated by domain experts over 3 months
- Layer Modification: 338 transformer layers fine-tuned
- Avg. L2 Norm Weight Change: 0.7735
- Enhancements: Telecom-specific tokenizer and query-aware anchor strategies
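For reference, below is a minimal sketch of a triplet objective built on cosine similarity; the margin value and function signature are illustrative assumptions, not the published training configuration.

```python
# Minimal sketch of a cosine-similarity triplet loss (margin value is assumed).
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.25):
    """anchor, positive, negative: (batch, dim) embedding tensors."""
    pos_dist = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)
    neg_dist = 1.0 - F.cosine_similarity(anchor, negative, dim=-1)
    # Pull anchors toward positives and push them away from negatives by at least the margin.
    return torch.clamp(pos_dist - neg_dist + margin, min=0.0).mean()
```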
Evaluation Results
Benchmark | Metric | Score |
---|---|---|
Telecom Triplet Benchmark | Accuracy | 0.9380 |
MTEB Benchmark | Accuracy | 0.825 |
STS Benchmark | Spearman Correlation (×100) | 82.19 |
AllNLI Triplet | Accuracy | 0.6150 |
T-VEC significantly outperforms both its base model and other strong general-purpose models on telecom-specific benchmarks, while still retaining competitive general performance.
Model | ArguAna | SciDocsRR | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark |
---|---|---|---|---|---|---|---|---|
gte‑Qwen2‑1.5B‑instruct | 0.62335 | 0.81558 | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379 |
T‑VEC | 0.61150 | 0.83970 | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050 |
all‑MiniLM‑L6‑v2 | 0.50167 | 0.87119 | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032 |
all‑mpnet‑base‑v2 | 0.46521 | 0.88654 | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422 |
bge‑base‑en‑v1.5 | 0.63616 | 0.87494 | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418 |
e5‑base‑v2 | 0.51604 | 0.82834 | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480 |
jina‑embeddings‑v2‑base‑en | 0.44152 | 0.83106 | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842 |
instructor‑xl | 0.54884 | 0.79538 | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048 |
gte‑base | 0.57151 | 0.87083 | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738 |
multilingual‑e5‑base | 0.47829 | 0.80392 | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201 |
Limitations
- Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization
- Large size may impact deployment on edge devices
- May miss recent telecom developments outside the training set
Ethical Considerations
- Use in critical telecom systems should be validated by domain experts
- May reflect terminology biases from dominant vendors in the dataset
- Open licensing (MIT) supports transparency and community contributions
Usage
Installation
pip install transformers torch
Load and Run
from transformers import AutoModel, AutoTokenizer
import torch

# Repository ID as listed on this model page.
model = AutoModel.from_pretrained("NetoAISolutions/T-VEC")
tokenizer = AutoTokenizer.from_pretrained("NetoAISolutions/T-VEC")
texts = ["5G NR architecture", "LTE handover", "Core network functions"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=32000)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 1536)
# Mean-pool over real tokens only, so padding positions do not skew the embeddings.
mask = inputs["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
# Cosine similarity of the first text against the remaining ones.
cos_sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:], dim=1)
print(cos_sim)
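As a usage example, the same model and tokenizer can drive a small semantic-search loop over telecom passages. This is a minimal sketch: the query, the passages, and the embed helper are illustrative and not part of any released tooling.

```python
import torch
import torch.nn.functional as F

def embed(texts):
    """Masked mean-pooled embeddings for a list of strings (reuses model/tokenizer loaded above)."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=32000)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Illustrative corpus of telecom passages and a query to rank them against.
passages = [
    "3GPP TS 38.300 describes the overall NR and NG-RAN architecture.",
    "X2-based handover procedure between eNodeBs in LTE.",
    "AMF and SMF are control-plane functions of the 5G core.",
]
query = "Which document covers the 5G RAN architecture?"

# Rank passages by cosine similarity to the query embedding.
scores = F.cosine_similarity(embed([query]), embed(passages), dim=1)
for idx in scores.argsort(descending=True):
    print(f"{scores[idx]:.3f}  {passages[idx]}")
```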
Citation
@article{ethiraj2025tvec,
title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning},
author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya},
journal={arXiv preprint},
year={2025},
url={https://arxiv.org/abs/2504.16460}
}
References
- Ethiraj, V., Menon, S., Vijay, D. “T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning.” arXiv:2504.16460, 2025.
- Schroff, F., Kalenichenko, D., Philbin, J. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR, 2015.
- Hermans, A., Beyer, L., Leibe, B. “In Defense of the Triplet Loss for Person Re-Identification.” arXiv:1703.07737, 2017.
- Reimers, N., Gurevych, I. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP, 2019.
- Gao, T., Yao, X., Chen, D. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv:2104.08821, 2021.
- Gururangan, S., et al. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks.” ACL, 2020.
- Lee, J., Yoon, W., et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics, 2020.
- Sahu, S. K., Maheshwari, A. “Automatic extraction of telecom network events from log messages.” IEEE ICC, 2018.
- Wang, X., Li, Y., Han, J. “Log2Vec: A Deep Embedding Model for Network Log Analysis.” IEEE/IFIP DSN, 2021.
Contact
- For questions or contributions, visit https://www.netoai.ai.