
T-VEC: A Telecom-Specific Text Embedding Model

Overview

T-VEC (Telecom Vectorization Model) is a domain-adapted text embedding model developed by NetoAI and fine-tuned from Alibaba-NLP/gte-Qwen2-1.5B-instruct. Using a deep triplet-loss fine-tuning approach, T-VEC learns semantic representations tailored to telecom use cases, outperforming its base model on telecom-specific benchmarks while retaining competitive performance on standard benchmarks.

Model Details

  • Model Name: T-VEC
  • Developer: NetoAI
  • Base Model: Alibaba-NLP/gte-Qwen2-1.5B-instruct
  • Parameters: 1.5 Billion
  • Embedding Dimension: 1536
  • Max Input Tokens: 32,000
  • Languages: Multilingual (optimized for English)
  • License: MIT
  • Tokenizer: Custom telecom-specific tokenizer (open-source)

Intended Uses

  • Semantic search over telecom documents (3GPP standards, vendor manuals)
  • Fault log analysis for root-cause detection
  • Telecom-specific chatbots and Q&A systems
  • Regulatory compliance analysis and semantic auditing

Training Details

  • Objective: Triplet loss using cosine similarity (a minimal sketch follows this list)
  • Dataset: 100k+ telecom triplets curated by domain experts over 3 months
  • Layer Modification: 338 transformer layers fine-tuned
  • Avg. L2 Norm Weight Change: 0.7735
  • Enhancements: Telecom-specific tokenizer and query-aware anchor strategies
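
For reference, the sketch below shows a cosine-similarity triplet objective of the kind described above, in PyTorch. The margin value and the random toy embeddings are illustrative assumptions, not values taken from the T-VEC training configuration.

import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    # Push the anchor closer to the positive than to the negative
    # by at least `margin` in cosine-similarity space.
    # The margin of 0.2 is an illustrative assumption.
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    return torch.relu(sim_neg - sim_pos + margin).mean()

# Toy embeddings standing in for encoded telecom triplets
# (anchor query, matching passage, unrelated passage).
anchor = torch.randn(4, 1536)
positive = torch.randn(4, 1536)
negative = torch.randn(4, 1536)
print(cosine_triplet_loss(anchor, positive, negative))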

Evaluation Results

| Benchmark                 | Metric               | Score  |
|---------------------------|----------------------|--------|
| Telecom Triplet Benchmark | Accuracy             | 0.9380 |
| MTEB Benchmark            | Accuracy             | 0.825  |
| STS Benchmark             | Spearman Correlation | 82.19  |
| AllNLI Triplet            | Accuracy             | 0.6150 |

T-VEC significantly outperforms both its base model and other strong general-purpose models on telecom-specific benchmarks, while still retaining competitive general performance.

| Model | ArguAna | SciDocsRR | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark |
|---|---|---|---|---|---|---|---|---|
| gte‑Qwen2‑1.5B‑instruct | 0.62335 | 0.81558 | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379 |
| T‑VEC | 0.61150 | 0.83970 | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050 |
| all‑MiniLM‑L6‑v2 | 0.50167 | 0.87119 | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032 |
| all‑mpnet‑base‑v2 | 0.46521 | 0.88654 | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422 |
| bge‑base‑en‑v1.5 | 0.63616 | 0.87494 | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418 |
| e5‑base‑v2 | 0.51604 | 0.82834 | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480 |
| jina‑embeddings‑v2‑base‑en | 0.44152 | 0.83106 | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842 |
| instructor‑xl | 0.54884 | 0.79538 | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048 |
| gte‑base | 0.57151 | 0.87083 | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738 |
| multilingual‑e5‑base | 0.47829 | 0.80392 | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201 |


Limitations

  • Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization
  • Large size may impact deployment on edge devices
  • May miss recent telecom developments outside the training set

Ethical Considerations

  • Use in critical telecom systems should be validated by domain experts
  • May reflect terminology biases from dominant vendors in the dataset
  • Open licensing (MIT) supports transparency and community contributions

Usage

Installation

pip install transformers torch

Load and Run

from transformers import AutoModel, AutoTokenizer
import torch

# Repository id as listed on the Hugging Face Hub
model = AutoModel.from_pretrained("NetoAISolutions/T-VEC")
tokenizer = AutoTokenizer.from_pretrained("NetoAISolutions/T-VEC")

texts = ["5G NR architecture", "LTE handover", "Core network functions"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=32000)

# Mean-pool over real tokens only, ignoring padding positions
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 1536)
mask = inputs["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity of the first text against the other two
cos_sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:], dim=1)
print(cos_sim)
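
Building on the model and tokenizer loaded above, the sketch below shows a minimal semantic-search loop for the telecom retrieval use case: embed a query and a few candidate passages, then rank the passages by cosine similarity. The embed helper, query, and passages are illustrative assumptions, not part of a released API.

# Rank candidate telecom passages against a query by cosine similarity.
# embed() reuses the model/tokenizer loaded above; names and texts are illustrative.
def embed(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=32000)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query = "Which procedure hands a UE over between gNBs?"
docs = [
    "Xn-based inter-gNB handover procedure in 5G NR",
    "PDU session establishment in the 5G core",
    "Alarm correlation rules for optical transport faults",
]

q_emb = embed([query])
d_emb = embed(docs)
scores = torch.nn.functional.cosine_similarity(q_emb, d_emb, dim=1)
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")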

Citation

@article{ethiraj2025tvec,
  title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning},
  author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya},
  journal={arXiv preprint},
  year={2025},
  url={https://arxiv.org/abs/2504.16460}
}

References

  • Ethiraj, V., Menon, S., Vijay, D. “T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning.” arXiv:2504.16460, 2025.
  • Schroff, F., Kalenichenko, D., Philbin, J. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR, 2015.
  • Hermans, A., Beyer, L., Leibe, B. “In Defense of the Triplet Loss for Person Re-Identification.” arXiv:1703.07737, 2017.
  • Reimers, N., Gurevych, I. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP, 2019.
  • Gao, T., Yao, X., Chen, D. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv:2104.08821, 2021.
  • Gururangan, S., et al. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks.” ACL, 2020.
  • Lee, J., Yoon, W., et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics, 2020.
  • Sahu, S. K., Maheshwari, A. “Automatic extraction of telecom network events from log messages.” IEEE ICC, 2018.
  • Wang, X., Li, Y., Han, J. “Log2Vec: A Deep Embedding Model for Network Log Analysis.” IEEE/IFIP DSN, 2021.

Contact

