
Trained by Donkey Stereotype
Independent Implementation of ColBERTv2.0+ Models
Background: As part of this project, we will be releasing a set of models across weight classes: 1.) models that worked well, and 2.) experimental models, including failed attempts. This work stands on the shoulders of all the previous robust research on ColBERT and its variants.
What does this independent implementation entail?
- This is a humble effort to independently implement LightOn AI's GTE-ModernColBERT.
- Without using existing ColBERT libraries (or codebase) like PyLate or Stanford's recipe.
- Without any funding, grand GPU budgets, or formal research background.
As of this writing (2nd July 2025)
- LightOn AI's ColBERT is the best in the world and can be considered SOTA.
- Today we are humbled and thrilled to announce that prithivida/modern_colbert_base_en_v1 is the 2nd best ColBERT in the world. Borrowing Antoine Chaffin's words:
"This is the 2nd model to outperform ColBERT-small on BEIR. While it is also bigger, it is still a very lightweight model and benefits from the efficiency of ModernBERT!"
Comparison with Top ColBERTv2.0+ Models
Dataset / Model | GTE-ModernColBERT (LightOn AI) | modern_colbert_base_en_v1 (Ours) | ColBERT-small (Answer AI, reproduced by LightOn) | ColBERT-small (Answer AI, reported) |
---|---|---|---|---|
Outfit type | AI Lab with PhDs | Indie Researcher, No PhD, No GPU budgets :-) | AI Lab with PhDs | AI Lab with PhDs |
BEIR Average | 54.89 (🥇) | 54.51 (🥈) | 53.35 | 53.79 |
FiQA2018 | 48.51 | 43.96 | 41.01 | 41.15 |
NFCorpus | 37.93 | 37.23 | 36.86 | 37.3 |
TREC-COVID | 83.59 | 83.4 | 83.14 | 84.59 |
Touche2020 | 31.23 | 29.32 | 24.95 | 25.69 |
ArguAna | 48.51 | 52.05 | 46.76 | 50.09 |
QuoraRetrieval | 86.61 | 87.54 | 87.89 | 87.72 |
SCIDOCS | 19.06 | 19.42 | 18.72 | 18.42 |
SciFact | 76.34 | 76.44 | 74.02 | 74.77 |
NQ | 61.8 | 61.68 | 59.42 | 59.1 |
ClimateFEVER | 30.62 | 28.29 | 32.83 | 33.07 |
HotpotQA | 77.32 | 76.667 | 76.88 | 76.11 |
DBPedia | 48.03 | 46.31 | 46.36 | 45.58 |
CQADupstack | 41 | 42.2 | 39.36 | 38.75 |
FEVER | 87.44 | 88.106 | 88.66 | 90.96 |
MSMARCO | 45.32 | 44.993 | 43.44 | 43.5 |
- It turns out that a very modest GPU budget, a humble background and high-quality hard negative mining are a good start for independently implementing the ColBERTs that are in circulation today.
- detailed BEIR eval numbers
- nanoBEIR eval results
Comparison with legacy ColBERT models
Both the GTE-ModernColBERT and ColBERT-small model cards include this comparison against older ColBERT models. Please refer to them.
How to use / Running inference:
- Short term: We are releasing a lib called [lateness](https://github.com/PrithivirajDamodaran/lateness).
- Medium to long term: There are really strong storage and retrieval abstractions (vector DBs like Qdrant, Weaviate or Vespa that support multi-vectors) and strong ColBERT training libraries like PyLate, so we feel it is best to work with the authors and integrate. For now we offer only code to load the model, run inference and do some lightweight in-memory ranking (no heavy lifting like storing and retrieving using FAISS indexes).
Using modern_colbert to index and query with vector DBs like Qdrant:
pip install lateness # light CPU retrievals
or
pip install lateness[index] # GPU accelerated indexing into vdbs
Want to run Qdrant locally or use it in a production cluster? Try out an end-to-end example here.
from lateness import ModernColBERT

colbert = ModernColBERT("prithivida/modern_colbert_base_en_v1",
                        max_query_len=32,
                        max_doc_len=300)

documents = [
    "PyTorch is an open-source machine learning framework that provides tensor computations with GPU acceleration and deep neural networks built on tape-based autograd system.",
    "Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications across clusters of machines.",
    "REST APIs follow representational state transfer architectural style using HTTP methods like GET, POST, PUT, DELETE for stateless client-server communication.",
]
queries = [
    "How to build real-time data pipelines?",
    "What are the benefits of microservices?",
    "How to implement efficient web APIs?",
]

query_embeddings = colbert.encode_queries(queries)
doc_embeddings = colbert.encode_documents(documents)
scores = ModernColBERT.compute_similarity(query_embeddings, doc_embeddings)
print(scores)
Click here for inference code using Transformers
Copy and paste the next snippet (the class definition) before running the snippet below.
model_path = "prithivida/modern_colbert_base_en_v1"

try:
    colbert = ColBERT.load_for_inference(model_path, max_query_len=32, max_doc_len=300)

    # Test data
    queries = [
        "How does deep learning work?",
        "What is machine learning?",
        "What are neural networks?",
    ]
    documents = [
        "Machine learning is the idea of approximating a real world phenomenon using data, the approximation can be mathematical or otherwise.",
        "Deep learning uses neural networks with multiple layers to process data.",
        "Neural networks are computing systems inspired by biological neural networks.",
        "Artificial intelligence encompasses machine learning and deep learning.",
    ]

    # Encode and find similarity
    print("\n=== Encode and Calculate similarity ===")
    q_reps = colbert.encode_queries(queries, batch_size=4, to_cpu=True)
    p_reps = colbert.encode_documents(documents, batch_size=4, to_cpu=True)
    scores = colbert.compute_similarity(q_reps, p_reps)
    print(scores)

    # or Test single query ranking
    print("\n=== Single Query Ranking ===")
    query = "How does deep learning work?"
    results = colbert.rank_documents(query, documents, top_k=3)
    print(f"Query: {query}")
    for i, (doc_idx, score, doc_text) in enumerate(results):
        print(f" {i+1}. Score: {score:.4f} | Doc: {doc_text}")
except Exception as e:
    print(f"Error during testing: {e}")
import torch
from torch import nn
from transformers import PreTrainedModel, AutoConfig, AutoModel, AutoTokenizer
from transformers.modeling_outputs import BaseModelOutput
from tqdm import tqdm
from typing import List, Tuple, Union, Optional
import string
import os


class TaggingHead(nn.Module):
    def __init__(self, input_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(input_size, num_labels, bias=False)
        nn.init.xavier_uniform_(self.classifier.weight)

    def forward(self, x):
        return self.classifier(x)
class ColBERT(PreTrainedModel):
    config_class = AutoConfig
    base_model_prefix = "backbone"

    def __init__(self, config):
        super().__init__(config)
        self.backbone = AutoModel.from_config(config)
        hidden_dim = config.hidden_size
        self.heads = nn.ModuleDict({
            "col_pooling": TaggingHead(hidden_dim, num_labels=128)
        })
        # Inference settings (will be set when loading for inference)
        self.tokenizer = None
        self.max_query_len = 256
        self.max_doc_len = 300
        self.Q_PID = None
        self.D_PID = None

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()

    def forward(self, input_ids, attention_mask=None, position_ids=None, return_dict=False, **kwargs):
        kwargs.pop("token_type_ids", None)
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            return_dict=True,
            **kwargs
        )
        reps = outputs.last_hidden_state
        # L2-normalize each token vector, then zero out masked positions
        reps = torch.nn.functional.normalize(reps, p=2, dim=2)
        reps *= attention_mask[:, :, None].float()
        # Project every token to the 128-dimensional ColBERT space
        logits = self.heads["col_pooling"](reps)
        if return_dict:
            return BaseModelOutput(last_hidden_state=logits)
        return logits
    @classmethod
    def load_for_inference(cls, model_name_or_path: str, max_query_len: int = 256,
                           max_doc_len: int = 300, device: str = None):
        """
        Load ColBERT model with tokenizer for inference

        Args:
            model_name_or_path: HuggingFace model path or local directory
            max_query_len: Maximum query length
            max_doc_len: Maximum document length
            device: Device to run inference on (auto-detect if None)
        """
        device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        try:
            # Load model and tokenizer
            if os.path.exists(model_name_or_path):
                print(f"Loading model from local directory: {model_name_or_path}")
                config = AutoConfig.from_pretrained(model_name_or_path)
                model = cls.from_pretrained(model_name_or_path, config=config)
                tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
            else:
                print(f"Downloading model from HuggingFace Hub: {model_name_or_path}")
                config = AutoConfig.from_pretrained(model_name_or_path)
                model = cls.from_pretrained(model_name_or_path, config=config)
                tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

            # Setup inference configuration
            model.tokenizer = tokenizer
            model.max_query_len = max_query_len
            model.max_doc_len = max_doc_len
            model.Q_PID = tokenizer.convert_tokens_to_ids("[unused0]")
            model.D_PID = tokenizer.convert_tokens_to_ids("[unused1]")

            # Setup post-tokenization punctuation masking
            model.skip_ids = {tokenizer.encode(c, add_special_tokens=False)[0]
                              for c in string.punctuation}

            model.to(device)
            model.eval()
            print(f"ColBERT model loaded on {device}")
            print(f"Query max length: {max_query_len}, Document max length: {max_doc_len}")
            return model
        except Exception as e:
            print(f"Error loading model: {e}")
            raise
    def _encode_batch(self, ids: torch.Tensor, mask: torch.Tensor, to_cpu: bool = False):
        """Internal encoding function"""
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")
        ids, mask = ids.to(self.device), mask.to(self.device)
        pos = torch.arange(ids.size(1), device=self.device).unsqueeze(0).expand_as(ids)
        with torch.no_grad():
            rep = self(input_ids=ids, attention_mask=mask, position_ids=pos)
        return rep.cpu() if to_cpu else rep

    def encode_queries(self, queries: List[str], batch_size: Optional[int] = None, to_cpu: bool = False):
        """
        Encode queries for ColBERT retrieval

        Args:
            queries: List of query strings
            batch_size: Batch size for processing (None for single batch)
            to_cpu: Whether to move results to CPU
        Returns:
            Query representations tensor
        """
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")
        print(f"Encoding {len(queries)} queries...")

        # Tokenize with query prefix
        enc = self.tokenizer(queries, add_special_tokens=True, truncation=False)
        id_lists = [[self.Q_PID] + ids for ids in enc["input_ids"]]

        # Apply dynamic augmentation with length cap
        cap = self.max_query_len or (self.tokenizer.model_max_length - 1)
        id_lists = [_dynamic_augment(ids, self.tokenizer.mask_token_id, cap) for ids in id_lists]

        # Pad sequences
        padded = self.tokenizer.pad({"input_ids": id_lists}, padding=True, return_tensors="pt")
        ids, mask = padded["input_ids"], padded["attention_mask"]

        # Process in batches if specified
        if batch_size:
            reps = []
            for i, a in tqdm(_split_into_batches(ids, mask, batch_size), desc="Encoding query batches"):
                reps.append(self._encode_batch(i, a, to_cpu))
            return torch.cat(reps)
        return self._encode_batch(ids, mask, to_cpu)
    def encode_documents(self, documents: List[str], batch_size: Optional[int] = None,
                         keep_dims: bool = True, to_cpu: bool = False):
        """
        Encode documents for ColBERT retrieval with post-tokenization punctuation masking

        Args:
            documents: List of document strings
            batch_size: Batch size for processing (None for single batch)
            keep_dims: Whether to keep tensor dimensions (True) or return list of variable-length tensors
            to_cpu: Whether to move results to CPU
        Returns:
            Document representations tensor or list
        """
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")
        print(f"Encoding {len(documents)} documents...")

        # Tokenize documents WITHOUT removing punctuation (post-tokenization masking)
        enc = self.tokenizer(documents, add_special_tokens=True,
                             truncation=True, max_length=self.max_doc_len - 1)
        id_lists = [[self.D_PID] + ids for ids in enc["input_ids"]]

        # Pad sequences
        padded = self.tokenizer.pad({"input_ids": id_lists}, padding=True, return_tensors="pt")
        ids, mask = padded["input_ids"], padded["attention_mask"]

        # Apply post-tokenization punctuation masking
        mask[torch.isin(ids, torch.tensor(list(self.skip_ids), device=ids.device))] = 0

        # Process in batches if specified
        if batch_size:
            ids_s, mask_s, rev = _sort_by_length(ids, mask, batch_size)
            reps = []
            for i, a in tqdm(_split_into_batches(ids_s, mask_s, batch_size), desc="Encoding document batches"):
                rep = self._encode_batch(i, a, to_cpu)
                if not keep_dims:
                    # Convert to list of variable-length tensors
                    m = a.cpu().bool() if to_cpu else a.bool()
                    rep = [r[m[idx]] for idx, r in enumerate(rep)]
                reps.append(rep)
            if keep_dims:
                return _stack_3D_tensors(reps)[rev]
            else:
                # Flatten and reorder
                flat = [d for g in reps for d in g]
                return [flat[i] for i in rev.tolist()]

        # Single batch processing
        rep = self._encode_batch(ids, mask, to_cpu)
        if not keep_dims:
            m = mask.cpu().bool() if to_cpu else mask.bool()
            rep = [r[m[idx]] for idx, r in enumerate(rep)]
        return rep
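    # Declared static because it takes no `self`; this keeps calls such as
    # colbert.compute_similarity(q_reps, p_reps) and self.compute_similarity(...) valid.
    @staticmethod
    def compute_similarity(q_reps: torch.Tensor, p_reps: torch.Tensor):
        """
        Compute ColBERT-style max similarity between queries and passages

        Args:
            q_reps: Query representations [num_queries, max_q_len, dim]
            p_reps: Passage representations [num_passages, max_p_len, dim]
        Returns:
            Similarity scores [num_queries, num_passages]
        """
        # Dot product of every query token with every passage token
        token_scores = torch.einsum("qin,pjn->qipj", q_reps, p_reps)
        # Best-matching passage token per query token, then sum over query tokens
        scores, _ = token_scores.max(-1)
        scores = scores.sum(1)
        return scores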
    def search(self, queries: List[str], documents: List[str],
               batch_size: Optional[int] = None, return_scores: bool = True):
        """
        End-to-end search: encode queries and documents, compute similarities

        Args:
            queries: List of query strings
            documents: List of document strings
            batch_size: Batch size for encoding
            return_scores: Whether to return similarity scores
        Returns:
            If return_scores=True: (scores, query_reps, doc_reps)
            If return_scores=False: (query_reps, doc_reps)
        """
        # Encode queries and documents
        q_reps = self.encode_queries(queries, batch_size=batch_size, to_cpu=True)
        p_reps = self.encode_documents(documents, batch_size=batch_size, to_cpu=True)
        if return_scores:
            # Compute similarities
            print("Computing similarities...")
            scores = self.compute_similarity(q_reps, p_reps)
            return scores, q_reps, p_reps
        return q_reps, p_reps

    def rank_documents(self, query: str, documents: List[str], top_k: int = 10):
        """
        Rank documents for a single query

        Args:
            query: Query string
            documents: List of document strings
            top_k: Number of top results to return
        Returns:
            List of (document_index, score, document_text) tuples
        """
        scores, _, _ = self.search([query], documents, return_scores=True)
        scores = scores.squeeze(0)  # Remove query dimension

        # Get top-k results
        top_indices = torch.topk(scores, min(top_k, len(documents))).indices
        results = []
        for idx in top_indices:
            results.append((idx.item(), scores[idx].item(), documents[idx.item()]))
        return results
# ---------------------------------------------------------------------------
# Helper Functions
# ---------------------------------------------------------------------------
def _split_into_batches(ids: torch.Tensor, mask: torch.Tensor, bsize: int):
    return [(ids[i:i + bsize], mask[i:i + bsize])
            for i in range(0, ids.size(0), bsize)]


def _sort_by_length(ids: torch.Tensor, mask: torch.Tensor, bsize: int):
    if ids.size(0) <= bsize:
        return ids, mask, torch.arange(ids.size(0))
    lengths = mask.sum(-1)
    order = lengths.sort().indices
    reverse = order.sort().indices
    return ids[order], mask[order], reverse


def _dynamic_augment(ids: List[int], mask_id: int, max_cap: int = None) -> List[int]:
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]
    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)
    if target - q_len < 8:
        target = q_len + 8
    if max_cap is not None:
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)


def _stack_3D_tensors(groups):
    bsize = sum(x.size(0) for x in groups)
    maxlen = max(x.size(1) for x in groups)
    hdim = groups[0].size(2)
    out = torch.zeros(bsize, maxlen, hdim, device=groups[0].device, dtype=groups[0].dtype)
    ptr = 0
    for g in groups:
        out[ptr:ptr + g.size(0), :g.size(1)] = g
        ptr += g.size(0)
    return out
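To make the scoring in `compute_similarity` above concrete, here is a minimal pure-Python sketch of the same MaxSim idea on toy data. The helper names `dot` and `maxsim` and all vector values are made up for illustration; they are not model output and not part of the code above.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """For each query token take its best dot product over all document
    tokens, then sum those maxima over the query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 0.5], [0.2, 0.8]]

score_a = maxsim(query, doc_a)  # max(1.0, 0.5) + max(0.0, 0.5) = 1.5
score_b = maxsim(query, doc_b)  # max(0.0, 0.0, 0.2) + max(1.0, 0.5, 0.8) = 1.2
print(score_a, score_b)
```

The einsum version computes the same quantity for every (query, passage) pair at once: `"qin,pjn->qipj"` builds the token-by-token dot products, `max(-1)` picks the best passage token per query token, and `sum(1)` adds them up over the query tokens.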
Click here for inference code using ONNX
Copy and paste the next snippet (the class definition) before running the snippet below.
# Both paths are read from the local filesystem (the tokenizer files are opened
# directly and the ONNX file is passed to onnxruntime), so download the model
# repository first and point these at the local copy.
model_path = "prithivida/modern_colbert_base_en_v1"
onnx_model_path = "prithivida/modern_colbert_base_en_v1/onnx/model.onnx"

# Load ONNX model for inference using the standalone tokenizer path
onnx_colbert = ONNXColBERT(onnx_model_path, model_path, max_query_len=32, max_doc_len=300)  # Pass model_path as tokenizer_path

# Test inference
queries = [
    "How does deep learning work?",
    "What is machine learning?",
    "What are neural networks?",
]
documents = [
    "Machine learning is the idea of approximating a real world phenomenon using data, the approximation can be mathematical or otherwise.",
    "Deep learning uses neural networks with multiple layers to process data.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "Artificial intelligence encompasses machine learning and deep learning.",
]

# Encode and find similarity
print("\n=== ONNX Encode and Compute similarity ===")
q_reps = onnx_colbert.encode_queries(queries, batch_size=4, to_cpu=True)
p_reps = onnx_colbert.encode_documents(documents, batch_size=4, to_cpu=True)
scores = onnx_colbert.compute_similarity(q_reps, p_reps)

# or Test single query ranking
print("\n=== ONNX Standalone Single Query Ranking ===")
query = "How does deep learning work?"
results = onnx_colbert.rank_documents(query, documents, top_k=3)
print(f"Query: {query}")
for i, (doc_idx, score, doc_text) in enumerate(results):
    print(f" {i+1}. Score: {score:.4f} | Doc: {doc_text}")
import numpy as np
import onnxruntime as ort
from tokenizers import AddedToken, Tokenizer
import json
import string
from pathlib import Path
from typing import List, Optional, Tuple, Union
from tqdm import tqdm


# ---------------------------------------------------------------------------
# ONNX ColBERT Class
# ---------------------------------------------------------------------------
class ONNXColBERT:
    def __init__(self, onnx_model_path: str, tokenizer_path: str,
                 max_query_len: int = 256, max_doc_len: int = 300,
                 providers: Optional[List[str]] = None):
        """
        ONNX ColBERT - identical to PyTorch ColBERT.load_for_inference()

        Args:
            onnx_model_path: Path to the ONNX model file
            tokenizer_path: Path to the tokenizer directory
            max_query_len: Maximum query length
            max_doc_len: Maximum document length
            providers: ONNX Runtime providers
        """
        # Load standalone tokenizer
        self.model_dir = Path(tokenizer_path)
        self.tokenizer = self._get_tokenizer(max_length=512)
        self.max_query_len = max_query_len
        self.max_doc_len = max_doc_len

        # Setup inference configuration
        self.Q_PID = self.tokenizer.token_to_id("[unused0]")
        self.D_PID = self.tokenizer.token_to_id("[unused1]")
        self.mask_token_id = self.tokenizer.token_to_id("[MASK]")
        if None in [self.Q_PID, self.D_PID, self.mask_token_id]:
            raise ValueError("Could not find required special tokens in tokenizer")

        # Setup post-tokenization punctuation masking
        self.skip_ids = set()
        for c in string.punctuation:
            encoded = self.tokenizer.encode(c, add_special_tokens=False)
            if len(encoded.ids) > 0:
                self.skip_ids.add(encoded.ids[0])
        print(f"Identified {len(self.skip_ids)} punctuation token IDs to skip")

        # Initialize ONNX Runtime session
        if providers is None:
            providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
        self.session = ort.InferenceSession(onnx_model_path, providers=providers)
        print(f"ONNX ColBERT loaded with providers: {self.session.get_providers()}")
        print(f"Query max length: {max_query_len}, Document max length: {max_doc_len}")
    def _get_tokenizer(self, max_length: int = 512) -> Tokenizer:
        """Initialize tokenizer"""
        with open(str(self.model_dir / "config.json")) as config_file:
            config = json.load(config_file)
        with open(str(self.model_dir / "tokenizer_config.json")) as tokenizer_config_file:
            tokenizer_config = json.load(tokenizer_config_file)
        with open(str(self.model_dir / "special_tokens_map.json")) as tokens_map_file:
            tokens_map = json.load(tokens_map_file)

        tokenizer = Tokenizer.from_file(str(self.model_dir / "tokenizer.json"))
        tokenizer.enable_truncation(max_length=min(tokenizer_config["model_max_length"], max_length))
        tokenizer.enable_padding(pad_id=config["pad_token_id"], pad_token=tokenizer_config["pad_token"])
        for token in tokens_map.values():
            if isinstance(token, str):
                tokenizer.add_special_tokens([token])
            elif isinstance(token, dict):
                tokenizer.add_special_tokens([AddedToken(**token)])
        return tokenizer

    def _encode_batch(self, ids: np.ndarray, mask: np.ndarray, to_cpu: bool = False) -> np.ndarray:
        """Internal encoding function"""
        # Create position IDs
        pos = np.arange(ids.shape[1])[None, :].repeat(ids.shape[0], axis=0)

        # ONNX inference
        inputs = {
            "input_ids": ids.astype(np.int64),
            "attention_mask": mask.astype(np.int64),
            "position_ids": pos.astype(np.int64)
        }
        outputs = self.session.run(["last_hidden_state"], inputs)
        return outputs[0]
    def encode_queries(self, queries: List[str], batch_size: Optional[int] = None,
                       to_cpu: bool = False) -> np.ndarray:
        """Encode queries - IDENTICAL to PyTorch ColBERT.encode_queries()"""
        print(f"Encoding {len(queries)} queries...")

        # Tokenize with query prefix
        encoded_queries = self.tokenizer.encode_batch(queries, add_special_tokens=True)
        id_lists = [[self.Q_PID] + encoded.ids for encoded in encoded_queries]

        # Apply dynamic augmentation with length cap
        cap = self.max_query_len or 511
        id_lists = [_dynamic_augment(ids, self.mask_token_id, cap) for ids in id_lists]

        # Manual padding
        max_len = max(len(ids) for ids in id_lists)
        batch_size_actual = len(id_lists)
        ids = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        mask = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        for i, id_list in enumerate(id_lists):
            ids[i, :len(id_list)] = id_list
            mask[i, :len(id_list)] = 1

        # Process in batches if specified
        if batch_size:
            reps = []
            for i, a in tqdm(_split_into_batches(ids, mask, batch_size), desc="Encoding query batches"):
                reps.append(self._encode_batch(i, a, to_cpu))
            return np.concatenate(reps, axis=0)
        return self._encode_batch(ids, mask, to_cpu)
    def encode_documents(self, documents: List[str], batch_size: Optional[int] = None,
                         keep_dims: bool = True, to_cpu: bool = False) -> Union[np.ndarray, List[np.ndarray]]:
        """Encode documents - IDENTICAL to PyTorch ColBERT.encode_documents()"""
        print(f"Encoding {len(documents)} documents...")

        # Encode documents individually to preserve natural lengths
        encoded_docs = []
        for doc in documents:
            encoded = self.tokenizer.encode(doc, add_special_tokens=True)
            encoded_docs.append(encoded)

        id_lists = []
        for encoded in encoded_docs:
            ids = encoded.ids
            # Truncate to max_doc_len - 1
            if len(ids) > self.max_doc_len - 1:
                ids = ids[:self.max_doc_len - 1]
            # Add D_PID prefix
            ids = [self.D_PID] + ids
            id_lists.append(ids)

        # Manual padding
        max_len = max(len(ids) for ids in id_lists)
        batch_size_actual = len(id_lists)
        ids = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        mask = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        for i, id_list in enumerate(id_lists):
            ids[i, :len(id_list)] = id_list
            mask[i, :len(id_list)] = 1

        # Apply post-tokenization punctuation masking
        for skip_id in self.skip_ids:
            mask[ids == skip_id] = 0

        # Process in batches if specified
        if batch_size:
            ids_s, mask_s, rev = _sort_by_length(ids, mask, batch_size)
            reps = []
            for i, a in tqdm(_split_into_batches(ids_s, mask_s, batch_size), desc="Encoding document batches"):
                rep = self._encode_batch(i, a, to_cpu)
                if not keep_dims:
                    m = a.astype(bool)
                    rep = [r[m[idx]] for idx, r in enumerate(rep)]
                reps.append(rep)
            if keep_dims:
                return _stack_3D_arrays(reps)[rev]
            else:
                flat = [d for g in reps for d in g]
                return [flat[i] for i in rev.tolist()]

        # Single batch processing
        rep = self._encode_batch(ids, mask, to_cpu)
        if not keep_dims:
            m = mask.astype(bool)
            rep = [r[m[idx]] for idx, r in enumerate(rep)]
        return rep
    # Declared static because it takes no `self`; this keeps calls such as
    # onnx_colbert.compute_similarity(q_reps, p_reps) and self.compute_similarity(...) valid.
    @staticmethod
    def compute_similarity(q_reps: np.ndarray, p_reps: np.ndarray) -> np.ndarray:
        """Compute ColBERT similarity - IDENTICAL to PyTorch version"""
        # Identical to PyTorch: torch.einsum("qin,pjn->qipj", q_reps, p_reps)
        token_scores = np.einsum("qin,pjn->qipj", q_reps, p_reps)
        # Identical to PyTorch: scores, _ = token_scores.max(-1)
        scores = np.max(token_scores, axis=-1)
        # Identical to PyTorch: scores = scores.sum(1)
        scores = np.sum(scores, axis=1)
        return scores
    def search(self, queries: List[str], documents: List[str],
               batch_size: Optional[int] = None, return_scores: bool = True):
        """End-to-end search - IDENTICAL to PyTorch ColBERT.search()"""
        # Encode queries and documents
        q_reps = self.encode_queries(queries, batch_size=batch_size, to_cpu=True)
        p_reps = self.encode_documents(documents, batch_size=batch_size, to_cpu=True)
        if return_scores:
            # Compute similarities
            print("Computing similarities...")
            scores = self.compute_similarity(q_reps, p_reps)
            return scores, q_reps, p_reps
        return q_reps, p_reps

    def rank_documents(self, query: str, documents: List[str], top_k: int = 10) -> List[Tuple]:
        """Rank documents - IDENTICAL to PyTorch ColBERT.rank_documents()"""
        scores, _, _ = self.search([query], documents, return_scores=True)
        scores = scores.squeeze(0)

        # Get top-k results
        top_indices = np.argsort(scores)[::-1][:min(top_k, len(documents))]
        results = []
        for idx in top_indices:
            results.append((int(idx), float(scores[idx]), documents[idx]))
        return results
# ---------------------------------------------------------------------------
# Helper Functions (NumPy versions)
# ---------------------------------------------------------------------------
def _split_into_batches(ids: np.ndarray, mask: np.ndarray, bsize: int):
    return [(ids[i:i + bsize], mask[i:i + bsize])
            for i in range(0, ids.shape[0], bsize)]


def _sort_by_length(ids: np.ndarray, mask: np.ndarray, bsize: int):
    if ids.shape[0] <= bsize:
        return ids, mask, np.arange(ids.shape[0])
    lengths = mask.sum(-1)
    order = np.argsort(lengths)
    reverse = np.argsort(order)
    return ids[order], mask[order], reverse


def _dynamic_augment(ids: List[int], mask_id: int, max_cap: int = None) -> List[int]:
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]
    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)
    if target - q_len < 8:
        target = q_len + 8
    if max_cap is not None:
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)


def _stack_3D_arrays(groups):
    bsize = sum(x.shape[0] for x in groups)
    maxlen = max(x.shape[1] for x in groups)
    hdim = groups[0].shape[2]
    out = np.zeros((bsize, maxlen, hdim), dtype=groups[0].dtype)
    ptr = 0
    for g in groups:
        out[ptr:ptr + g.shape[0], :g.shape[1]] = g
        ptr += g.shape[0]
    return out
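The `_dynamic_augment` helper used by both implementations decides how many `[MASK]` tokens are appended to a query. The sketch below copies its logic into a standalone function and prints the resulting lengths for a few query sizes. `MASK_ID`, `dynamic_augment` and `padded_len` are illustrative names, and the placeholder mask id is an assumption (the real code uses the tokenizer's `[MASK]` id).

```python
from typing import List, Optional

MASK_ID = -1  # placeholder for illustration; the real id comes from the tokenizer


def dynamic_augment(ids: List[int], mask_id: int, max_cap: Optional[int] = None) -> List[int]:
    # Same logic as the _dynamic_augment helper above.
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]                      # over the cap: truncate, no masks
    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)   # round up to a multiple of 32
    if target - q_len < 8:                        # aim for at least 8 mask tokens
        target = q_len + 8
    if max_cap is not None:                       # the cap always wins
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)


def padded_len(n_tokens: int, cap: Optional[int] = None) -> int:
    """Length after augmentation for a query of n_tokens ids."""
    return len(dynamic_augment(list(range(n_tokens)), MASK_ID, cap))


print(padded_len(10, 32))   # 32: padded with 22 masks
print(padded_len(30, 32))   # 32: wants 38, but the cap allows only 2 masks
print(padded_len(40, 32))   # 32: truncated to the cap
print(padded_len(30))       # 38: no cap, 8 masks appended
print(padded_len(33))       # 64: next multiple of 32
print(padded_len(60, 256))  # 68: 64 would leave only 4 masks, so 60 + 8
```

With `max_query_len=32`, as in the usage snippets, every query therefore ends up at most 32 tokens long, and short queries are filled up to 32 with mask tokens.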
Notes on reproducing
We welcome anyone to reproduce our results. Here are some tips and observations:
- Please pay attention to the query length. We tried our best to look at what the original ColBERTv2.0 used, what LightOn AI used and also spoke to Nils Reimers on taking liberty in the choice of query lengths.
- Note on query length from ColBERTv2.0 paper:
Unless otherwise stated, we keep the default query maximum sequence length for ColBERTv2 and RocketQAv2, which is 32 tokens. For the ArguAna test in BEIR, as the queries are themselves long documents, we set the maximum query length used by ColBERTv2 and RocketQAv2 to 300. For Climate-FEVER, as the queries are relatively long sentence claims, we set the maximum query length used by ColBERTv2 to 64.
- Query lengths used by LightOn AI's PyLate (assuming the OSS code is what they used):
query_len = { "quora": 32, "climate-fever": 64, "nq": 32, "msmarco": 32, "hotpotqa": 32, "nfcorpus": 32, "scifact": 48, "trec-covid": 48, "fiqa": 32, "arguana": 64, "scidocs": 48, "dbpedia-entity": 32, "webis-touche2020": 32, "fever": 32, "cqadupstack/android": 32, "cqadupstack/english": 32, "cqadupstack/gaming": 32, "cqadupstack/gis": 32, "cqadupstack/mathematica": 32, "cqadupstack/physics": 32, "cqadupstack/programmers": 32, "cqadupstack/stats": 32, "cqadupstack/tex": 32, "cqadupstack/unix": 32, "cqadupstack/webmasters": 32, "cqadupstack/wordpress": 32, }
- This is what OG Nils had to say when I asked why there is so much liberty in the choice of query length:
Comparison is always hard...I think query length doesn't skew too much. Retrieval compute scales linear with the number of query tokens. So if people are comfortable to compare models with largely different parameters, comparing different query token lengths would be fine as well
- We took a balanced view of both choices and borrowed the query length defaults used by LightOn, with the only exception of ArguAna: instead of the original ColBERT's 300 or LightOn's 64, we used 256.
- Nota bene: There may be minor differences in the numbers when reproducing. For instance, BGE-M3 reports an nDCG@10 of 59.3 for MIRACL Hindi and we observed only 58.9. But these are not massive differences like those between the reported and reproduced ColBERT-small numbers on some datasets.
Here are our numbers for the full Hindi run on BGE-M3:
{'NDCG@1': 0.49714, 'NDCG@3': 0.5115, 'NDCG@5': 0.53908, 'NDCG@10': 0.58936, 'NDCG@100': 0.6457, 'NDCG@1000': 0.65336}
{'MAP@1': 0.28845, 'MAP@3': 0.42424, 'MAP@5': 0.46455, 'MAP@10': 0.49955, 'MAP@100': 0.51886, 'MAP@1000': 0.51933}
{'Recall@10': 0.73032, 'Recall@50': 0.8987, 'Recall@100': 0.93974, 'Recall@200': 0.95763, 'Recall@500': 0.97813, 'Recall@1000': 0.9902}
{'P@1': 0.49714, 'P@3': 0.33048, 'P@5': 0.24629, 'P@10': 0.15543, 'P@100': 0.0202, 'P@1000': 0.00212}
{'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
- We made sure all quirks and known BEIR ColBERT issues are taken care of:
- ArguAna and Quora (?) self-match issues
- Will add more - TBA
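The query-length choice described in the notes above can be written down as a small lookup. This is only a sketch: `DEFAULT_QUERY_LEN`, `QUERY_LEN_OVERRIDES` and `query_len_for` are hypothetical names, the values are taken from the LightOn table quoted above plus the ArguAna override of 256, and falling back to 32 for any dataset not listed is an assumption.

```python
# Every dataset in the LightOn table uses 32 unless listed here.
DEFAULT_QUERY_LEN = 32

QUERY_LEN_OVERRIDES = {
    "climate-fever": 64,
    "scifact": 48,
    "trec-covid": 48,
    "scidocs": 48,
    # This card's choice; the ColBERTv2 paper used 300 and the LightOn table lists 64.
    "arguana": 256,
}


def query_len_for(dataset: str) -> int:
    """Return the maximum query length to use when evaluating `dataset`."""
    return QUERY_LEN_OVERRIDES.get(dataset, DEFAULT_QUERY_LEN)


for name in ["nq", "scifact", "climate-fever", "arguana", "cqadupstack/tex"]:
    print(name, query_len_for(name))
```

The returned value would then be passed as `max_query_len` when loading the model for that dataset's evaluation run.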
Acknowledgements
- Thanks to Alibaba-NLP for Alibaba-NLP/gte-modernbert-base, which is our base model (as used by LightOn AI)
- Thanks to Nils Reimers for the tips and inputs.
- Thanks to Nandan Thakur for answering questions.
- Thanks to Antoine Chaffin and the entire LightOn team for PyLate.
- Thanks to the NanoBEIR authors, it's a blessing.
- Thanks to Prithivi Da for his generous funding for this work :-)
Open Questions (still have on ColBERT) / thoughts:
- People who have worked on ColBERT would agree that MarginMSE loss sucks and KLDiv works great for ColBERT in practice. Is there a formal / mathematical study on why MarginMSE does so badly? (JaColBERT has done some ablations, but we would love to read why.)
- What does BERT as an encoder architecture bring that makes it the best choice for ColBERT compared to other encoder architectures?
- What were the temperature choices for ColBERT for query and doc scores?
- Alibaba-NLP/gte-modernbert-base's BEIR avg is 55.33 and beats the best ColBERTs in the world (as of 2nd July 2025), so calling single-vec models naive is naive.
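For readers less familiar with the two distillation losses named in the first question, here is a minimal pure-Python sketch of their usual textbook definitions for a single query. It is not the training code behind this model, it does not answer the question, and all function names and score values are invented for illustration.

```python
import math
from typing import List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def margin_mse(student_pos: float, student_neg: float,
               teacher_pos: float, teacher_neg: float) -> float:
    """Squared error between the student's and the teacher's (pos - neg) margin."""
    return ((student_pos - student_neg) - (teacher_pos - teacher_neg)) ** 2


def kl_div(student_scores: List[float], teacher_scores: List[float],
           temperature: float = 1.0) -> float:
    """KL(teacher || student) between softmax distributions over the candidates."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# One positive followed by two negatives (hypothetical scores).
teacher = [9.0, 3.0, 1.0]
shifted = [20.0, 14.0, 12.0]   # same margins as the teacher, different offset
squashed = [12.0, 9.0, 7.0]    # same ordering, smaller margins

print(margin_mse(shifted[0], shifted[1], teacher[0], teacher[1]))    # 0.0
print(kl_div(shifted, teacher))                                      # 0.0
print(margin_mse(squashed[0], squashed[1], teacher[0], teacher[1]))  # 9.0
print(kl_div(squashed, teacher) > 0)                                 # True
```

MarginMSE asks the student to reproduce the teacher's raw score margins pair by pair, whereas the KL objective only compares the normalized distributions over all candidates for a query.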
Wishlist
- When I can spend more on GPUs:
- would love to try and reproduce LightOn AI's GTE-ModernColBERT BEIR eval numbers.
- would run eval for prithivida/modern_colbert_base_en_v1 on long-document benchmarks.