Independent Implementation of ColBERTv2.0+ Models

Background: As part of this project, we will be releasing a set of models across weight classes: 1.) Models that worked well, 2.) Experimental models, including failed attempts. This work stands on the shoulders of all previous robust research on ColBERT and variants.

What does this independent implementation entail?

This is a humble effort to independently implement LightOn AI's GTE-ModernColBERT.

Without using existing ColBERT libraries (or codebase) like PyLate or Stanford's recipe.

Without any funding, grand GPU budgets, or formal research background.

As of this writing (2nd July 2025)

LightOn AI's ColBERT is the best in the world and can be considered SOTA.
Today we are humbled and thrilled to announce that prithivida/modern_colbert_base_en_v1 is the 2nd best ColBERT in the world. Borrowing Antoine Chaffin's words:

"This is the 2nd model to outperform ColBERT-small on BEIR. While it is also bigger, it is still a very lightweight model and benefits from the efficiency of ModernBERT!"

Comparison with Top ColBERTv2.0+ Models

Dataset / Model	GTE-ModernColBERT (Lighton AI)	modern_colbert_base_en_v1 (Ours)	ColBERT-small (Answer AI, reproduced by Lighton)	ColBERT-small (Answer AI, reported)
Outfit type	AI Lab with PhDs	Indie Researcher, No PhD, No GPU budgets :-)	AI Lab with PhDs	AI Lab with PhDs
BEIR Average	54.89 (🥇)	54.51 (🥈)	53.35	53.79
FiQA2018	48.51	43.96	41.01	41.15
NFCorpus	37.93	37.23	36.86	37.3
TREC-COVID	83.59	83.4	83.14	84.59
Touche2020	31.23	29.32	24.95	25.69
ArguAna	48.51	52.05	46.76	50.09
QuoraRetrieval	86.61	87.54	87.89	87.72
SCIDOCS	19.06	19.42	18.72	18.42
SciFact	76.34	76.44	74.02	74.77
NQ	61.8	61.68	59.42	59.1
ClimateFEVER	30.62	28.29	32.83	33.07
HotpotQA	77.32	76.667	76.88	76.11
DBPedia	48.03	46.31	46.36	45.58
CQADupstack	41	42.2	39.36	38.75
FEVER	87.44	88.106	88.66	90.96
MSMARCO	45.32	44.993	43.44	43.5

It turns out that a very modest GPU budget, a humble background and high-quality hard negative mining are a good start for independently implementing the ColBERTs that are in circulation today.
detailed BEIR eval numbers
nanoBEIR eval results

Comparison with legacy ColBERT models

Both the GTE-ModernColBERT and ColBERT-small model cards include this comparison against older ColBERT models; please refer to them.

How to use / Running inference:

Short term: We are releasing a library called [lateness](https://github.com/PrithivirajDamodaran/lateness).
Medium to long term: There are already strong storage and retrieval abstractions (vector DBs such as Qdrant, Weaviate or Vespa that support multi-vectors) and strong ColBERT training libraries such as PyLate, so we feel it is best to work with their authors and integrate. For now we offer only code to load the model, run inference and do some lightweight in-memory ranking (no heavy lifting such as storing and retrieving with FAISS indexes).

Using modern_colbert to index and query with vector DBs like Qdrant.

pip install lateness # light CPU retrievals
or
pip install lateness[index] # GPU accelerated indexing into vdbs

Want to run Qdrant locally or use it in a production cluster? Try out an end-to-end example here.

from lateness import ModernColBERT
colbert = ModernColBERT("prithivida/modern_colbert_base_en_v1",
                        max_query_len = 32,
                        max_doc_len = 300)


documents = [
    "PyTorch is an open-source machine learning framework that provides tensor computations with GPU acceleration and deep neural networks built on tape-based autograd system.",
    "Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications across clusters of machines.",
    "REST APIs follow representational state transfer architectural style using HTTP methods like GET, POST, PUT, DELETE for stateless client-server communication.",
]

queries = [
    "How to build real-time data pipelines?",
    "What are the benefits of microservices?",
    "How to implement efficient web APIs?"
]

query_embeddings = colbert.encode_queries(queries)
doc_embeddings = colbert.encode_documents(documents)
scores = ModernColBERT.compute_similarity(query_embeddings, doc_embeddings)
print(scores)
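
The score matrix can be turned into a per-query ranking with a few lines of plain Python. The sketch below assumes the matrix is laid out as [num_queries, num_documents], as in the Transformers code further down this card; `top_k_per_query` and the toy score values are hypothetical and only there for illustration. A tensor or array result would first need converting with `.tolist()`.

```python
from typing import List, Tuple


def top_k_per_query(scores: List[List[float]], k: int) -> List[List[Tuple[int, float]]]:
    """For each query row, return the k best (document_index, score) pairs."""
    ranked = []
    for row in scores:
        # Document indices sorted by descending score.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, row[j]) for j in order[:k]])
    return ranked


# Hypothetical scores for 2 queries x 3 documents (not real model output).
toy_scores = [
    [3.1, 7.4, 5.0],
    [6.2, 2.8, 4.9],
]
ranking = top_k_per_query(toy_scores, k=2)
print(ranking)
```

Asking for more results than there are documents simply returns all of them, best first.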

Click here for inference code using Transformers

Run the second snippet below (the ColBERT class definition and its helper functions) before running the first one, which only shows usage.

model_path = "prithivida/modern_colbert_base_en_v1"  

try:
    
    colbert = ColBERT.load_for_inference(model_path, max_query_len=32, max_doc_len=300)
    
    # Test data
    queries = [
        "How does deep learning work?",
        "What is machine learning?",
        "What are neural networks?"
    ]
    
    documents = [
        "Machine learning is the idea of approximating a real world phenomenon using data, the approximation can be mathematical or otherwise.",
        "Deep learning uses neural networks with multiple layers to process data.",
        "Neural networks are computing systems inspired by biological neural networks.",
        "Artificial intelligence encompasses machine learning and deep learning.",
    ]

    # Encode and find similarity
    print("\n=== Encode and Calculate similarity ===")
    q_reps = colbert.encode_queries(queries, batch_size=4, to_cpu=True)
    p_reps = colbert.encode_documents(documents, batch_size=4, to_cpu=True)
    scores = colbert.compute_similarity(q_reps, p_reps)
    print(scores)
    
    # or Test single query ranking
    print("\n=== Single Query Ranking ===")
    query = "How does deep learning work?"
    results = colbert.rank_documents(query, documents, top_k=3)
    
    print(f"Query: {query}")
    for i, (doc_idx, score, doc_text) in enumerate(results):
        print(f"  {i+1}. Score: {score:.4f} | Doc: {doc_text}")
    
    
except Exception as e:
    print(f"Error during testing: {e}")

import torch
from torch import nn
from transformers import PreTrainedModel, AutoConfig, AutoModel, AutoTokenizer
from transformers.modeling_outputs import BaseModelOutput
from tqdm import tqdm
from typing import List, Tuple, Union, Optional
import string
import os


class TaggingHead(nn.Module):
    def __init__(self, input_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(input_size, num_labels, bias=False)
        nn.init.xavier_uniform_(self.classifier.weight)

    def forward(self, x):
        return self.classifier(x)


class ColBERT(PreTrainedModel):
    config_class = AutoConfig
    base_model_prefix = "backbone"
    
    def __init__(self, config):
        super().__init__(config)
        self.backbone = AutoModel.from_config(config)
        hidden_dim = config.hidden_size
        self.heads = nn.ModuleDict({
            "col_pooling": TaggingHead(hidden_dim, num_labels=128)
        })
        
        # Inference settings (will be set when loading for inference)
        self.tokenizer = None
        self.max_query_len = 256
        self.max_doc_len = 300
        self.Q_PID = None
        self.D_PID = None
    
    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
    
    def forward(self, input_ids, attention_mask=None, position_ids=None, return_dict=False, **kwargs):
        kwargs.pop("token_type_ids", None)
        
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            return_dict=True,
            **kwargs
        )
        
        reps = outputs.last_hidden_state
        reps = torch.nn.functional.normalize(reps, p=2, dim=2)
        reps *= attention_mask[:, :, None].float()
        logits = self.heads["col_pooling"](reps)
        
        if return_dict:
            return BaseModelOutput(last_hidden_state=logits)
        return logits
    
    @classmethod
    def load_for_inference(cls, model_name_or_path: str, max_query_len: int = 256, 
                          max_doc_len: int = 300, device: str = None):
        """
        Load ColBERT model with tokenizer for inference
        
        Args:
            model_name_or_path: HuggingFace model path or local directory
            max_query_len: Maximum query length
            max_doc_len: Maximum document length
            device: Device to run inference on (auto-detect if None)
        """
        device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        
        try:
            # Load model and tokenizer
            if os.path.exists(model_name_or_path):
                print(f"Loading model from local directory: {model_name_or_path}")
                config = AutoConfig.from_pretrained(model_name_or_path)
                model = cls.from_pretrained(model_name_or_path, config=config)
                tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
            else:
                print(f"Downloading model from HuggingFace Hub: {model_name_or_path}")
                config = AutoConfig.from_pretrained(model_name_or_path)
                model = cls.from_pretrained(model_name_or_path, config=config)
                tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
            
            # Setup inference configuration
            model.tokenizer = tokenizer
            model.max_query_len = max_query_len
            model.max_doc_len = max_doc_len
            model.Q_PID = tokenizer.convert_tokens_to_ids("[unused0]")
            model.D_PID = tokenizer.convert_tokens_to_ids("[unused1]")
            # Setup post-tokenization punctuation masking
            model.skip_ids = {tokenizer.encode(c, add_special_tokens=False)[0]
                             for c in string.punctuation}
            
            model.to(device)
            model.eval()
            
            print(f"ColBERT model loaded on {device}")
            print(f"Query max length: {max_query_len}, Document max length: {max_doc_len}")
            
            return model
            
        except Exception as e:
            print(f"Error loading model: {e}")
            raise
    
    def _encode_batch(self, ids: torch.Tensor, mask: torch.Tensor, to_cpu: bool = False):
        """Internal encoding function"""
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")
        
        ids, mask = ids.to(self.device), mask.to(self.device)
        pos = torch.arange(ids.size(1), device=self.device).unsqueeze(0).expand_as(ids)
        
        with torch.no_grad():
            rep = self(input_ids=ids, attention_mask=mask, position_ids=pos)
        
        return rep.cpu() if to_cpu else rep
    
    def encode_queries(self, queries: List[str], batch_size: Optional[int] = None, to_cpu: bool = False):
        """
        Encode queries for ColBERT retrieval
        
        Args:
            queries: List of query strings
            batch_size: Batch size for processing (None for single batch)
            to_cpu: Whether to move results to CPU
            
        Returns:
            Query representations tensor
        """
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")
        
        print(f"Encoding {len(queries)} queries...")
        
        # Tokenize with query prefix
        enc = self.tokenizer(queries, add_special_tokens=True, truncation=False)
        id_lists = [[self.Q_PID] + ids for ids in enc["input_ids"]]
        
        # Apply dynamic augmentation with length cap
        cap = self.max_query_len or (self.tokenizer.model_max_length - 1)
        id_lists = [_dynamic_augment(ids, self.tokenizer.mask_token_id, cap) for ids in id_lists]
        
        # Pad sequences
        padded = self.tokenizer.pad({"input_ids": id_lists}, padding=True, return_tensors="pt")
        ids, mask = padded["input_ids"], padded["attention_mask"]
        
        # Process in batches if specified
        if batch_size:
            reps = []
            for i, a in tqdm(_split_into_batches(ids, mask, batch_size), desc="Encoding query batches"):
                reps.append(self._encode_batch(i, a, to_cpu))
            return torch.cat(reps)
        
        return self._encode_batch(ids, mask, to_cpu)
    
    def encode_documents(self, documents: List[str], batch_size: Optional[int] = None, 
                        keep_dims: bool = True, to_cpu: bool = False):
        """
        Encode documents for ColBERT retrieval with post-tokenization punctuation masking
        
        Args:
            documents: List of document strings
            batch_size: Batch size for processing (None for single batch)
            keep_dims: Whether to keep tensor dimensions (True) or return list of variable-length tensors
            to_cpu: Whether to move results to CPU
            
        Returns:
            Document representations tensor or list
        """
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")
        
        print(f"Encoding {len(documents)} documents...")
        
        # Tokenize documents WITHOUT removing punctuation (post-tokenization masking)
        enc = self.tokenizer(documents, add_special_tokens=True, 
                           truncation=True, max_length=self.max_doc_len - 1)
        id_lists = [[self.D_PID] + ids for ids in enc["input_ids"]]
        
        # Pad sequences
        padded = self.tokenizer.pad({"input_ids": id_lists}, padding=True, return_tensors="pt")
        ids, mask = padded["input_ids"], padded["attention_mask"]
        
        # Apply post-tokenization punctuation masking
        mask[torch.isin(ids, torch.tensor(list(self.skip_ids), device=ids.device))] = 0
        
        # Process in batches if specified
        if batch_size:
            ids_s, mask_s, rev = _sort_by_length(ids, mask, batch_size)
            reps = []
            
            for i, a in tqdm(_split_into_batches(ids_s, mask_s, batch_size), desc="Encoding document batches"):
                rep = self._encode_batch(i, a, to_cpu)
                if not keep_dims:
                    # Convert to list of variable-length tensors
                    m = a.cpu().bool() if to_cpu else a.bool()
                    rep = [r[m[idx]] for idx, r in enumerate(rep)]
                reps.append(rep)
            
            if keep_dims:
                return _stack_3D_tensors(reps)[rev]
            else:
                # Flatten and reorder
                flat = [d for g in reps for d in g]
                return [flat[i] for i in rev.tolist()]
        
        # Single batch processing
        rep = self._encode_batch(ids, mask, to_cpu)
        if not keep_dims:
            m = mask.cpu().bool() if to_cpu else mask.bool()
            rep = [r[m[idx]] for idx, r in enumerate(rep)]
        
        return rep
    
    @staticmethod
    def compute_similarity(q_reps: torch.Tensor, p_reps: torch.Tensor):
        """
        Compute ColBERT-style max similarity between queries and passages
        
        Args:
            q_reps: Query representations [num_queries, max_q_len, dim]
            p_reps: Passage representations [num_passages, max_p_len, dim]
            
        Returns:
            Similarity scores [num_queries, num_passages]
        """
        token_scores = torch.einsum("qin,pjn->qipj", q_reps, p_reps)
        scores, _ = token_scores.max(-1)
        scores = scores.sum(1)
        return scores
    
    def search(self, queries: List[str], documents: List[str], 
               batch_size: Optional[int] = None, return_scores: bool = True):
        """
        End-to-end search: encode queries and documents, compute similarities
        
        Args:
            queries: List of query strings
            documents: List of document strings
            batch_size: Batch size for encoding
            return_scores: Whether to return similarity scores
            
        Returns:
            If return_scores=True: (scores, query_reps, doc_reps)
            If return_scores=False: (query_reps, doc_reps)
        """
        # Encode queries and documents
        q_reps = self.encode_queries(queries, batch_size=batch_size, to_cpu=True)
        p_reps = self.encode_documents(documents, batch_size=batch_size, to_cpu=True)
        
        if return_scores:
            # Compute similarities
            print("Computing similarities...")
            scores = self.compute_similarity(q_reps, p_reps)
            return scores, q_reps, p_reps
        
        return q_reps, p_reps
    
    def rank_documents(self, query: str, documents: List[str], top_k: int = 10):
        """
        Rank documents for a single query
        
        Args:
            query: Query string
            documents: List of document strings
            top_k: Number of top results to return
            
        Returns:
            List of (document_index, score, document_text) tuples
        """
        scores, _, _ = self.search([query], documents, return_scores=True)
        scores = scores.squeeze(0)  # Remove query dimension
        
        # Get top-k results
        top_indices = torch.topk(scores, min(top_k, len(documents))).indices
        
        results = []
        for idx in top_indices:
            results.append((idx.item(), scores[idx].item(), documents[idx.item()]))
        
        return results



# ---------------------------------------------------------------------------
# Helper Functions
# ---------------------------------------------------------------------------

def _split_into_batches(ids: torch.Tensor, mask: torch.Tensor, bsize: int):
    return [(ids[i:i + bsize], mask[i:i + bsize])
            for i in range(0, ids.size(0), bsize)]

def _sort_by_length(ids: torch.Tensor, mask: torch.Tensor, bsize: int):
    if ids.size(0) <= bsize:
        return ids, mask, torch.arange(ids.size(0))
    
    lengths = mask.sum(-1)
    order = lengths.sort().indices
    reverse = order.sort().indices
    return ids[order], mask[order], reverse

def _dynamic_augment(ids: List[int], mask_id: int, max_cap: int = None) -> List[int]:
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]
    
    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)
    if target - q_len < 8:
        target = q_len + 8
    if max_cap is not None:
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)

def _stack_3D_tensors(groups):
    bsize = sum(x.size(0) for x in groups)
    maxlen = max(x.size(1) for x in groups)
    hdim = groups[0].size(2)
    out = torch.zeros(bsize, maxlen, hdim, device=groups[0].device, dtype=groups[0].dtype)
    ptr = 0
    for g in groups:
        out[ptr:ptr + g.size(0), :g.size(1)] = g
        ptr += g.size(0)
    return out
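
To make the scoring in `compute_similarity` concrete, here is a minimal pure-Python sketch of the same late-interaction (MaxSim) computation on toy vectors: for each query token take the best dot product over the document's tokens, then sum over query tokens. The names `dot` and `maxsim` and the 2-dimensional vectors are illustrative assumptions, not part of the model code; the real projection head outputs 128 dimensions.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum over query tokens of the best-matching document token score."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-d token embeddings (hypothetical values chosen for easy arithmetic).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has an exact match for both query tokens
doc_b = [[0.5, 0.5]]                          # only a partial match for each

score_a = maxsim(query, doc_a)  # 1.0 + 1.0 = 2.0
score_b = maxsim(query, doc_b)  # 0.5 + 0.5 = 1.0
print(score_a, score_b)
```

In the model code above, padded and punctuation-masked document positions come out as zero vectors (the attention mask is applied before the bias-free projection), so they score 0 against every query token and only win the max when all real tokens score below zero.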

Click here for inference code using ONNX

Run the second snippet below (the ONNXColBERT class definition and its helper functions) before running the first one, which only shows usage.

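# Both paths are read from the local filesystem (the tokenizer files are opened
# directly and the .onnx file is handed to ONNX Runtime), so point them at a
# local copy of the model repository.
model_path = "prithivida/modern_colbert_base_en_v1"
onnx_model_path = "prithivida/modern_colbert_base_en_v1/onnx/model.onnx"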

# Load ONNX model for inference using the standalone tokenizer path
onnx_colbert = ONNXColBERT(onnx_model_path, model_path , max_query_len=32, max_doc_len=300) # Pass model_path as tokenizer_path

# Test inference
queries = [
        "How does deep learning work?",
        "What is machine learning?",
        "What are neural networks?"
    ]

documents = [
    "Machine learning is the idea of approximating a real world phenomenon using data, the approximation can be mathematical or otherwise.",
    "Deep learning uses neural networks with multiple layers to process data.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "Artificial intelligence encompasses machine learning and deep learning.",
]

# Encode and find similarity
print("\n=== ONNX Encode and Compute similarity ===")
q_reps = onnx_colbert.encode_queries(queries, batch_size=4, to_cpu=True)
p_reps = onnx_colbert.encode_documents(documents, batch_size=4, to_cpu=True)
scores = onnx_colbert.compute_similarity(q_reps, p_reps)


# or Test single query ranking
print("\n=== ONNX Standalone Single Query Ranking ===")
query = "How does deep learning work?"
results = onnx_colbert.rank_documents(query, documents, top_k=3)

print(f"Query: {query}")
for i, (doc_idx, score, doc_text) in enumerate(results):
    print(f"  {i+1}. Score: {score:.4f} | Doc: {doc_text}")


import numpy as np
import onnxruntime as ort
from tokenizers import AddedToken, Tokenizer
import json
import string
from pathlib import Path
from typing import List, Optional, Tuple, Union
from tqdm import tqdm


# ---------------------------------------------------------------------------
# ONNX ColBERT Class
# ---------------------------------------------------------------------------

class ONNXColBERT:
    def __init__(self, onnx_model_path: str, tokenizer_path: str,
                 max_query_len: int = 256, max_doc_len: int = 300,
                 providers: Optional[List[str]] = None):
        """
        ONNX ColBERT - identical to PyTorch ColBERT.load_for_inference()
        
        Args:
            onnx_model_path: Path to the ONNX model file
            tokenizer_path: Path to the tokenizer directory
            max_query_len: Maximum query length
            max_doc_len: Maximum document length
            providers: ONNX Runtime providers
        """
        # Load standalone tokenizer
        self.model_dir = Path(tokenizer_path)
        self.tokenizer = self._get_tokenizer(max_length=512)
        self.max_query_len = max_query_len
        self.max_doc_len = max_doc_len
        
        # Setup inference configuration
        self.Q_PID = self.tokenizer.token_to_id("[unused0]")
        self.D_PID = self.tokenizer.token_to_id("[unused1]")
        self.mask_token_id = self.tokenizer.token_to_id("[MASK]")
        
        if None in [self.Q_PID, self.D_PID, self.mask_token_id]:
            raise ValueError("Could not find required special tokens in tokenizer")
        
        # Setup post-tokenization punctuation masking
        self.skip_ids = set()
        for c in string.punctuation:
            encoded = self.tokenizer.encode(c, add_special_tokens=False)
            if len(encoded.ids) > 0:
                self.skip_ids.add(encoded.ids[0])
        
        print(f"Identified {len(self.skip_ids)} punctuation token IDs to skip")
        
        # Initialize ONNX Runtime session
        if providers is None:
            providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
        
        self.session = ort.InferenceSession(onnx_model_path, providers=providers)
        print(f"✅ ONNX ColBERT loaded with providers: {self.session.get_providers()}")
        print(f"Query max length: {max_query_len}, Document max length: {max_doc_len}")

    def _get_tokenizer(self, max_length: int = 512) -> Tokenizer:
        """Initialize tokenizer"""
        with open(str(self.model_dir / "config.json")) as config_file:
            config = json.load(config_file)
        with open(str(self.model_dir / "tokenizer_config.json")) as tokenizer_config_file:
            tokenizer_config = json.load(tokenizer_config_file)
        with open(str(self.model_dir / "special_tokens_map.json")) as tokens_map_file:
            tokens_map = json.load(tokens_map_file)
        
        tokenizer = Tokenizer.from_file(str(self.model_dir / "tokenizer.json"))
        tokenizer.enable_truncation(max_length=min(tokenizer_config["model_max_length"], max_length))
        tokenizer.enable_padding(pad_id=config["pad_token_id"], pad_token=tokenizer_config["pad_token"])
        
        for token in tokens_map.values():
            if isinstance(token, str):
                tokenizer.add_special_tokens([token])
            elif isinstance(token, dict):
                tokenizer.add_special_tokens([AddedToken(**token)])
        
        return tokenizer

    def _encode_batch(self, ids: np.ndarray, mask: np.ndarray, to_cpu: bool = False) -> np.ndarray:
        """Internal encoding function"""
        # Create position IDs
        pos = np.arange(ids.shape[1])[None, :].repeat(ids.shape[0], axis=0)
        
        # ONNX inference
        inputs = {
            "input_ids": ids.astype(np.int64),
            "attention_mask": mask.astype(np.int64),
            "position_ids": pos.astype(np.int64)
        }
        
        outputs = self.session.run(["last_hidden_state"], inputs)
        return outputs[0]

    def encode_queries(self, queries: List[str], batch_size: Optional[int] = None, 
                      to_cpu: bool = False) -> np.ndarray:
        """Encode queries - IDENTICAL to PyTorch ColBERT.encode_queries()"""
        print(f"Encoding {len(queries)} queries...")
        
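        # Tokenize with query prefix. Encode one query at a time: padding is enabled
        # on this tokenizer, so encode_batch would pad every query to the longest one
        # and those pad ids would then be treated as real tokens.
        encoded_queries = [self.tokenizer.encode(q, add_special_tokens=True) for q in queries]
        id_lists = [[self.Q_PID] + encoded.ids for encoded in encoded_queries]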
        
        # Apply dynamic augmentation with length cap
        cap = self.max_query_len or 511
        id_lists = [_dynamic_augment(ids, self.mask_token_id, cap) for ids in id_lists]
        
        # Manual padding
        max_len = max(len(ids) for ids in id_lists)
        batch_size_actual = len(id_lists)
        
        ids = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        mask = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        
        for i, id_list in enumerate(id_lists):
            ids[i, :len(id_list)] = id_list
            mask[i, :len(id_list)] = 1
        
        # Process in batches if specified
        if batch_size:
            reps = []
            for i, a in tqdm(_split_into_batches(ids, mask, batch_size), desc="Encoding query batches"):
                reps.append(self._encode_batch(i, a, to_cpu))
            return np.concatenate(reps, axis=0)
        
        return self._encode_batch(ids, mask, to_cpu)

    def encode_documents(self, documents: List[str], batch_size: Optional[int] = None,
                        keep_dims: bool = True, to_cpu: bool = False) -> Union[np.ndarray, List[np.ndarray]]:
        """Encode documents - IDENTICAL to PyTorch ColBERT.encode_documents()"""
        print(f"Encoding {len(documents)} documents...")
        
        # Encode documents individually to preserve natural lengths
        encoded_docs = []
        for doc in documents:
            encoded = self.tokenizer.encode(doc, add_special_tokens=True)
            encoded_docs.append(encoded)
        
        id_lists = []
        for encoded in encoded_docs:
            ids = encoded.ids
            # Truncate to max_doc_len - 1
            if len(ids) > self.max_doc_len - 1:
                ids = ids[:self.max_doc_len - 1]
            # Add D_PID prefix
            ids = [self.D_PID] + ids
            id_lists.append(ids)
        
        # Manual padding
        max_len = max(len(ids) for ids in id_lists)
        batch_size_actual = len(id_lists)
        
        ids = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        mask = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        
        for i, id_list in enumerate(id_lists):
            ids[i, :len(id_list)] = id_list
            mask[i, :len(id_list)] = 1
        
        # Apply post-tokenization punctuation masking
        for skip_id in self.skip_ids:
            mask[ids == skip_id] = 0
        
        # Process in batches if specified
        if batch_size:
            ids_s, mask_s, rev = _sort_by_length(ids, mask, batch_size)
            reps = []
            
            for i, a in tqdm(_split_into_batches(ids_s, mask_s, batch_size), desc="Encoding document batches"):
                rep = self._encode_batch(i, a, to_cpu)
                if not keep_dims:
                    m = a.astype(bool)
                    rep = [r[m[idx]] for idx, r in enumerate(rep)]
                reps.append(rep)
            
            if keep_dims:
                return _stack_3D_arrays(reps)[rev]
            else:
                flat = [d for g in reps for d in g]
                return [flat[i] for i in rev.tolist()]
        
        # Single batch processing
        rep = self._encode_batch(ids, mask, to_cpu)
        if not keep_dims:
            m = mask.astype(bool)
            rep = [r[m[idx]] for idx, r in enumerate(rep)]
        
        return rep

    @staticmethod
    def compute_similarity(q_reps: np.ndarray, p_reps: np.ndarray) -> np.ndarray:
        """Compute ColBERT similarity - IDENTICAL to PyTorch version"""
        # Identical to PyTorch: torch.einsum("qin,pjn->qipj", q_reps, p_reps)
        token_scores = np.einsum("qin,pjn->qipj", q_reps, p_reps)
        
        # Identical to PyTorch: scores, _ = token_scores.max(-1)
        scores = np.max(token_scores, axis=-1)
        
        # Identical to PyTorch: scores = scores.sum(1)
        scores = np.sum(scores, axis=1)
        
        return scores

    def search(self, queries: List[str], documents: List[str],
               batch_size: Optional[int] = None, return_scores: bool = True):
        """End-to-end search - IDENTICAL to PyTorch ColBERT.search()"""
        # Encode queries and documents
        q_reps = self.encode_queries(queries, batch_size=batch_size, to_cpu=True)
        p_reps = self.encode_documents(documents, batch_size=batch_size, to_cpu=True)
        
        if return_scores:
            # Compute similarities
            print("Computing similarities...")
            scores = self.compute_similarity(q_reps, p_reps)
            return scores, q_reps, p_reps
        
        return q_reps, p_reps

    def rank_documents(self, query: str, documents: List[str], top_k: int = 10) -> List[Tuple]:
        """Rank documents - IDENTICAL to PyTorch ColBERT.rank_documents()"""
        scores, _, _ = self.search([query], documents, return_scores=True)
        scores = scores.squeeze(0)
        
        # Get top-k results
        top_indices = np.argsort(scores)[::-1][:min(top_k, len(documents))]
        
        results = []
        for idx in top_indices:
            results.append((int(idx), float(scores[idx]), documents[idx]))
        
        return results



# ---------------------------------------------------------------------------
# Helper Functions (NumPy versions)
# ---------------------------------------------------------------------------

def _split_into_batches(ids: np.ndarray, mask: np.ndarray, bsize: int):
    return [(ids[i:i + bsize], mask[i:i + bsize])
            for i in range(0, ids.shape[0], bsize)]

def _sort_by_length(ids: np.ndarray, mask: np.ndarray, bsize: int):
    if ids.shape[0] <= bsize:
        return ids, mask, np.arange(ids.shape[0])
    
    lengths = mask.sum(-1)
    order = np.argsort(lengths)
    reverse = np.argsort(order)
    return ids[order], mask[order], reverse

def _dynamic_augment(ids: List[int], mask_id: int, max_cap: int = None) -> List[int]:
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]
    
    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)
    if target - q_len < 8:
        target = q_len + 8
    if max_cap is not None:
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)

def _stack_3D_arrays(groups):
    bsize = sum(x.shape[0] for x in groups)
    maxlen = max(x.shape[1] for x in groups)
    hdim = groups[0].shape[2]
    out = np.zeros((bsize, maxlen, hdim), dtype=groups[0].dtype)
    ptr = 0
    for g in groups:
        out[ptr:ptr + g.shape[0], :g.shape[1]] = g
        ptr += g.shape[0]
    return out
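
The query-side [MASK] padding rule in `_dynamic_augment` is easy to misread, so here is a small sketch that reproduces only its length arithmetic and prints the resulting query length for a few cases. `augmented_length` is a hypothetical helper written for this illustration; it mirrors the branches of `_dynamic_augment` above but works on lengths instead of token id lists.

```python
from typing import Optional


def augmented_length(q_len: int, max_cap: Optional[int] = None) -> int:
    """Final query length after [MASK] augmentation, mirroring _dynamic_augment."""
    if max_cap is not None and q_len > max_cap:
        return max_cap  # over-long queries are truncated to the cap
    target = max(32, ((q_len + 31) // 32) * 32)  # round up to a multiple of 32
    if target - q_len < 8:
        target = q_len + 8  # aim for at least 8 [MASK] tokens
    if max_cap is not None:
        target = min(target, max_cap)  # the cap always wins
    return target


for q_len, cap in [(10, 32), (30, 32), (40, 32), (30, 64), (33, 64), (60, 64)]:
    print(f"q_len={q_len:>2} cap={cap} -> {augmented_length(q_len, cap)}")
```

With a cap of 32, every query ends up at exactly 32 tokens: short ones are filled with [MASK] and long ones are truncated. With a larger cap, a 30-token query grows to 38 because the rule tries to leave room for at least 8 [MASK] tokens.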

Notes on reproducing

We welcome anyone to reproduce our results. Here are some tips and observations:

Please pay attention to the query length. We tried our best to look at what the original ColBERTv2.0 used and what LightOn AI used, and also spoke to Nils Reimers about the liberty taken in the choice of query lengths.
Note on query length from the ColBERTv2.0 paper:

Unless otherwise stated, we keep the default query maximum sequence length for ColBERTv2 and RocketQAv2, which is 32 tokens. For the ArguAna test in BEIR, as the queries are themselves long documents, we set the maximum query length used by ColBERTv2 and RocketQAv2 to 300. For Climate-FEVER, as the queries are relatively long sentence claims, we set the maximum query length used by ColBERTv2 to 64.

Query lengths used by LightOn AI's PyLate (assuming the open-source code is what they used):

query_len = {
    "quora": 32,
    "climate-fever": 64,
    "nq": 32,
    "msmarco": 32,
    "hotpotqa": 32,
    "nfcorpus": 32,
    "scifact": 48,
    "trec-covid": 48,
    "fiqa": 32,
    "arguana": 64,
    "scidocs": 48,
    "dbpedia-entity": 32,
    "webis-touche2020": 32,
    "fever": 32,
    "cqadupstack/android": 32,
    "cqadupstack/english": 32,
    "cqadupstack/gaming": 32,
    "cqadupstack/gis": 32,
    "cqadupstack/mathematica": 32,
    "cqadupstack/physics": 32,
    "cqadupstack/programmers": 32,
    "cqadupstack/stats": 32,
    "cqadupstack/tex": 32,
    "cqadupstack/unix": 32,
    "cqadupstack/webmasters": 32,
    "cqadupstack/wordpress": 32,
}

This is what OG Nils had to say when I asked why the query length is given so much liberty:

Comparison is always hard...I think query length doesn't skew too much. Retrieval compute scales linear with the number of query tokens. So if people are comfortable to compare models with largely different parameters, comparing different query token lengths would be fine as well
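
The point about compute scaling linearly with query tokens can be illustrated with a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring. The function names, the embedding dimension of 128 and the random vectors are illustrative assumptions, not this model's actual inference code.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """For each query token take its best-matching doc token, then sum over query tokens."""
    sim = query_vecs @ doc_vecs.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

def similarity_ops(n_query_tokens: int, n_doc_tokens: int) -> int:
    """Number of token-to-token dot products needed to score one document."""
    return n_query_tokens * n_doc_tokens

rng = np.random.default_rng(0)
doc = rng.standard_normal((180, 128))  # one document with 180 token vectors (arbitrary sizes)

for n_q in (32, 64, 256):
    query = rng.standard_normal((n_q, 128))
    score = maxsim_score(query, doc)
    print(n_q, similarity_ops(n_q, doc.shape[0]), round(score, 2))
```

Doubling the number of query tokens doubles the number of dot products per scored document, which is the linear scaling mentioned in the quote.
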
We took a balanced view of both choices and borrowed the query length defaults used by LightOn, with the sole exception of ArguAna: instead of the original ColBERT's 300 or LightOn's 64, we used 256.
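
As a rough sketch, the choice described here could be written as a lookup with overrides. The names `DEFAULT_QUERY_LEN`, `LIGHTON_OVERRIDES`, `OUR_OVERRIDES` and `resolve_query_len` are hypothetical, and falling back to 32 for dataset names missing from the table is an assumption.

```python
DEFAULT_QUERY_LEN = 32

# Datasets whose value differs from 32 in the LightOn table above
LIGHTON_OVERRIDES = {
    "climate-fever": 64,
    "scifact": 48,
    "trec-covid": 48,
    "arguana": 64,
    "scidocs": 48,
}

# The one place where we deviate from the LightOn defaults
OUR_OVERRIDES = {"arguana": 256}

def resolve_query_len(dataset: str) -> int:
    """Return the query length for a dataset: our override first, then LightOn's value, then 32."""
    if dataset in OUR_OVERRIDES:
        return OUR_OVERRIDES[dataset]
    return LIGHTON_OVERRIDES.get(dataset, DEFAULT_QUERY_LEN)

for name in ("arguana", "scifact", "cqadupstack/gis"):
    print(name, resolve_query_len(name))
```
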
Nota bene: There may be minor differences in the numbers when reproducing. For instance, BGE-M3 reports an nDCG@10 of 59.3 for MIRACL Hindi and we observed only 58.9. These are not massive differences like those between the reported and reproduced ColBERT-small numbers on some datasets.

Here are our numbers for the full Hindi run on BGE-M3:

{'NDCG@1': 0.49714, 'NDCG@3': 0.5115, 'NDCG@5': 0.53908, 'NDCG@10': 0.58936, 'NDCG@100': 0.6457, 'NDCG@1000': 0.65336}
{'MAP@1': 0.28845, 'MAP@3': 0.42424, 'MAP@5': 0.46455, 'MAP@10': 0.49955, 'MAP@100': 0.51886, 'MAP@1000': 0.51933}
{'Recall@10': 0.73032, 'Recall@50': 0.8987, 'Recall@100': 0.93974, 'Recall@200': 0.95763, 'Recall@500': 0.97813, 'Recall@1000': 0.9902}
{'P@1': 0.49714, 'P@3': 0.33048, 'P@5': 0.24629, 'P@10': 0.15543, 'P@100': 0.0202, 'P@1000': 0.00212}
{'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
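
For readers less familiar with the headline metric, here is a minimal sketch of nDCG@k for a single query. It assumes linear gain with a log2 rank discount; evaluation toolkits can differ in such details, so treat it as an illustration rather than the exact code behind the numbers above. The function name and the toy ids are made up.

```python
import math
from typing import Dict, List

def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; qrels maps doc id to a graded relevance label."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    # Ideal ranking: all judged documents sorted by relevance, best first
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

qrels = {"d1": 1, "d3": 1}      # two relevant documents
ranking = ["d1", "d2", "d3"]    # relevant documents at ranks 1 and 3
print(round(ndcg_at_k(ranking, qrels), 4))
```

The benchmark score is the mean of this value over all queries.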

We made sure all quirks and known BEIR ColBERT issues are taken care of:
- ArguAna and Quora (?) self-match issues
- Will add more - TBA
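
A minimal sketch of handling the self-match issue, assuming results are stored as a `{query_id: {doc_id: score}}` mapping and that a query and the corpus document it was derived from share the same id. Both are assumptions to verify against the dataset at hand, and `drop_self_matches` is a hypothetical helper, not the code used for the reported numbers.

```python
from typing import Dict

Results = Dict[str, Dict[str, float]]

def drop_self_matches(results: Results) -> Results:
    """Remove, for every query, the retrieved document whose id equals the query id."""
    return {
        qid: {did: score for did, score in docs.items() if did != qid}
        for qid, docs in results.items()
    }

results = {
    "arg-1": {"arg-1": 0.99, "arg-7": 0.61},  # the query retrieved itself at rank 1
    "arg-2": {"arg-9": 0.58},
}
print(drop_self_matches(results))
```

The filter should run before metrics are computed, otherwise the self-match takes up a rank in the top-k.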

Acknowledgements

Thanks to Alibaba-NLP for Alibaba-NLP/gte-modernbert-base, which is our base model (as used by LightOn AI)
Thanks to Nils Reimers for the tips and inputs.
Thanks to Nandan Thakur for answering questions.
Thanks to Antoine Chaffin and the entire LightOn team for PyLate.
Thanks to the NanoBEIR authors; it's a blessing.
Thanks to Prithivi Da for his generous funding for this work :-)

Open Questions (still have on ColBERT) / thoughts:

People who have worked on ColBERT would agree that MarginMSE loss sucks and KLDiv works great for ColBERT in practice. Is there a formal / mathematical study on why MarginMSE does so badly? (JaColBERT has done some ablations, but I would love to read why.)
What does BERT as an encoder architecture bring that makes it the best choice for ColBERT compared to other encoder architectures?
What temperature choices were used for ColBERT's query and doc scores?
Alibaba-NLP/gte-modernbert-base's BEIR average is 55.33, which beats the best ColBERTs in the world (as of 2nd July 2025), so calling single-vec models naive is naive.
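
For reference, the two distillation losses in the first question are usually written as below. These are the common textbook forms, not necessarily the exact variants used to train any model mentioned here. The symbols are notation introduced for this note: $s$ is the student (ColBERT) score, $t$ the teacher score, $d^+$ and $d^-$ a positive and a negative document, $d_1,\dots,d_n$ the candidates for a query, and $\tau$ the temperature that the third question asks about.

```latex
% MarginMSE: match the teacher's score margin between a positive and a negative
\mathcal{L}_{\text{MarginMSE}}
  = \Big( \big[ s(q, d^{+}) - s(q, d^{-}) \big]
        - \big[ t(q, d^{+}) - t(q, d^{-}) \big] \Big)^{2}

% KL divergence: match the teacher's distribution over all n candidates
p_i = \frac{\exp\!\big(t(q, d_i)/\tau\big)}{\sum_{j=1}^{n} \exp\!\big(t(q, d_j)/\tau\big)},
\qquad
\hat{p}_i = \frac{\exp\!\big(s(q, d_i)/\tau\big)}{\sum_{j=1}^{n} \exp\!\big(s(q, d_j)/\tau\big)}

\mathcal{L}_{\text{KL}} = \sum_{i=1}^{n} p_i \log \frac{p_i}{\hat{p}_i}
```

MarginMSE constrains only the score gap of one positive/negative pair at a time, while the KL form compares whole distributions over the candidate list.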

Wishlist

When I can spend more GPU budget:
- I would love to try to reproduce LightOn AI's GTE-ModernColBERT BEIR eval numbers.
- I would run the eval for prithivida/modern_colbert_base_en_v1 on a long-document benchmark.
