gpahal/bge-m3-onnx-int8

This model is an ONNX Runtime, int8-quantized version of BGE-M3.

This model outputs dense, sparse, and ColBERT embedding representations in a single forward pass. The output is a list of NumPy arrays in that order: dense, sparse, ColBERT.

Note: the dense and ColBERT embeddings are normalized, matching the default behavior of the original FlagEmbedding library. If you want unnormalized outputs, modify the code in export_onnx_int8.py and re-run the script.
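
Because the dense vectors are L2-normalized, a plain dot product between two of them is already their cosine similarity. The following is a minimal sketch of that property using random stand-in vectors rather than real model outputs; `l2_normalize` is a hypothetical helper introduced only for illustration.

```python
import numpy as np


def l2_normalize(vecs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper for this sketch)."""
    norms = np.linalg.norm(vecs, axis=-1, keepdims=True)
    return vecs / np.maximum(norms, eps)


rng = np.random.default_rng(0)

# Random stand-ins for dense embeddings; the width of 1024 is an assumption
# made for illustration, not something read from the exported model.
a = l2_normalize(rng.normal(size=(2, 1024)))
b = l2_normalize(rng.normal(size=(3, 1024)))

# For unit-norm rows, the dot product equals the cosine similarity.
similarity = a @ b.T  # shape (2, 3)
```

If you re-export the model without normalization, apply a step like `l2_normalize` yourself before comparing vectors this way.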

This model also has "O2" level graph optimizations applied. You can read more about optimization levels here. If you want an ONNX model with a different optimization level, or with no optimizations, re-run the ONNX export script export_onnx_int8.py with the appropriate optimization argument.

Usage with ONNX Runtime (Python)

If you haven't already, you can install the ONNX Runtime Python library:

pip install onnxruntime

For tokenization you can, for example, use HF Transformers. The example below also loads the model through HF Optimum, so install both:

pip install transformers "optimum[onnxruntime]"

Clone this repository with Git LFS to get the ONNX model files.

You can then use the model to compute embeddings, as follows:

import time

from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = ORTModelForCustomTasks.from_pretrained("gpahal/bge-m3-onnx-int8")

questions = ["What is your opening hour?", "Where are your offices?"]
input_q = tokenizer(
    questions,
    padding=True,
    truncation=True,
    return_tensors="np"
)
print(f"Question input keys: {list(input_q.keys())}, shapes: {[v.shape for v in input_q.values()]}")

t0 = time.perf_counter()
output_q = model(**input_q)
print(f"Time taken: {(time.perf_counter()-t0)*1e3:.1f} ms")
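
Given the output order described earlier (dense, sparse, ColBERT), the three representations can be unpacked by position. The sketch below uses a hypothetical helper, `split_outputs`, and zero-filled stand-in arrays whose shapes are assumptions for illustration. With the real model you would pass the three output arrays instead, for example `list(output_q.values())` if `output_q` is dict-like.

```python
from typing import NamedTuple, Sequence

import numpy as np


class BGEM3Outputs(NamedTuple):
    """Named view over the three representations (hypothetical container)."""
    dense: np.ndarray
    sparse: np.ndarray
    colbert: np.ndarray


def split_outputs(outputs: Sequence[np.ndarray]) -> BGEM3Outputs:
    """Unpack outputs assuming the order dense, sparse, ColBERT."""
    if len(outputs) != 3:
        raise ValueError(f"expected 3 output arrays, got {len(outputs)}")
    dense, sparse, colbert = outputs
    return BGEM3Outputs(dense, sparse, colbert)


# Stand-in arrays; these shapes are assumed for the sketch only.
fake_outputs = [
    np.zeros((2, 1024)),     # dense: one vector per input text
    np.zeros((2, 8, 1)),     # sparse: one weight per token
    np.zeros((2, 8, 1024)),  # ColBERT: one vector per token
]
reps = split_outputs(fake_outputs)

# Dense vectors are normalized, so this is a cosine-similarity matrix.
question_similarity = reps.dense @ reps.dense.T
```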

Note: You can use the following sparse token weight processor from FlagEmbedding to get the same output for the sparse representation from the ONNX model:

from collections import defaultdict

import numpy as np


def process_token_weights(token_weights: np.ndarray, input_ids: list):
    # Convert per-token weights to a {token_id: max_weight} dict
    result = defaultdict(int)
    # Special tokens are excluded from the lexical weights
    unused_tokens = {
        tokenizer.cls_token_id,
        tokenizer.eos_token_id,
        tokenizer.pad_token_id,
        tokenizer.unk_token_id,
    }
    for w, idx in zip(token_weights, input_ids):
        if idx not in unused_tokens and w > 0:
            idx = str(idx)
            # Keep the largest weight seen for a repeated token
            if w > result[idx]:
                result[idx] = w
    return result


# Index 1 is the sparse output (order: dense, sparse, ColBERT)
token_weights = output_q[1].squeeze(-1)
lexical_weights = list(
    map(process_token_weights, token_weights, input_q["input_ids"].tolist())
)
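
Once each text has a dictionary of token weights, two texts can be compared by summing the products of the weights for the tokens they share. The following is a minimal sketch of that idea, not FlagEmbedding's own implementation; `lexical_matching_score` and the token ids in the example are hypothetical.

```python
from typing import Mapping


def lexical_matching_score(
    weights_a: Mapping[str, float], weights_b: Mapping[str, float]
) -> float:
    """Sum of weight products over token ids present in both dicts."""
    score = 0.0
    for token, weight in weights_a.items():
        if token in weights_b:
            score += float(weight) * float(weights_b[token])
    return score


# Hypothetical token ids and weights, chosen only for illustration.
query_weights = {"100": 0.5, "200": 0.25}
passage_weights = {"200": 0.5, "300": 1.0}

# Only token "200" is shared: 0.25 * 0.5 = 0.125
score = lexical_matching_score(query_weights, passage_weights)
```

The dictionaries produced by `process_token_weights` use string token ids as keys, so they can be passed to a function like this directly.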

Export ONNX weights

You can export ONNX weights with the provided export script, export_onnx_int8.py, which leverages HF Optimum. If needed, you can modify the model configuration, for example to remove embedding normalization or to output fewer than three embedding representations. If you change the number of output representations, you must also update the ONNX output config BGEM3OnnxConfig in export_onnx_int8.py.

First, install the required Python packages as follows:

pip install -r requirements.txt

Then you can export ONNX weights as follows:

python export_onnx_int8.py --opset 17 --device cpu --optimize O2

You can read more about the optional optimization levels here.
