gpahal committed on
Commit 2b34e84 · verified · 1 Parent(s): 69254c7

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -14,6 +14,7 @@
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
+*.onnx.data filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
@@ -33,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,111 @@
----
-license: mit
----
+---
+base_model: BAAI/bge-m3
+license: mit
+tags:
+- feature-extraction
+- sentence-similarity
+- multilingual
+- embedding
+- hybrid-retrieval
+- onnx
+- onnxruntime
+- optimum
+- quantization
+---
+
+This model is an ONNX Runtime, int8-quantized version of [BGE-M3](https://huggingface.co/BAAI/bge-m3).
+
+The model outputs dense, sparse, and ColBERT embedding representations all at once, as a list of NumPy arrays in that order.
+
+Note: the dense and ColBERT embeddings are normalized, matching the default behavior of the original FlagEmbedding library. If you want unnormalized outputs, you can modify the code in `export_onnx_int8.py` and re-run the script.
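+
+For example, to export unnormalized dense embeddings, you would change the corresponding line in `BGEM3InferenceModel.forward` (a sketch of the one-line edit, not a separate API):
+
+```python
+# In export_onnx_int8.py, BGEM3InferenceModel.forward:
+output["dense_vecs"] = dense_vecs  # instead of torch.nn.functional.normalize(dense_vecs, dim=-1)
+```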
+
+This model also has "O2"-level graph optimizations applied; you can read more about optimization levels [here](https://huggingface.co/docs/optimum/en/onnxruntime/usage_guides/optimization). If you want an ONNX model with different optimizations, or with none, re-run the export script `export_onnx_int8.py` with the appropriate optimization argument.
+
+## Usage with ONNX Runtime (Python)
+
+If you haven't already, install the [ONNX Runtime](https://onnxruntime.ai/) Python library. The example below loads the model through Optimum's `ORTModelForCustomTasks`, so the simplest route is the Optimum extra, which pulls in `onnxruntime` as a dependency:
+
+```bash
+pip install optimum[onnxruntime]
+```
+
+For tokenization, you can use HF Transformers, for example:
+
+```bash
+pip install transformers
+```
+
+Clone this repository with [Git LFS](https://git-lfs.com/) to get the ONNX model files.
+
+You can then use the model to compute embeddings, as follows:
+
+```python
+import time
+
+from optimum.onnxruntime import ORTModelForCustomTasks
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
+model = ORTModelForCustomTasks.from_pretrained("gpahal/bge-m3-onnx-int8")
+
+questions = ["What is your opening hour?", "Where are your offices?"]
+input_q = tokenizer(
+    questions,
+    padding=True,
+    truncation=True,
+    return_tensors="np",
+)
+print(f"Question input keys: {list(input_q.keys())}, shapes: {[v.shape for v in input_q.values()]}")
+
+t0 = time.perf_counter()
+output_q = model(**input_q)
+print(f"Time taken: {(time.perf_counter() - t0) * 1e3:.1f} ms")
+```
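+
+The output names follow `BGEM3OnnxConfig.outputs` in `export_onnx_int8.py` (`dense_vecs`, `sparse_vecs`, `colbert_vecs`). A minimal sketch of unpacking them, assuming the example above has just run:
+
+```python
+dense_vecs = output_q["dense_vecs"]      # shape (batch_size, 1024), L2-normalized
+sparse_vecs = output_q["sparse_vecs"]    # shape (batch_size, seq_len, 1), per-token weights
+colbert_vecs = output_q["colbert_vecs"]  # shape (batch_size, seq_len - 1, 1024), normalized
+print(dense_vecs.shape, sparse_vecs.shape, colbert_vecs.shape)
+```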
+
+Note: you can use the following sparse token-weight processor, adapted from FlagEmbedding, to get the same output for the sparse representation from the ONNX model:
+
+```python
+from collections import defaultdict
+
+import numpy as np
+
+
+def process_token_weights(token_weights: np.ndarray, input_ids: list):
+    # Convert to a dict of token id -> max weight, skipping special tokens.
+    result = defaultdict(int)
+    unused_tokens = {
+        tokenizer.cls_token_id,
+        tokenizer.eos_token_id,
+        tokenizer.pad_token_id,
+        tokenizer.unk_token_id,
+    }
+    for w, idx in zip(token_weights, input_ids):
+        if idx not in unused_tokens and w > 0:
+            idx = str(idx)
+            if w > result[idx]:
+                result[idx] = w
+    return result
+
+
+token_weights = output_q["sparse_vecs"].squeeze(-1)
+lexical_weights = list(
+    map(process_token_weights, token_weights, input_q["input_ids"].tolist())
+)
+```
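+
+Given two such weight dicts, you can score lexical overlap between two texts by summing the products of weights for token ids that appear in both, which is how FlagEmbedding computes its lexical matching score. A small helper sketch:
+
+```python
+def lexical_matching_score(weights_1: dict[str, float], weights_2: dict[str, float]) -> float:
+    # Dot product over the token ids the two texts share.
+    return sum(w * weights_2[t] for t, w in weights_1.items() if t in weights_2)
+```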
+
+## Export ONNX weights
+
+You can export ONNX weights with the provided export script `export_onnx_int8.py`, which leverages HF Optimum.
+If needed, you can modify the model configuration, for example to remove embedding normalization or to not output all three embedding representations. If you change the number of output representations, you also need to modify the ONNX output config `BGEM3OnnxConfig` in `export_onnx_int8.py`, as sketched below.
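+
+As an illustration, if you kept only the dense representation, `BGEM3OnnxConfig.outputs` would shrink to a single entry (a sketch against the class as defined in the script):
+
+```python
+@property
+def outputs(self) -> dict[str, dict[int, str]]:
+    # Only the dense output remains; sparse_vecs and colbert_vecs are dropped.
+    return copy.deepcopy(OrderedDict({"dense_vecs": {0: "batch_size", 1: "embedding"}}))
+```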
+
+First, install the needed Python requirements:
+
+```bash
+pip install -r requirements.txt
+```
+
+Then you can export ONNX weights as follows:
+
+```bash
+python export_onnx_int8.py --opset 17 --device cpu --optimize O2
+```
+
+You can read more about the optional optimization levels [here](https://huggingface.co/docs/optimum/en/onnxruntime/usage_guides/optimization).
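+
+After the export finishes, you can sanity-check the quantized model directly with ONNX Runtime. A minimal sketch, assuming the default `--output onnx_model` (the script appends `_int8`, and Optimum saves the quantized file as `model_quantized.onnx`):
+
+```python
+import onnxruntime as ort
+
+session = ort.InferenceSession("onnx_model_int8/model_quantized.onnx")
+print([o.name for o in session.get_outputs()])  # expected: dense_vecs, sparse_vecs, colbert_vecs
+```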
config.json ADDED
@@ -0,0 +1,26 @@
+{
+  "_name_or_path": ".",
+  "architectures": ["XLMRobertaModel"],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-5,
+  "max_position_embeddings": 8194,
+  "model_type": "xlm-roberta",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "output_past": true,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.52.4",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 250002
+}
export_onnx_int8.py ADDED
@@ -0,0 +1,191 @@
+import argparse
+import copy
+import logging
+import os
+from collections import OrderedDict
+
+import torch
+from huggingface_hub import snapshot_download
+from optimum.exporters.onnx import onnx_export_from_model
+from optimum.exporters.onnx.model_configs import XLMRobertaOnnxConfig
+from optimum.exporters.tasks import TasksManager
+from optimum.onnxruntime import ORTQuantizer
+from optimum.onnxruntime.configuration import AutoQuantizationConfig
+from torch import Tensor, nn
+from transformers import AutoConfig, AutoModel
+
+logger = logging.getLogger(__name__)
+
+
+class BGEM3InferenceModel(nn.Module):
+    def __init__(
+        self,
+        model_name: str = "BAAI/bge-m3",
+        colbert_dim: int = -1,
+    ) -> None:
+        super().__init__()
+
+        model_name = snapshot_download(
+            repo_id=model_name,
+            allow_patterns=[
+                "model.safetensors",
+                "colbert_linear.pt",
+                "sparse_linear.pt",
+                "config.json",
+            ],
+        )
+
+        self.config = AutoConfig.from_pretrained(model_name)
+        self.model = AutoModel.from_pretrained(model_name)
+        self.colbert_linear = torch.nn.Linear(
+            in_features=self.model.config.hidden_size,
+            out_features=(
+                self.model.config.hidden_size if colbert_dim == -1 else colbert_dim
+            ),
+        )
+        self.sparse_linear = torch.nn.Linear(
+            in_features=self.model.config.hidden_size, out_features=1
+        )
+        colbert_state_dict = torch.load(
+            os.path.join(model_name, "colbert_linear.pt"), map_location="cpu"
+        )
+        sparse_state_dict = torch.load(
+            os.path.join(model_name, "sparse_linear.pt"), map_location="cpu"
+        )
+        self.colbert_linear.load_state_dict(colbert_state_dict)
+        self.sparse_linear.load_state_dict(sparse_state_dict)
+
+    def dense_embedding(self, last_hidden_state: Tensor) -> Tensor:
+        return last_hidden_state[:, 0]
+
+    def sparse_embedding(self, last_hidden_state: Tensor) -> Tensor:
+        with torch.no_grad():
+            return torch.relu(self.sparse_linear(last_hidden_state))
+
+    def colbert_embedding(
+        self, last_hidden_state: Tensor, attention_mask: Tensor
+    ) -> Tensor:
+        with torch.no_grad():
+            colbert_vecs = self.colbert_linear(last_hidden_state[:, 1:])
+            colbert_vecs = colbert_vecs * attention_mask[:, 1:][:, :, None].float()
+        return colbert_vecs
+
+    def forward(self, input_ids: Tensor, attention_mask: Tensor) -> dict[str, Tensor]:
+        with torch.no_grad():
+            last_hidden_state = self.model(
+                input_ids=input_ids, attention_mask=attention_mask, return_dict=True
+            ).last_hidden_state
+
+        output = {}
+        dense_vecs = self.dense_embedding(last_hidden_state)
+        output["dense_vecs"] = torch.nn.functional.normalize(dense_vecs, dim=-1)
+
+        sparse_vecs = self.sparse_embedding(last_hidden_state)
+        output["sparse_vecs"] = sparse_vecs
+
+        colbert_vecs = self.colbert_embedding(last_hidden_state, attention_mask)
+        output["colbert_vecs"] = torch.nn.functional.normalize(colbert_vecs, dim=-1)
+
+        return output
+
+
+class BGEM3OnnxConfig(XLMRobertaOnnxConfig):
+    @property
+    def outputs(self) -> dict[str, dict[int, str]]:
+        """
+        Dict containing the axis definition of the output tensors to provide to the model.
+
+        Returns:
+            `Dict[str, Dict[int, str]]`: A mapping of each output name to a mapping of axis position to the axes symbolic name.
+        """
+        return copy.deepcopy(
+            OrderedDict(
+                {
+                    "dense_vecs": {0: "batch_size", 1: "embedding"},
+                    "sparse_vecs": {0: "batch_size", 1: "token", 2: "weight"},
+                    "colbert_vecs": {0: "batch_size", 1: "token", 2: "embedding"},
+                }
+            )
+        )
+
+
+def main(output: str, opset: int, device: str, optimize: str, atol: float):
+    model = BGEM3InferenceModel()
+    bgem3_onnx_config = BGEM3OnnxConfig(model.config)
+
+    # Export to ONNX first
+    print("Exporting to ONNX...")
+
+    # Monkey-patch the library inference to return 'transformers'
+    original_infer = TasksManager.infer_library_from_model
+    TasksManager.infer_library_from_model = lambda model: "transformers"
+
+    try:
+        onnx_export_from_model(
+            model,  # Use the full custom model
+            output=output,
+            task="feature-extraction",
+            custom_onnx_configs={"model": bgem3_onnx_config},
+            opset=opset,
+            optimize=optimize,
+            atol=atol,
+            device=device,
+        )
+    finally:
+        # Restore original function
+        TasksManager.infer_library_from_model = original_infer
+    print(f"ONNX model saved to: {output}")
+
+    # Apply quantization
+    print("Quantizing model...")
+    quantizer = ORTQuantizer.from_pretrained(output)
+    qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
+    print("Applying dynamic int8 quantization...")
+    quantized_path = f"{output}_int8"
+    quantizer.quantize(
+        save_dir=quantized_path,
+        quantization_config=qconfig,
+    )
+    print(f"Quantized model saved to: {quantized_path}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--output",
+        type=str,
+        default="onnx_model",
+        help="Path indicating the directory where to store the generated ONNX model.",
+    )
+    parser.add_argument(
+        "--opset",
+        type=int,
+        default=None,
+        help="If specified, ONNX opset version to export the model with. Otherwise, the default opset for the given model architecture will be used.",
+    )
+    parser.add_argument(
+        "--device",
+        type=str,
+        default="cpu",
+        help='The device to use to do the export. Defaults to "cpu".',
+    )
+    parser.add_argument(
+        "--optimize",
+        type=str,
+        default=None,
+        choices=["O1", "O2", "O3", "O4"],
+        help=(
+            "Allows to run ONNX Runtime optimizations directly during the export. Some of these optimizations are specific to ONNX Runtime, and the resulting ONNX will not be usable with other runtimes such as OpenVINO or TensorRT. Possible options:\n"
+            "  - O1: Basic general optimizations\n"
+            "  - O2: Basic and extended general optimizations, transformers-specific fusions\n"
+            "  - O3: Same as O2 with GELU approximation\n"
+            "  - O4: Same as O3 with mixed precision (fp16, GPU-only, requires `--device cuda`)"
+        ),
+    )
+    parser.add_argument(
+        "--atol",
+        type=float,
+        default=None,
+        help="If specified, the absolute difference tolerance when validating the model. Otherwise, the default atol for the model will be used.",
+    )
+    args = parser.parse_args()
+
+    main(args.output, args.opset, args.device, args.optimize, args.atol)
model_quantized.onnx ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:16de7ea1146ca427e14938ec3e9abfdcaff0e6ac76434cd693ac35d761250bcb
+size 569958496
ort_config.json ADDED
@@ -0,0 +1,33 @@
+{
+  "one_external_file": true,
+  "opset": null,
+  "optimization": {},
+  "quantization": {
+    "activations_dtype": "QUInt8",
+    "activations_symmetric": false,
+    "format": "QOperator",
+    "is_static": false,
+    "mode": "IntegerOps",
+    "nodes_to_exclude": [],
+    "nodes_to_quantize": [],
+    "operators_to_quantize": [
+      "Conv",
+      "MatMul",
+      "Attention",
+      "LSTM",
+      "Gather",
+      "Transpose",
+      "EmbedLayerNormalization"
+    ],
+    "per_channel": false,
+    "qdq_add_pair_to_weight": false,
+    "qdq_dedicated_pair": false,
+    "qdq_op_type_per_channel_support_to_axis": {
+      "MatMul": 1
+    },
+    "reduce_range": false,
+    "weights_dtype": "QInt8",
+    "weights_symmetric": true
+  },
+  "use_external_data_format": false
+}
requirements.txt ADDED
@@ -0,0 +1,6 @@
+accelerate==1.8.1
+huggingface-hub==0.33.0
+onnx==1.18.0
+onnxruntime==1.22.0
+optimum==1.26.1
+transformers==4.52.4
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:249df0778f236f6ece390de0de746838ef25b9d6954b68c2ee71249e0a9d8fd4
+size 17082799
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "250001": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 8192,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "XLMRobertaTokenizer",
+  "unk_token": "<unk>"
+}