ImranzamanML committed (verified)
Commit c5c27b3 · 1 Parent(s): 3032f8e

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,124 @@
+ ---
+ language:
+ - de
+ library_name: sentence-transformers
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - loss:MatryoshkaLoss
+ base_model: aari1995/gbert-large-2
+ metrics:
+ - spearman_cosine
+ widget:
+ - source_sentence: Ein Mann übt Boxen
+   sentences:
+   - Ein Affe praktiziert Kampfsportarten.
+   - Eine Person faltet ein Blatt Papier.
+   - Eine Frau geht mit ihrem Hund spazieren.
+ - source_sentence: Zwei Frauen laufen.
+   sentences:
+   - Frauen laufen.
+   - Die Frau prüft die Augen des Mannes.
+   - Ein Mann ist auf einem Dach
+ pipeline_tag: sentence-similarity
+ license: apache-2.0
+ ---
+ 
+ # 🇩🇪 German Semantic V3b 🇩🇪
+ ### (and [**German_Semantic_V3**](https://huggingface.co/aari1995/German_Semantic_V3))
+ 
+ The successors of [German_Semantic_STS_V2](https://huggingface.co/aari1995/German_Semantic_STS_V2) are here and come with loads of cool new features! While [German_Semantic_V3](https://huggingface.co/aari1995/German_Semantic_V3) is really knowledge-heavy, V3b is more focused on performance. Feel free to provide feedback on the model and what you would like to see next.
+ 
+ **Note:** To run this model properly, see "Usage".
+ 
+ # Major updates and USPs:
+ 
+ - **Flexibility:** Trained with flexible sequence length and embedding truncation, the model treats flexibility as a core feature. Smaller dimensions bring only a minor trade-off in quality.
+ - **Sequence length:** Embed up to 8192 tokens (16 times more than V2 and other models).
+ - **Matryoshka Embeddings:** The model is trained for embedding sizes from 1024 down to 64, allowing you to store much smaller embeddings with little quality loss.
+ - **German only:** This model is German-only and has rich cultural knowledge about Germany and German topics. It therefore learns more efficiently thanks to its tokenizer, deals better with shorter queries, and is generally more nuanced in many scenarios.
+ - **Typo and Casing:** This model was trained to be robust against minor typos and casing variations, leading to slightly weaker benchmark performance during training but more robust embeddings.
+ - **Pooling Function:** V3 moves away from mean pooling towards using the CLS token, which generally seems to learn better after the stage-2 pretraining and allows for more flexibility; V3b keeps mean pooling (see the FAQ below and this repository's 1_Pooling config).
+ - **License:** Apache 2.0
+ 
+ 
+ # Usage:
+ 
+ This model has some built-in functionality that is rather hidden. To benefit from it, use this code:
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ matryoshka_dim = 1024  # how large your embeddings should be; choose from: 64, 128, 256, 512, 768, 1024
+ model = SentenceTransformer("aari1995/German_Semantic_V3", trust_remote_code=True, truncate_dim=matryoshka_dim)
+ 
+ # model.truncate_dim = 64     # truncation dimensions can also be changed after loading
+ # model.max_seq_length = 512  # optionally, lower the maximum sequence length if your hardware is limited
+ 
+ # Run inference
+ sentences = [
+     'Eine Flagge weht.',
+     'Die Flagge bewegte sich in der Luft.',
+     'Zwei Personen beobachten das Wasser.',
+ ]
+ 
+ # For FP16 embeddings (half the space, no quality loss)
+ embeddings = model.encode(sentences, convert_to_tensor=True).half()
+ 
+ # For FP32 embeddings (takes more space)
+ # embeddings = model.encode(sentences)
+ 
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ ```
+ 
+ 
+ # FAQ
+ 
+ **Q: Is this model better than V2?**
+ 
+ **A:** In terms of flexibility, this model is better. Performance-wise, it is also better in most of the experiments.
+ 
+ 
+ **Q: What is the difference between V3 and V3b?**
+ 
+ **A:** V3 is slightly worse on benchmarks, while V3b has a knowledge cutoff of 2020, so which model to use really depends on your use case.
+ 
+ If you want peak performance and do not worry too much about recent developments, take this one (V3b).
+ 
+ If you are fine with sacrificing a few points on benchmarks and want the model to know what happened from 2020 on (elections, covid, other cultural events etc.), I'd suggest you use [German_Semantic_V3](https://huggingface.co/aari1995/German_Semantic_V3).
+ 
+ Another noticeable difference is that V3 has a broader cosine-similarity spectrum, ranging from -1 to 1 (though mostly staying above -0.2). V3b, on the other hand, is more aligned with V2, with a similarity spectrum of roughly 0 to 1. Also, V3 uses CLS pooling while V3b uses mean pooling.
+ 
+ **Q: How does the model perform vs. multilingual models?**
+ 
+ **A:** There are really great multilingual models that will be very useful for many use cases. This model shines with its cultural knowledge and its knowledge about German people and behaviour.
+ 
+ 
+ **Q: What is the trade-off when reducing the embedding size?**
+ 
+ **A:** Broadly speaking, when going from 1024 to 512 dimensions, there is very little trade-off (about 1 percent). When going down to 64 dimensions, you may face a decrease of up to 3 percent.
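+ 
+ Because the model is trained with MatryoshkaLoss, you can also store full 1024-dimensional embeddings and truncate them later. A minimal sketch of that pattern (truncate, then re-normalize before cosine comparison; variable names are illustrative):
+ 
+ ```python
+ import torch
+ from sentence_transformers import SentenceTransformer
+ 
+ model = SentenceTransformer("aari1995/German_Semantic_V3b", trust_remote_code=True)
+ 
+ full = model.encode(
+     ["Ein Mann übt Boxen", "Ein Affe praktiziert Kampfsportarten."],
+     convert_to_tensor=True,
+ )
+ 
+ # Truncate the stored 1024-d embeddings to 64 dimensions and re-normalize for cosine similarity
+ small = torch.nn.functional.normalize(full[:, :64], p=2, dim=-1)
+ 
+ print(model.similarity(full, full)[0, 1].item())  # similarity at full dimensionality
+ print((small[0] @ small[1]).item())               # similarity after Matryoshka truncation
+ ```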
+ 
+ 
+ # Evaluation
+ 
+ Storage comparison:
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f3801ab7e583543386217ac/Aa5WzHanj-DXc86AKxpEz.png)
+ 
+ Benchmarks: soon.
+ 
+ # Up next:
+ German_Semantic_V3_Instruct: guiding your embeddings towards self-selected aspects - planned for 2024.
+ 
+ # Thank You and Credits
+ 
+ - To [jinaAI](https://huggingface.co/jinaai) for their BERT implementation that is used, especially ALiBi
+ - To [deepset](https://huggingface.co/deepset) for gbert-large, which is a really great model
+ - To [occiglot](https://huggingface.co/occiglot) and OSCAR for their data used to pre-train the model
+ - To [Tom](https://huggingface.co/tomaarsen), especially for sentence-transformers, and [Björn and Jan from ellamind](https://ellamind.com/de/) for the consultation
+ - To [Meta](https://huggingface.co/facebook) for XNLI, which is used in variations
+ 
+ Idea, training and implementation by Aaron Chibb
config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "_name_or_path": "aari1995/German_Semantic_V3b",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "auto_map": {
+     "AutoConfig": "aari1995/German_Semantic_V3b--configuration_bert.JinaBertConfig",
+     "AutoModel": "aari1995/German_Semantic_V3b--modeling_bert.JinaBertModel",
+     "AutoModelForMaskedLM": "aari1995/German_Semantic_V3b--modeling_bert.JinaBertForMaskedLM",
+     "AutoModelForSequenceClassification": "aari1995/German_Semantic_V3b--modeling_bert.JinaBertForSequenceClassification"
+   },
+   "classifier_dropout": null,
+   "emb_pooler": null,
+   "feed_forward_type": "original",
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 8192,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "alibi",
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 31102
+ }
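The auto_map entries above point to Jina's custom BERT implementation with ALiBi position embeddings ("position_embedding_type": "alibi", 8192 max positions), so loading the backbone directly with transformers requires trust_remote_code. A minimal sketch, assuming you want raw token embeddings rather than the sentence-transformers pipeline:

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True is required because auto_map points to a custom JinaBert implementation
tokenizer = AutoTokenizer.from_pretrained("aari1995/German_Semantic_V3b")
model = AutoModel.from_pretrained("aari1995/German_Semantic_V3b", trust_remote_code=True)

inputs = tokenizer("Ein Mann übt Boxen", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 1024) token embeddings before pooling
```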
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.48.2",
+     "pytorch": "2.5.1+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
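"similarity_fn_name": "cosine" means model.similarity compares L2-normalized embeddings by dot product. A minimal sketch of the equivalent computation (illustrative names, not the library's internal code):

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity, the behaviour selected by similarity_fn_name='cosine'."""
    a = F.normalize(a.float(), p=2, dim=-1)
    b = F.normalize(b.float(), p=2, dim=-1)
    return a @ b.T  # (len(a), len(b)) matrix of scores in [-1, 1]
```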
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8b442c2a8627438b6013b20d642c2b5d3e1d40e78a0a9e4ef03908925a68661a
+ size 1374445424
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
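modules.json declares the two-stage sentence-transformers pipeline: a Transformer module at the repository root followed by the Pooling module in 1_Pooling. A rough sketch of the equivalent manual composition, assuming the files are available locally under ./German_Semantic_V3b and that your sentence-transformers version forwards trust_remote_code through model_args/config_args:

```python
from sentence_transformers import SentenceTransformer, models

# Stage 0: transformer backbone (path "" in modules.json, i.e. the repository root)
word_embedding_model = models.Transformer(
    "./German_Semantic_V3b",
    model_args={"trust_remote_code": True},
    config_args={"trust_remote_code": True},
)
# Stage 1: pooling head as configured in 1_Pooling/config.json (mean-token pooling over 1024 dims)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
embeddings = model.encode(["Zwei Frauen laufen."])
```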
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 8192,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "104": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_len": 9999999999,
+   "max_length": 8192,
+   "model_max_length": 8192,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": false,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
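With model_max_length set to 8192, long inputs are truncated rather than rejected. A quick sketch for checking how many tokens a text actually uses before encoding (model id taken from config.json; the example text is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aari1995/German_Semantic_V3b")

text = "Ein sehr langes deutsches Dokument"  # placeholder text
n_tokens = len(tokenizer(text, add_special_tokens=True)["input_ids"])
print(f"{n_tokens} tokens (limit: {tokenizer.model_max_length})")

# Inputs longer than 8192 tokens are cut off at encoding time when truncation is enabled
encoded = tokenizer(text, truncation=True, max_length=tokenizer.model_max_length)
```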
vocab.txt ADDED
The diff for this file is too large to render. See raw diff