SentenceTransformer

This is a sentence-transformers model trained on an unnamed dataset of 185,295 (anchor, positive, 4× negative) samples (see Training Details). It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Maximum Sequence Length: 1024 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
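
Because the model's final module is Normalize(), every embedding is unit-length, so cosine similarity reduces to a plain dot product. A minimal sketch of that equivalence (numpy only; the vectors below are random stand-ins for real embeddings):

import numpy as np

# Random stand-ins for two 1024-dimensional embeddings.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 1024))

# Normalize(), the model's final module, L2-normalizes each embedding.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# For unit vectors, the dot product equals the full cosine formula.
cosine = a @ b
assert np.isclose(cosine, np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))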

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
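
The same stack can be reproduced with plain transformers, which makes each module's role explicit: the Transformer encodes, Pooling keeps only the first (CLS) token since pooling_mode_cls_token is True, and Normalize() L2-normalizes. A sketch, reusing the placeholder model id from the usage example below:

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "sentence_transformers_model_id"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

batch = tokenizer(["An example sentence"], padding=True, truncation=True,
                  max_length=1024, return_tensors="pt")
with torch.no_grad():
    output = model(**batch)

cls = output.last_hidden_state[:, 0]       # (1) Pooling: take the CLS token
embedding = F.normalize(cls, p=2, dim=1)   # (2) Normalize(): unit length
print(embedding.shape)  # torch.Size([1, 1024])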

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'What is the number one industry in the county of Surrey in England?',
    'Surrey\'s most significant source of prosperity in the later Middle Ages was the production of woollen cloth, which emerged during that period as England\'s main export industry. The county was an early centre of English textile manufacturing, benefiting from the presence of deposits of fuller\'s earth, the rare mineral composite important in the process of finishing cloth, around Reigate and Nutfield. The industry in Surrey was focused on Guildford, which gave its name to a variety of cloth, "gilforte", which was exported widely across Europe and the Middle East and imitated by manufacturers elsewhere in Europe. However, as the English cloth industry expanded, Surrey was outstripped by other growing regions of production.\nThough Surrey was not the scene of serious fighting in the various rebellions and civil wars of the period, armies from Kent heading for London via Southwark passed through what were then the extreme north-eastern fringes of Surrey during the Peasants\' Revolt of 1381 and Cade\'s Rebellion in 1450, and at various stages of the Wars of the Roses in 1460, 1469 and 1471. The upheaval of 1381 also involved widespread local unrest in Surrey, as was the case all across south-eastern England, and some recruits from Surrey joined the Kentish rebel army.',
    'Surrey County Cricket Club is one of eighteen first-class county clubs within the domestic cricket structure of England and Wales. It represents the historic county of Surrey and also South London. The club\'s limited overs team is called "Surrey" (unlike most other counties\' teams, it has no official nickname). The club was founded in 1845 but teams representing the county have played top-class cricket since the early 18th century and the club has always held first-class status. Surrey have competed in the County Championship since the official start of the competition in 1890 and have played in every top-level domestic cricket competition in England.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
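
The other use cases from the introduction follow the same pattern. For instance, semantic search over a small corpus can be sketched with the library's util.semantic_search (the corpus and query below are made up for illustration):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence_transformers_model_id")

# Illustrative corpus and query
corpus = [
    "Surrey's main medieval export was woollen cloth.",
    "Surrey County Cricket Club was founded in 1845.",
]
query = "What was the leading industry in medieval Surrey?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine-similarity search; returns the top_k hits for each query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")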

Training Details

Training Dataset

Unnamed Dataset

  • Size: 185,295 training samples
  • Columns: anchor, positive, negative, negative_2, negative_3, and negative_4
  • Approximate statistics based on the first 1000 samples:

                   anchor   positive   negative   negative_2   negative_3   negative_4
    type           string   string     string     string       string       string
    min tokens     6        17         12         22           14           12
    mean tokens    11.81    169.73     181.7      184.76       186.03       179.54
    max tokens     26       986        759        817          859          759
  • Samples (first three rows; several passages are truncated in the source):

    Sample 1
      anchor:     When was quantum field theory developed?
      positive:   The third thread in the development of quantum field theory was the need to handle the statistics of many-particle systems consistently and with ease. In 1927, Pascual Jordan tried to extend the canonical quantization of fields to the many-body wave functions of identical particles using a formalism which is known as statistical transformation theory; this procedure is now sometimes called second quantization. In 1928, Jordan and Eugene Wigner found that the quantum field describing electrons, or other fermions, had to be expanded using anti-commuting creation and annihilation operators due to the Pauli exclusion principle (see Jordan–Wigner transformation). This thread of development was incorporated into many-body theory and strongly influenced condensed matter physics and nuclear physics.
      negative:   The application of the new quantum theory to electromagnetism resulted in quantum field theory, which was developed starting around 1930. Quantum field theory has driven the development of more sophisticated formulations of quantum mechanics, of which the ones presented here are simple special cases.
      negative_2: Two classic text-books from the 1960s, James D. Bjorken, Sidney David Drell, "Relativistic Quantum Mechanics" (1964) and J. J. Sakurai, "Advanced Quantum Mechanics" (1967), thoroughly developed the Feynman graph expansion techniques using physically intuitive and practical methods following from the correspondence principle, without worrying about the technicalities involved in deriving the Feynman rules from the superstructure of quantum field theory itself. Although both Feynman's heuristic and pictorial style of dealing with the infinities, as well as the formal methods of Tomonaga and Schwinger, worked extremely well, and gave spectacularly accurate answers, the true analytical nature of the question of "renormalizability", that is, whether ANY theory formulated as a "quantum field theory" would give finite answers, was not worked-out until much later, when the urgency of trying to formulate finite theories for the strong and electro-weak (and gravitational interactions) demanded i...
      negative_3: It was evident from the beginning that a proper quantum treatment of the electromagnetic field had to somehow incorporate Einstein's relativity theory, which had grown out of the study of classical electromagnetism. This need to put together relativity and quantum mechanics was the second major motivation in the development of quantum field theory. Pascual Jordan and Wolfgang Pauli showed in 1928 that quantum fields could be made to behave in the way predicted by special relativity during coordinate transformations (specifically, they showed that the field commutators were Lorentz invariant). A further boost for quantum field theory came with the discovery of the Dirac equation, which was originally formulated and interpreted as a single-particle equation analogous to the Schrödinger equation, but unlike the Schrödinger equation, the Dirac equation satisfies both the Lorentz invariance, that is, the requirements of special relativity, and the rules of quantum mechanics. The Dirac equa...
      negative_4: Through the works of Born, Heisenberg, and Pascual Jordan in 1925-1926, a quantum theory of the free electromagnetic field (one with no interactions with matter) was developed via canonical quantization by treating the electromagnetic field as a set of quantum harmonic oscillators. With the exclusion of interactions, however, such a theory was yet incapable of making quantitative predictions about the real world.

    Sample 2 (negatives shown in column order)
      anchor:     Was there a year 0?
      positive:   Cassini gave the following reasons for using a year 0:
      negatives:

      Fred Espanak of NASA lists 50 phases of the moon within year 0, showing that it is a full year, not an instant in time. Jean Meeus gives the following explanation:

      Although he used the usual French terms "avant J.-C." (before Jesus Christ) and "après J.-C." (after Jesus Christ) to label years elsewhere in his book, the Byzantine historian Venance Grumel used negative years (identified by a minus sign, −) to label BC years and unsigned positive years to label AD years in a table. He did so possibly to save space and put no year 0 between them.

      Games Def Interceptions Fumbles Sacks & Tackles
      Year Age Tm Pos No. G GS Int Yds TD Lng PD FF Fmb FR Yds TD Sk Tkl Ast Sfty AV
      2004 23 NWE ss 42 13 2 0 0 0 0 2 1 0 1 0 0 15 8 2
      2005 24 IND 36 16 0 1 0 1 0 0 8 2 1
      2006 25 IND ss 36 10 1 0 0 0 0 2 11 0 1
      Career 39 3 0 0 0 0 4 2 0 2 0 0 34 10 4
      2 yrs IND 26 1 0 0 0 0 2 1 0 1 0 0 19 2 2
      1 yr NWE 13 2 0 0 0 0 2 1 0 1 0 0 15 8 2
      After pleading guilty in January 2008 to drug charges in Virginia Beach, VA stemming from a March 2007 incident, Reid was initially sentenced to two years in prison for possessing marijuana with the intent to distribute but had the sentence suspended with the agreement he would stay out of trouble for two years. His license was also suspended for six months and he was ordered to attend drug treatment and counseling.

      This enzyme belongs to the family of oxidoreductases, specifically those acting on paired donors, with O2 as oxidant and incorporation or reduction of oxygen. The oxygen incorporated need not be derived from O2 with 2-oxoglutarate as one donor, and incorporation of one atom of oxygen into each donor. The systematic name of this enzyme class is N6,N6,N6-trimethyl-L-lysine,2-oxoglutarate:oxygen oxidoreductase (3-hydroxylating). Other names in common use include trimethyllysine alpha-ketoglutarate dioxygenase, TML-alpha-ketoglutarate dioxygenase, TML hydroxylase, 6-N,6-N,6-N-trimethyl-L-lysine,2-oxoglutarate:oxygen oxidoreductase, and (3-hydroxylating). This enzyme participates in lysine degradation and L-carnitine biosynthesis and requires the presence of iron and ascorbate.

      ㅜ is one of the Korean hangul. The Unicode for ㅜ is U+315C.

      ㅌ is one of the Korean hangul. The Unicode for ㅌ is U+314C.

    Sample 3
      anchor:     When is the dialectical method used?
      positive:   The Dialect Test was created by A.J. Ellis in February 1879, and was used in the fieldwork for his work "On Early English Pronunciation". It stands as one of the earliest methods of identifying vowel sounds and features of speech. The aim was to capture the main vowel sounds of an individual dialect by listening to the reading of a short passage. All the categories of West Saxon words and vowels were included in the test so that comparisons could be made with the historic West Saxon speech as well as with various other dialects.
      negative:   Karl Popper has attacked the dialectic repeatedly. In 1937, he wrote and delivered a paper entitled "What Is Dialectic?" in which he attacked the dialectical method for its willingness "to put up with contradictions". Popper concluded the essay with these words: "The whole development of dialectic should be a warning against the dangers inherent in philosophical system-building. It should remind us that philosophy should not be made a basis for any sort of scientific system and that philosophers should be much more modest in their claims. One task which they can fulfill quite usefully is the study of the critical methods of science" (Ibid., p. 335).
      negative_2: He was one of the first to apply Labovian methods in Britain with his research in 1970-1 on the speech of Bradford, Halifax and Huddersfield. He concluded that the speech detailed in most of dialectology (e.g. A. J. Ellis, the Survey of English Dialects) had virtually disappeared, having found only one speaker out of his sample of 106 speakers who regularly used dialect. However, he found that differences in speech persisted as an indicator of social class, age and gender. This PhD dissertation was later adapted into a book, "Dialect and Accent in Industrial West Yorkshire". The work was criticised by Graham Shorrocks on the grounds that the sociolinguistic methods used were inappropriate for recording the traditional vernacular and that there was an inadequate basis for comparison with earlier dialect studies in West Yorkshire.
      negative_3: The Institute also attempted to reformulate dialectics as a concrete method. The use of such a dialectical method can be traced back to the philosophy of Hegel, who conceived dialectic as the tendency of a notion to pass over into its own negation as the result of conflict between its inherent contradictory aspects. In opposition to previous modes of thought, which viewed things in abstraction, each by itself and as though endowed with fixed properties, Hegelian dialectic has the ability to consider ideas according to their movement and change in time, as well as according to their interrelations and interactions.
      negative_4: For Marx, dialectics is not a formula for generating predetermined outcomes but is a method for the empirical study of social processes in terms of interrelations, development, and transformation. In his introduction to the Penguin edition of Marx's "Capital", Ernest Mandel writes, "When the dialectical method is applied to the study of economic problems, economic phenomena are not viewed separately from each other, by bits and pieces, but in their inner connection as an integrated totality, structured around, and by, a basic predominant mode of production."
  • Loss: CachedGISTEmbedLoss with these parameters:
    {'guide': SentenceTransformer(
      (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
      (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    ), 'temperature': 0.01}
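
    For reference, this loss can be instantiated directly. A sketch with placeholder ids for both models (GIST-style losses use the guide model's similarities to filter false in-batch negatives, and the Cached variant trades compute for memory so very large batches fit):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss

model = SentenceTransformer("sentence_transformers_model_id")  # model being trained
guide = SentenceTransformer("guide_model_id")                  # placeholder guide id

loss = CachedGISTEmbedLoss(model, guide=guide, temperature=0.01)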
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 1024
  • learning_rate: 3e-05
  • weight_decay: 0.01
  • num_train_epochs: 4
  • warmup_ratio: 0.05
  • bf16: True
  • batch_sampler: no_duplicates
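
Put together, a run with these values might look like the following sketch (train_dataset and output_dir are placeholders; model and loss are as in the loss sketch above):

from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder
    per_device_train_batch_size=1024,
    learning_rate=3e-05,
    weight_decay=0.01,
    num_train_epochs=4,
    warmup_ratio=0.05,
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # no repeated texts within a batch
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: (anchor, positive, negatives) dataset
    loss=loss,
)
trainer.train()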

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 1024
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 3e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: True
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss
0.0222 1 0.2146
0.0444 2 0.2363
0.0667 3 0.2584
0.0889 4 0.2285
0.1111 5 0.253
0.1333 6 0.2225
0.1556 7 0.2419
0.1778 8 0.2026
0.2 9 0.2228
0.2222 10 0.2398
0.2444 11 0.2106
0.2667 12 0.2308
0.2889 13 0.2172
0.3111 14 0.2461
0.3333 15 0.2345
0.3556 16 0.2327
0.3778 17 0.2219
0.4 18 0.2032
0.4222 19 0.218
0.4444 20 0.2465
0.4667 21 0.2164
0.4889 22 0.2262
0.5111 23 0.221
0.5333 24 0.208
0.5556 25 0.2185
0.5778 26 0.2298
0.6 27 0.2241
0.6222 28 0.221
0.6444 29 0.2123
0.6667 30 0.2087
0.6889 31 0.2107
0.7111 32 0.208
0.7333 33 0.2318
0.7556 34 0.2069
0.7778 35 0.2194
0.8 36 0.2229
0.8222 37 0.2015
0.8444 38 0.2093
0.8667 39 0.1916
0.8889 40 0.2128
0.9111 41 0.2105
0.9333 42 0.2227
0.9556 43 0.1904
0.9778 44 0.2101
1.0 45 0.1994
1.0222 46 0.1429
1.0444 47 0.1315
1.0667 48 0.1492
1.0889 49 0.1471
1.1111 50 0.1495
1.1333 51 0.1306
1.1556 52 0.159
1.1778 53 0.1446
1.2 54 0.1347
1.2222 55 0.1391
1.2444 56 0.1302
1.2667 57 0.1326
1.2889 58 0.1392
1.3111 59 0.1306
1.3333 60 0.1396
1.3556 61 0.1512
1.3778 62 0.1355
1.4 63 0.1296
1.4222 64 0.1462
1.4444 65 0.1626
1.4667 66 0.1378
1.4889 67 0.1297
1.5111 68 0.1328
1.5333 69 0.1441
1.5556 70 0.131
1.5778 71 0.1191
1.6 72 0.1459
1.6222 73 0.1333
1.6444 74 0.1342
1.6667 75 0.1335
1.6889 76 0.1398
1.7111 77 0.1335
1.7333 78 0.1335
1.7556 79 0.1315
1.7778 80 0.1436
1.8 81 0.1483
1.8222 82 0.1166
1.8444 83 0.1295
1.8667 84 0.1322
1.8889 85 0.1339
1.9111 86 0.1428
1.9333 87 0.131
1.9556 88 0.1202
1.9778 89 0.119
2.0 90 0.1392
2.0222 91 0.1042
2.0444 92 0.0966
2.0667 93 0.1174
2.0889 94 0.0935
2.1111 95 0.1051
2.1333 96 0.0887
2.1556 97 0.1057
2.1778 98 0.1062
2.2 99 0.1022
2.2222 100 0.1005
2.2444 101 0.1022
2.2667 102 0.0883
2.2889 103 0.0882
2.3111 104 0.0871
2.3333 105 0.1102
2.3556 106 0.0887
2.3778 107 0.0986
2.4 108 0.0955
2.4222 109 0.0975
2.4444 110 0.0956
2.4667 111 0.1062
2.4889 112 0.0961
2.5111 113 0.0956
2.5333 114 0.0945
2.5556 115 0.11
2.5778 116 0.1044
2.6 117 0.0997
2.6222 118 0.0936
2.6444 119 0.0952
2.6667 120 0.1049
2.6889 121 0.0936
2.7111 122 0.0959
2.7333 123 0.0833
2.7556 124 0.0867
2.7778 125 0.1007
2.8 126 0.0913
2.8222 127 0.0885
2.8444 128 0.1055
2.8667 129 0.0952
2.8889 130 0.086
2.9111 131 0.0954
2.9333 132 0.1055
2.9556 133 0.0935
2.9778 134 0.0909
3.0 135 0.0898
3.0222 136 0.0762
3.0444 137 0.0811
3.0667 138 0.0825
3.0889 139 0.0704
3.1111 140 0.0753
3.1333 141 0.0761
3.1556 142 0.0802
3.1778 143 0.0794
3.2 144 0.0693
3.2222 145 0.0782
3.2444 146 0.0669
3.2667 147 0.0842
3.2889 148 0.0789
3.3111 149 0.0718
3.3333 150 0.0807
3.3556 151 0.0773
3.3778 152 0.0766
3.4 153 0.0714
3.4222 154 0.0767
3.4444 155 0.075
3.4667 156 0.0762
3.4889 157 0.0829
3.5111 158 0.0731
3.5333 159 0.0704
3.5556 160 0.0741
3.5778 161 0.0804
3.6 162 0.0744
3.6222 163 0.0715
3.6444 164 0.0737
3.6667 165 0.0709
3.6889 166 0.0765
3.7111 167 0.077
3.7333 168 0.0756
3.7556 169 0.0743
3.7778 170 0.0758
3.8 171 0.0817
3.8222 172 0.0746
3.8444 173 0.0875
3.8667 174 0.0741
3.8889 175 0.0704
3.9111 176 0.0831
3.9333 177 0.079
3.9556 178 0.0777
3.9778 179 0.0763
4.0 180 0.0762

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.1
  • Transformers: 4.49.0
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.4.0
  • Datasets: 2.21.0
  • Tokenizers: 0.21.0
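
To reproduce this environment, the listed versions can be pinned at install time (a sketch; the CUDA 12.4 build of PyTorch may additionally require the PyTorch package index):

pip install sentence-transformers==3.4.1 transformers==4.49.0 torch==2.5.1 accelerate==1.4.0 datasets==2.21.0 tokenizers==0.21.0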

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}