---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:21769
- loss:MultipleNegativesRankingLoss
base_model: Lajavaness/bilingual-embedding-large
widget:
- source_sentence: 'Go bobO..... Take this one! Slap in the face! MAP Leda Beck 5h- A Spanish biologist, researcher: - You pay 1 million euros a month to a football player and 1,800 euros for a Biology researcher. Now you want one treatment. Will you ask Cristiano Ronaldo or to Messi and they will find a cure for you.'
  sentences:
  - Treat stroke using a needle Doctors warn against ‘dangerously misleading’ posts claiming you can treat a stroke with a needle
  - This Spanish biologist said that Cristiano Ronaldo and Messi must find a cure for the new coronavirus as they earn much more than scientists The woman in these posts is a Spanish politician who has made no statements about Messi, CR7, or the cure for COVID-19.
  - The Simpsons predicted 2022 Canada trucker protests Footage from The Simpsons was edited to look like the show predicted Canada’s Covid truckers protest
- source_sentence: 'This is what the rivers of Colombia are becoming... Disappeared people floating in the rivers, killed by Duque''s assassins This is what the rivers of Cali are becoming In transportation of our dead at hands of duqueh''s henchmen: 242'
  sentences:
  - A photograph shows bodies floated in a river in Cali The photo of black bags in a river is a tribute to those killed in the protests in Colombia
  - Sri Lankan doctor created COVID-19 rapid test kits The doctor interviewed in this report did not say he was involved in the development of COVID-19 test kits
  - Masks are meant to protect the vaccinated Face mask requirements aim to protect unvaccinated people
- source_sentence: 'How can you say it proudly that you are leaders of SA... When it looks like a dumping site nd you living high lavishly life in your porsh houses built out of hard earned Tax payers money? CRY SA OUR BELOVED COUNTRY '
  sentences:
  - Donald Trump next to a stack of declassified files Trump did not pose in this photo with declassified files, but with federal regulations
  - Images show trash-strewn streets in South Africa These photos of messy Johannesburg streets are old and taken out of context
  - SBT will again air the program "A Semana do Presidente" with Bolsonaro There is no forecast for the return of "A Semana do Presidente" on SBT, despite a project in 2020
- source_sentence: 'First photos of Earth sent by India''s Chadrayan2 space mission. Breathtaking. '
  sentences:
  - This nest of bats is the source of the coronavirus in Wuhan China The video of a roof infested with bats was recorded in 2011 in the United States
  - Australia recalled 50 million doses of Covid vaccine No, Australia has not recalled 50 million doses of a Covid vaccine
  - First photos of Earth sent by India's Chadrayan2 space mission These alleged photos of Earth have no connection to the Chandrayaan-2 lunar mission.
- source_sentence: Even if you remove JIMENEZ ....... IF THERE IS STILL A SMARTMATIC THAT IS GOOD AT MAGIC THERE IS ALSO NO ..... SMARTMATIC SHOULD BE REMOVED FROM THE COMELEC CONTRACT ..... BECAUSE THE DEMON COMELEC HAS LONG HONORED THE VOTE OF MANY PEOPLE ..... AS LONG AS THERE ARE COMMISSIONERS IN THE COMELEC WHO LOOK LIKE MONEY, WE WILL NOT HAVE A CLEAN ELECTION ....... JUST IMAGINE HOW LONG THE ISSUE SPREADS THAT IF A CANDIDATE WANTS TO WIN, IT WILL PAY THE COMELEC 25 MILLION ???????????????????????????? ? SO ARE THE ELECTION RESULTS HOKOS POKOS ?????????????????????? DEMONS ...... SO ALL THE PUNISHMENT OF HEAVEN HAS BEEN GIVEN IN THE PHILIPPINES BECAUSE TANING LIVES WITH US ...... THE THOUGHT IS PURE MONEY ..... SO EVEN ELECTIONS ARE MONEY ..... ..... 7:08 AM 4G 51% FINALLY, COMELEC OFFICIAL JIMENEZ, REMOVED IN PLACE. BY PRRD AND OTHERS AGAIN THIS. FOR CLEAN NOW ELECTION TO COMING 2022 ELECTION
  sentences:
  - The WHO declared covid-19 an endemic disease Although it considers it probable, the WHO has not yet declared covid-19 an endemic disease
  - Israel, the only country with four vaccines, broke the record for covid-19 cases Israel has not immunized its entire population with 4 doses in January 2022 and the vaccines are effective
  - Philippine President Rodrigo Duterte fired Comelec spokesman James Jimenez in May 2021 Posts misleadingly claim Philippine president fired poll body spokesman
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# SentenceTransformer based on Lajavaness/bilingual-embedding-large

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Lajavaness/bilingual-embedding-large](https://huggingface.co/Lajavaness/bilingual-embedding-large). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [Lajavaness/bilingual-embedding-large](https://huggingface.co/Lajavaness/bilingual-embedding-large)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Even if you remove JIMENEZ ....... IF THERE IS STILL A SMARTMATIC THAT IS GOOD AT MAGIC THERE IS ALSO NO ..... SMARTMATIC SHOULD BE REMOVED FROM THE COMELEC CONTRACT ..... BECAUSE THE DEMON COMELEC HAS LONG HONORED THE VOTE OF MANY PEOPLE ..... AS LONG AS THERE ARE COMMISSIONERS IN THE COMELEC WHO LOOK LIKE MONEY, WE WILL NOT HAVE A CLEAN ELECTION ....... JUST IMAGINE HOW LONG THE ISSUE SPREADS THAT IF A CANDIDATE WANTS TO WIN, IT WILL PAY THE COMELEC 25 MILLION ???????????????????????????? ? SO ARE THE ELECTION RESULTS HOKOS POKOS ?????????????????????? DEMONS ...... SO ALL THE PUNISHMENT OF HEAVEN HAS BEEN GIVEN IN THE PHILIPPINES BECAUSE TANING LIVES WITH US ...... THE THOUGHT IS PURE MONEY ..... SO EVEN ELECTIONS ARE MONEY ..... ..... 7:08 AM 4G 51% FINALLY, COMELEC OFFICIAL JIMENEZ, REMOVED IN PLACE. BY PRRD AND OTHERS AGAIN THIS. FOR CLEAN NOW ELECTION TO COMING 2022 ELECTION',
    'Philippine President Rodrigo Duterte fired Comelec spokesman James Jimenez in May 2021 Posts misleadingly claim Philippine president fired poll body spokesman',
    'The WHO declared covid-19 an endemic disease Although it considers it probable, the WHO has not yet declared covid-19 an endemic disease',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
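Since the training pairs couple social-media claims with the titles of their fact-check articles, a natural downstream use is retrieving the closest fact-check for a new claim. The snippet below is a minimal sketch of that workflow: the small `fact_checks` corpus simply reuses sentences from the widget examples above, and `sentence_transformers_model_id` remains a placeholder for the published model id.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence_transformers_model_id")  # placeholder model id

# Tiny illustrative corpus of fact-check sentences (reused from the widget examples above)
fact_checks = [
    "Treat stroke using a needle Doctors warn against ‘dangerously misleading’ posts claiming you can treat a stroke with a needle",
    "Australia recalled 50 million doses of Covid vaccine No, Australia has not recalled 50 million doses of a Covid vaccine",
    "First photos of Earth sent by India's Chadrayan2 space mission These alleged photos of Earth have no connection to the Chandrayaan-2 lunar mission.",
]
claim = "First photos of Earth sent by India's Chadrayan2 space mission. Breathtaking."

# Embed the claim and the corpus, then rank the fact-checks by cosine similarity.
# The Normalize() module makes the embeddings unit-length, so cosine similarity
# and dot product produce the same ranking.
claim_embedding = model.encode([claim])
corpus_embeddings = model.encode(fact_checks)
scores = model.similarity(claim_embedding, corpus_embeddings)  # shape: [1, 3]
best = scores.argmax().item()
print(fact_checks[best], float(scores[0, best]))
```

For larger corpora the same embeddings can be indexed with any approximate-nearest-neighbour library; nothing about the model itself changes.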
## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 21,769 training samples
* Columns: `sentence_0` and `sentence_1`
* Approximate statistics based on the first 1000 samples:
  |         | sentence_0 | sentence_1 |
  |:--------|:-----------|:-----------|
  | type    | string     | string     |
  | details |            |            |
* Samples:
  | sentence_0 | sentence_1 |
  |:-----------|:-----------|
  | "On January 1, 1979 New York billionaire Brandon Torrent allowed himself to be photographed while urinating on a homeless man sleeping on the street. This image explains, better than many words, the division of the world into social classes that we must eliminate . Meanwhile, in 21st century Brazil, many 'good citizens', just above the homeless condition, applaud politicians and politicians* who support the predatory elite represented by this abject and unworthy human being, who urinates on people who, in the final analysis, are the builders of the fortune he enjoys. Until we realize which side of this stream of urine we are on, we will not be able to build a truly just society. Class consciousness is the true and most urgent education." | This photo shows a billionaire named Brandon Torrent urinating on a homeless man The real story behind the image of a man who appears to urinate on a homeless person |
  | French secret service officer jean claude returns from his mission as imam with deash (isis) like others from several countries in Syria.. there are questions | This man is a French intelligence officer No, this man is not a French intelligence officer |
  | Oh yes! Rohit Sharma Mumbai Indians Burj Khalifa DIEL 82 SAMSUNG MUMBAI INDIANS | Dubai’s Burj Khalifa skyscraper displays photo of Indian cricketer Rohit Sharma This image of the Burj Khalifa has been doctored – the original does not show a projection of Indian cricketer Rohit Sharma |
* Loss: [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```

### Training Hyperparameters

#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 2
- `per_device_eval_batch_size`: 2
- `num_train_epochs`: 1
- `multi_dataset_batch_sampler`: round_robin

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 2
- `per_device_eval_batch_size`: 2
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin

</details>
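For reference, the loss and the non-default hyperparameters listed above map onto the `SentenceTransformerTrainer` API roughly as sketched below. This is a minimal sketch, not the exact training script: the two-pair dataset stands in for the unnamed 21,769-pair training set, `output` is a hypothetical output directory, and `trust_remote_code=True` is an assumption about loading the base model's custom `BilingualModel` code.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# Start from the base model; trust_remote_code is assumed to be needed for the custom BilingualModel code.
model = SentenceTransformer("Lajavaness/bilingual-embedding-large", trust_remote_code=True)

# Toy stand-in for the 21,769 (claim, fact-check) pairs with columns sentence_0 / sentence_1.
train_dataset = Dataset.from_dict({
    "sentence_0": [
        "First photos of Earth sent by India's Chadrayan2 space mission. Breathtaking.",
        "Oh yes! Rohit Sharma Mumbai Indians Burj Khalifa DIEL 82 SAMSUNG MUMBAI INDIANS",
    ],
    "sentence_1": [
        "First photos of Earth sent by India's Chadrayan2 space mission These alleged photos of Earth have no connection to the Chandrayaan-2 lunar mission.",
        "Dubai’s Burj Khalifa skyscraper displays photo of Indian cricketer Rohit Sharma This image of the Burj Khalifa has been doctored – the original does not show a projection of Indian cricketer Rohit Sharma",
    ],
})

# MultipleNegativesRankingLoss with scale=20.0 and cosine similarity, as in the parameters above;
# every other sentence_1 in a batch serves as an in-batch negative for a given sentence_0.
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # hypothetical output path
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    multi_dataset_batch_sampler="round_robin",  # only relevant when training on multiple datasets
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

With `per_device_train_batch_size=2`, each pair sees only one in-batch negative per step; MultipleNegativesRankingLoss generally benefits from larger batches, which is worth keeping in mind when reading the training logs below.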
### Training Logs

| Epoch  | Step  | Training Loss |
|:------:|:-----:|:-------------:|
| 0.0459 | 500   | 0.0329        |
| 0.0919 | 1000  | 0.0296        |
| 0.1378 | 1500  | 0.0314        |
| 0.1837 | 2000  | 0.0199        |
| 0.2297 | 2500  | 0.0435        |
| 0.2756 | 3000  | 0.0213        |
| 0.3215 | 3500  | 0.0293        |
| 0.3675 | 4000  | 0.0387        |
| 0.4134 | 4500  | 0.0064        |
| 0.4593 | 5000  | 0.0338        |
| 0.5053 | 5500  | 0.0317        |
| 0.5512 | 6000  | 0.0395        |
| 0.5972 | 6500  | 0.0129        |
| 0.6431 | 7000  | 0.036         |
| 0.6890 | 7500  | 0.0292        |
| 0.7350 | 8000  | 0.0215        |
| 0.7809 | 8500  | 0.02          |
| 0.8268 | 9000  | 0.0215        |
| 0.8728 | 9500  | 0.0139        |
| 0.9187 | 10000 | 0.0273        |
| 0.9646 | 10500 | 0.0138        |

### Framework Versions

- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```