Korean Embedding Models
A collection of embedding models optimized for understanding and representing Korean text.
This is a sentence-transformers model trained on the train_set dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
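The Pooling module uses the CLS token and the final Normalize() step L2-normalizes every embedding, so similarity scores can be computed with either cosine similarity or a plain dot product. A minimal sketch to check this locally (the sample sentence is only a placeholder):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/bge-m3-ko")

# Placeholder sentence; anything up to max_seq_length (8192 tokens) works.
emb = model.encode(["임베딩 정규화 확인용 문장입니다."])

# The Normalize() module makes every embedding unit-length,
# so cosine similarity and dot product give (numerically) the same scores.
print(np.linalg.norm(emb, axis=1))  # ≈ [1.0]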
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("dragonkue/bge-m3-ko")
# Run inference
sentences = [
'수급권자 중 근로 능력이 없는 임산부는 몇 종에 해당하니?',
'내년부터 저소득층 1세 미만 아동의 \n의료비 부담이 더 낮아진다!\n의료급여제도 개요\n□ (목적) 생활유지 능력이 없거나 생활이 어려운 국민들에게 발생하는 질병, 부상, 출산 등에 대해 국가가 의료서비스 제공\n□ (지원대상) 국민기초생활보장 수급권자, 타 법에 의한 수급권자 등\n\n| 구분 | 국민기초생활보장법에 의한 수급권자 | 국민기초생활보장법 이외의 타 법에 의한 수급권자 |\n| --- | --- | --- |\n| 1종 | ○ 국민기초생활보장 수급권자 중 근로능력이 없는 자만으로 구성된 가구 - 18세 미만, 65세 이상 - 4급 이내 장애인 - 임산부, 병역의무이행자 등 | ○ 이재민(재해구호법) ○ 의상자 및 의사자의 유족○ 국내 입양된 18세 미만 아동○ 국가유공자 및 그 유족․가족○ 국가무형문화재 보유자 및 그 가족○ 새터민(북한이탈주민)과 그 가족○ 5․18 민주화운동 관련자 및 그 유가족○ 노숙인 ※ 행려환자 (의료급여법 시행령) |\n| 2종 | ○ 국민기초생활보장 수급권자 중 근로능력이 있는 가구 | - |\n',
'이어 이날 오후 1시30분부터 열릴 예정이던 스노보드 여자 슬로프스타일 예선 경기는 연기를 거듭하다 취소됐다. 조직위는 예선 없이 다음 날 결선에서 참가자 27명이 한번에 경기해 순위를 가리기로 했다.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
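For the semantic search use case mentioned above, the same embeddings can be passed to the semantic_search helper in sentence_transformers.util. A minimal sketch with a hypothetical query and a tiny in-memory corpus (both are placeholders, not benchmark data):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("dragonkue/bge-m3-ko")

# Hypothetical toy corpus and query, for illustration only
corpus = [
    "의료급여 1종은 근로능력이 없는 수급권자 가구가 대상이다.",
    "스노보드 슬로프스타일 예선 경기가 취소되었다.",
]
query = "근로능력이 없는 수급권자는 몇 종 의료급여를 받나요?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Returns one ranked hit list per query; each hit is {'corpus_id', 'score'}
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))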
This benchmark compares Korean embedding models using the AutoRAG Korean embedding benchmark (https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark). The four tables below report retrieval metrics at top-k 1, 3, 5, and 10, respectively.
Top-k 1

Model name | F1 | Recall | Precision | mAP | mRR | NDCG |
---|---|---|---|---|---|---|
paraphrase-multilingual-mpnet-base-v2 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596 |
KoSimCSE-roberta | 0.4298 | 0.4298 | 0.4298 | 0.4298 | 0.4298 | 0.4298 |
Cohere embed-multilingual-v3.0 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596 |
openai ada 002 | 0.4737 | 0.4737 | 0.4737 | 0.4737 | 0.4737 | 0.4737 |
multilingual-e5-large-instruct | 0.4649 | 0.4649 | 0.4649 | 0.4649 | 0.4649 | 0.4649 |
Upstage Embedding | 0.6579 | 0.6579 | 0.6579 | 0.6579 | 0.6579 | 0.6579 |
paraphrase-multilingual-MiniLM-L12-v2 | 0.2982 | 0.2982 | 0.2982 | 0.2982 | 0.2982 | 0.2982 |
openai_embed_3_small | 0.5439 | 0.5439 | 0.5439 | 0.5439 | 0.5439 | 0.5439 |
ko-sroberta-multitask | 0.4211 | 0.4211 | 0.4211 | 0.4211 | 0.4211 | 0.4211 |
openai_embed_3_large | 0.6053 | 0.6053 | 0.6053 | 0.6053 | 0.6053 | 0.6053 |
KU-HIAI-ONTHEIT-large-v1 | 0.7105 | 0.7105 | 0.7105 | 0.7105 | 0.7105 | 0.7105 |
KU-HIAI-ONTHEIT-large-v1.1 | 0.7193 | 0.7193 | 0.7193 | 0.7193 | 0.7193 | 0.7193 |
kf-deberta-multitask | 0.4561 | 0.4561 | 0.4561 | 0.4561 | 0.4561 | 0.4561 |
gte-multilingual-base | 0.5877 | 0.5877 | 0.5877 | 0.5877 | 0.5877 | 0.5877 |
KoE5 | 0.7018 | 0.7018 | 0.7018 | 0.7018 | 0.7018 | 0.7018 |
BGE-m3 | 0.6578 | 0.6578 | 0.6578 | 0.6578 | 0.6578 | 0.6578 |
bge-m3-korean | 0.5351 | 0.5351 | 0.5351 | 0.5351 | 0.5351 | 0.5351 |
BGE-m3-ko | 0.7456 | 0.7456 | 0.7456 | 0.7456 | 0.7456 | 0.7456 |
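The identical values across all columns of the Top-k 1 table are expected rather than a reporting error: the fixed Recall-to-Precision ratios in the later tables (3, 5, and 10) indicate a single gold passage per query, so at k = 1 every metric is 1 for a hit and 0 for a miss, and all averages coincide. Under that same single-gold-passage assumption, the remaining tables obey:

$$\text{Precision@}k = \frac{\text{Recall@}k}{k}, \qquad \text{F1@}k = \frac{2\,\text{Recall@}k}{k+1}$$

For example, BGE-m3-ko at top-k 3 has Recall 0.9035, giving Precision 0.9035 / 3 ≈ 0.3011 and F1 2 × 0.9035 / 4 ≈ 0.4517, matching the Top-k 3 table below.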
Top-k 3

Model name | F1 | Recall | Precision | mAP | mRR | NDCG |
---|---|---|---|---|---|---|
paraphrase-multilingual-mpnet-base-v2 | 0.2368 | 0.4737 | 0.1579 | 0.2032 | 0.2032 | 0.2712 |
KoSimCSE-roberta | 0.3026 | 0.6053 | 0.2018 | 0.2661 | 0.2661 | 0.3515 |
Cohere embed-multilingual-v3.0 | 0.2851 | 0.5702 | 0.1901 | 0.2515 | 0.2515 | 0.3321 |
openai ada 002 | 0.3553 | 0.7105 | 0.2368 | 0.3202 | 0.3202 | 0.4186 |
multilingual-e5-large-instruct | 0.3333 | 0.6667 | 0.2222 | 0.2909 | 0.2909 | 0.3856 |
Upstage Embedding | 0.4211 | 0.8421 | 0.2807 | 0.3509 | 0.3509 | 0.4743 |
paraphrase-multilingual-MiniLM-L12-v2 | 0.2061 | 0.4123 | 0.1374 | 0.1740 | 0.1740 | 0.2340 |
openai_embed_3_small | 0.3640 | 0.7281 | 0.2427 | 0.3026 | 0.3026 | 0.4097 |
ko-sroberta-multitask | 0.2939 | 0.5877 | 0.1959 | 0.2500 | 0.2500 | 0.3351 |
openai_embed_3_large | 0.3947 | 0.7895 | 0.2632 | 0.3348 | 0.3348 | 0.4491 |
KU-HIAI-ONTHEIT-large-v1 | 0.4386 | 0.8772 | 0.2924 | 0.3421 | 0.3421 | 0.4766 |
KU-HIAI-ONTHEIT-large-v1.1 | 0.4430 | 0.8860 | 0.2953 | 0.3406 | 0.3406 | 0.4778 |
kf-deberta-multitask | 0.3158 | 0.6316 | 0.2105 | 0.2792 | 0.2792 | 0.3679 |
gte-multilingual-base | 0.4035 | 0.8070 | 0.2690 | 0.3450 | 0.3450 | 0.4614 |
KoE5 | 0.4254 | 0.8509 | 0.2836 | 0.3173 | 0.3173 | 0.4514 |
BGE-m3 | 0.4254 | 0.8508 | 0.2836 | 0.3421 | 0.3421 | 0.4701 |
bge-m3-korean | 0.3684 | 0.7368 | 0.2456 | 0.3143 | 0.3143 | 0.4207 |
BGE-m3-ko | 0.4517 | 0.9035 | 0.3011 | 0.3494 | 0.3494 | 0.4886 |
Top-k 5

Model name | F1 | Recall | Precision | mAP | mRR | NDCG |
---|---|---|---|---|---|---|
paraphrase-multilingual-mpnet-base-v2 | 0.1813 | 0.5439 | 0.1088 | 0.1575 | 0.1575 | 0.2491 |
KoSimCSE-roberta | 0.2164 | 0.6491 | 0.1298 | 0.1751 | 0.1751 | 0.2873 |
Cohere embed-multilingual-v3.0 | 0.2076 | 0.6228 | 0.1246 | 0.1640 | 0.1640 | 0.2731 |
openai ada 002 | 0.2602 | 0.7807 | 0.1561 | 0.2139 | 0.2139 | 0.3486 |
multilingual-e5-large-instruct | 0.2544 | 0.7632 | 0.1526 | 0.2194 | 0.2194 | 0.3487 |
Upstage Embedding | 0.2982 | 0.8947 | 0.1789 | 0.2237 | 0.2237 | 0.3822 |
paraphrase-multilingual-MiniLM-L12-v2 | 0.1637 | 0.4912 | 0.0982 | 0.1437 | 0.1437 | 0.2264 |
openai_embed_3_small | 0.2690 | 0.8070 | 0.1614 | 0.2148 | 0.2148 | 0.3553 |
ko-sroberta-multitask | 0.2164 | 0.6491 | 0.1298 | 0.1697 | 0.1697 | 0.2835 |
openai_embed_3_large | 0.2807 | 0.8421 | 0.1684 | 0.2088 | 0.2088 | 0.3586 |
KU-HIAI-ONTHEIT-large-v1 | 0.3041 | 0.9123 | 0.1825 | 0.2137 | 0.2137 | 0.3783 |
KU-HIAI-ONTHEIT-large-v1.1 | 0.3099 | 0.9298 | 0.1860 | 0.2148 | 0.2148 | 0.3834 |
kf-deberta-multitask | 0.2281 | 0.6842 | 0.1368 | 0.1724 | 0.1724 | 0.2939 |
gte-multilingual-base | 0.2865 | 0.8596 | 0.1719 | 0.2096 | 0.2096 | 0.3637 |
KoE5 | 0.2982 | 0.8947 | 0.1789 | 0.2054 | 0.2054 | 0.3678 |
BGE-m3 | 0.3041 | 0.9123 | 0.1825 | 0.2193 | 0.2193 | 0.3832 |
bge-m3-korean | 0.2661 | 0.7982 | 0.1596 | 0.2116 | 0.2116 | 0.3504 |
BGE-m3-ko | 0.3099 | 0.9298 | 0.1860 | 0.2098 | 0.2098 | 0.3793 |
Top-k 10

Model name | F1 | Recall | Precision | mAP | mRR | NDCG |
---|---|---|---|---|---|---|
paraphrase-multilingual-mpnet-base-v2 | 0.1212 | 0.6667 | 0.0667 | 0.1197 | 0.1197 | 0.2382 |
KoSimCSE-roberta | 0.1324 | 0.7281 | 0.0728 | 0.1080 | 0.1080 | 0.2411 |
Cohere embed-multilingual-v3.0 | 0.1324 | 0.7281 | 0.0728 | 0.1150 | 0.1150 | 0.2473 |
openai ada 002 | 0.1563 | 0.8596 | 0.0860 | 0.1051 | 0.1051 | 0.2673 |
multilingual-e5-large-instruct | 0.1483 | 0.8158 | 0.0816 | 0.0980 | 0.0980 | 0.2520 |
Upstage Embedding | 0.1707 | 0.9386 | 0.0939 | 0.1078 | 0.1078 | 0.2848 |
paraphrase-multilingual-MiniLM-L12-v2 | 0.1053 | 0.5789 | 0.0579 | 0.0961 | 0.0961 | 0.2006 |
openai_embed_3_small | 0.1547 | 0.8509 | 0.0851 | 0.0984 | 0.0984 | 0.2593 |
ko-sroberta-multitask | 0.1276 | 0.7018 | 0.0702 | 0.0986 | 0.0986 | 0.2275 |
openai_embed_3_large | 0.1643 | 0.9035 | 0.0904 | 0.1180 | 0.1180 | 0.2855 |
KU-HIAI-ONTHEIT-large-v1 | 0.1707 | 0.9386 | 0.0939 | 0.1105 | 0.1105 | 0.2860 |
KU-HIAI-ONTHEIT-large-v1.1 | 0.1722 | 0.9474 | 0.0947 | 0.1033 | 0.1033 | 0.2822 |
kf-deberta-multitask | 0.1388 | 0.7632 | 0.0763 | 0.1000 | 0.1000 | 0.2422 |
gte-multilingual-base | 0.1675 | 0.9211 | 0.0921 | 0.1066 | 0.1066 | 0.2805 |
KoE5 | 0.1675 | 0.9211 | 0.0921 | 0.1011 | 0.1011 | 0.2750 |
BGE-m3 | 0.1707 | 0.9386 | 0.0939 | 0.1130 | 0.1130 | 0.2884 |
bge-m3-korean | 0.1579 | 0.8684 | 0.0868 | 0.1093 | 0.1093 | 0.2721 |
BGE-m3-ko | 0.1770 | 0.9736 | 0.0974 | 0.1097 | 0.1097 | 0.2932 |
miracl-ko (https://github.com/project-miracl/miracl)
Retrieval metrics on the Korean split of MIRACL, computed with the Sentence Transformers InformationRetrievalEvaluator:
Metric | Value |
---|---|
cosine_accuracy@1 | 0.6103 |
cosine_accuracy@3 | 0.8169 |
cosine_accuracy@5 | 0.8732 |
cosine_accuracy@10 | 0.9202 |
cosine_precision@1 | 0.6103 |
cosine_precision@3 | 0.3787 |
cosine_precision@5 | 0.2761 |
cosine_precision@10 | 0.1728 |
cosine_recall@1 | 0.3847 |
cosine_recall@3 | 0.5902 |
cosine_recall@5 | 0.6794 |
cosine_recall@10 | 0.7695 |
cosine_ndcg@10 | 0.6833 |
cosine_mrr@10 | 0.7262 |
cosine_map@100 | 0.6074 |
dot_accuracy@1 | 0.6103 |
dot_accuracy@3 | 0.8169 |
dot_accuracy@5 | 0.8732 |
dot_accuracy@10 | 0.9202 |
dot_precision@1 | 0.6103 |
dot_precision@3 | 0.3787 |
dot_precision@5 | 0.2761 |
dot_precision@10 | 0.1728 |
dot_recall@1 | 0.3847 |
dot_recall@3 | 0.5902 |
dot_recall@5 | 0.6794 |
dot_recall@10 | 0.7695 |
dot_ndcg@10 | 0.6723 |
dot_mrr@10 | 0.7262 |
dot_map@100 | 0.6074 |
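The scores above come from the Sentence Transformers InformationRetrievalEvaluator. A minimal sketch of how such an evaluation is wired up; the queries, corpus, and relevant_docs dictionaries below are hypothetical placeholders, not the actual miracl-ko data:
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("dragonkue/bge-m3-ko")

# Placeholder data; the reported numbers use the real miracl-ko queries and corpus.
queries = {"q1": "의료급여 2종 수급권자는 누구인가요?"}
corpus = {
    "d1": "국민기초생활보장 수급권자 중 근로능력이 있는 가구는 의료급여 2종에 해당한다.",
    "d2": "스노보드 슬로프스타일 예선이 취소되어 결선에서 순위를 가린다.",
    "d3": "의료급여제도는 생활이 어려운 국민에게 국가가 의료서비스를 제공하는 제도이다.",
}
relevant_docs = {"q1": {"d1"}}  # query id -> set of relevant corpus ids

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="miracl-ko-sample",  # hypothetical evaluator name
)
results = evaluator(model)
# In recent sentence-transformers versions this is a dict of metrics
# (accuracy@k, precision@k, recall@k, NDCG@10, MRR@10, MAP@100), keyed by evaluator name.
print(results)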
The batch size follows the setup in Text Embeddings by Weakly-Supervised Contrastive Pre-training (https://arxiv.org/pdf/2212.03533).
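A rough sketch of how the non-default hyperparameters listed below map onto the sentence-transformers v3 training API. SentenceTransformerTrainingArguments and BatchSamplers are the assumed entry points (recent transformers versions name the argument eval_strategy), and the output directory is a placeholder:
from sentence_transformers.training_args import (
    BatchSamplers,
    SentenceTransformerTrainingArguments,
)

# Mirrors the non-default hyperparameters listed below; output_dir is hypothetical.
args = SentenceTransformerTrainingArguments(
    output_dir="outputs/bge-m3-ko",
    eval_strategy="steps",
    per_device_train_batch_size=32768,
    per_device_eval_batch_size=32768,
    learning_rate=3e-5,
    warmup_ratio=0.03333333333333333,
    num_train_epochs=3,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # matches batch_sampler: no_duplicates
)
Very large batches such as 32768 strengthen in-batch-negative contrastive training, which is the motivation taken from the E5 paper cited above.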
Non-default hyperparameters:
eval_strategy: steps
per_device_train_batch_size: 32768
per_device_eval_batch_size: 32768
learning_rate: 3e-05
warmup_ratio: 0.03333333333333333
fp16: True
batch_sampler: no_duplicates

All hyperparameters:
overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 32768
per_device_eval_batch_size: 32768
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
learning_rate: 3e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.03333333333333333
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: True
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

Citation:
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{bge-m3,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
year={2024},
eprint={2402.03216},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@article{wang2022text,
title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2212.03533},
year={2022}
}