₩ON: Open LLM for Korean Finance
Introduction
₩ON is an advanced Large Language Model (LLM) specifically tailored for financial tasks in the Korean domain. ₩ON is designed to enhance reliability and transparency in financial AI applications. The core intent behind ₩ON is to promote research openness, benchmark rigorous financial reasoning capabilities, and foster best practices in training Korean-specific financial language models. The model notably incorporates a two-step structured reasoning approach, providing self-correcting reasoning followed by a conclusive summary, aiming to elevate clarity and accuracy in financial decision-making processes.
KRX Financial LLM Competition
Competition Overview
The competition was the first open leaderboard dedicated to evaluating large language models specifically for Korean financial tasks. It was conducted over two months, including preliminary and final rounds, attracting 233 registered teams who collectively submitted over 1,100 models. The preliminary round included evaluations across five categories (Financial Markets, Finance and Accounting, Domestic Company Analysis, Financial Agent Tasks, and Stock Price Prediction), while the final round concentrated on Finance and Accounting, Financial Markets, and Open-Ended Finance QA.
Benchmark Description
The benchmark used during the competition consisted of approximately 5,500 carefully curated MCQA and Instruction-Response questions across various financial domains:
- Finance and Accounting: Evaluated via university-level multiple-choice questions on accounting and financial principles.
- Financial Markets: Based on examinations assessing understanding of financial regulations and Korean market systems.
- Stock Price Prediction: Involved binary prediction tasks based on recent stock price data and computed indicators.
- Domestic Company Analysis: Utilized KRX-Bench data generated from Korean company filings.
- Financial Agents: Tasked models with executing financial data manipulations and coding tasks.
- Open-Ended FinQA: Comprised complex graduate-level econometric and legal reasoning tasks.
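To make the Stock Price Prediction category concrete, the following is a hypothetical sketch of how a binary instance could pair recent prices with a computed indicator (here a 5-day simple moving average). The function names and the SMA choice are illustrative assumptions, not the benchmark's actual construction.

```python
# Hypothetical sketch of a binary stock-price-prediction instance built
# from recent closes and a computed indicator (5-day SMA). Illustrative
# only; not the benchmark's actual data-generation code.
def sma(prices, window=5):
    """Simple moving average over the last `window` closes."""
    return sum(prices[-window:]) / window

def make_instance(prices):
    """Build a (prompt, label) pair: is the latest close above its 5-day SMA?"""
    indicator = sma(prices)
    label = "UP" if prices[-1] > indicator else "DOWN"
    prompt = (
        f"Recent closes: {prices}. 5-day SMA: {indicator:.2f}. "
        "Is the latest close above its 5-day SMA? Answer UP or DOWN."
    )
    return prompt, label
```

A rising price series would yield an "UP" label, since the latest close sits above its trailing average.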

Benchmark Competition Statistics
The competition saw broad participation: 52.5% of teams came from corporate sectors such as Tech and Finance, alongside significant academic involvement, reflecting diverse stakeholder interest in Korean financial NLP.

Competition Results Analysis
During the preliminary rounds, top-performing models primarily utilized supervised fine-tuning (SFT), yielding notable gains, particularly in the Domestic Company Analysis category. Despite substantial improvements in this area, advancements in Finance and Accounting and Financial Markets were comparatively modest. Most models adopted straightforward SFT approaches; however, some teams experimented with additional training methods, such as continual pre-training (CPT), although its impact at smaller scales remained inconclusive.

In the final rounds, advanced multi-step training methodologies became prevalent. Notably, teams implemented curriculum-based SFT strategies, beginning with simpler prompts and progressing towards more challenging instances generated using methods such as Evol-Instruct. The best-performing models further refined their capabilities through preference optimization techniques such as Direct Preference Optimization (DPO) and KTO, utilizing responses evaluated by LLM-as-a-Judge methodologies. Team Hi-Q specifically demonstrated the effectiveness of continual pre-training combined with SFT and DPO, achieving substantial performance improvements and thereby highlighting the value of structured, multi-stage training processes.

Model Training
Dataset Collection
We compiled a comprehensive training dataset of roughly 400,000 high-quality instructional samples through meticulous processes:
- Competition: A publicly available 80k-instruction dataset, carefully filtered from more than 200,000 submissions on Hugging Face during the competition using MinHash deduplication and regex filtering to ensure data quality.
- Reasoning response: Responses generated using DeepSeek-R1, complemented by prompt-response pairs gathered from publicly available English and Korean online resources.
- Verification: Human verification and automated quality checks using GPT-4o as an LLM-as-a-judge, enhancing data integrity and correctness.
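The MinHash-based near-duplicate filtering used on competition submissions can be sketched in pure Python. This is a simplified illustration: the shingle size, 64 permutations, and 0.8 similarity threshold are assumed values, not the pipeline's actual settings.

```python
# Simplified MinHash near-duplicate filtering in pure Python. Shingle
# size, 64 permutations, and the 0.8 threshold are assumed values, not
# the competition pipeline's actual configuration.
import hashlib
import re

NUM_PERM = 64  # number of hash permutations in each signature

def shingles(text, n=3):
    """Lowercased word n-gram shingles of a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

def minhash_signature(text):
    """One minimum hash value per seeded permutation."""
    grams = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams
        )
        for seed in range(NUM_PERM)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

def dedup(samples, threshold=0.8):
    """Keep a sample only if it is not a near-duplicate of one already kept."""
    kept, signatures = [], []
    for sample in samples:
        sig = minhash_signature(sample)
        if all(estimated_jaccard(sig, s) < threshold for s in signatures):
            kept.append(sample)
            signatures.append(sig)
    return kept
```

In practice a library such as `datasketch` with LSH indexing would be used to avoid the quadratic pairwise comparison shown here.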
Training Methods
We designed a sophisticated two-phase training strategy for ₩ON:
Supervised Fine-Tuning (SFT): This stage aligned the model's initial behavior with financial reasoning tasks, employing carefully curated prompts paired with comprehensive responses generated by DeepSeek-R1. The dataset comprised prompt-response pairs meticulously reviewed to ensure linguistic coherence in both Korean and English.
Direct Preference Optimization (DPO): After initial fine-tuning, DPO was used to optimize model preferences and reduce unwanted behaviors, particularly the model's tendency to overthink or misinterpret certain queries. Preference data comparing the model's outputs against DeepSeek-R1's was leveraged to effectively refine model responses.
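For reference, the standard per-pair DPO objective from the literature can be written as a small function. The beta of 0.1 below is an assumed value, not ₩ON's actual training configuration.

```python
# The standard per-pair DPO objective, sketched as a plain function.
# beta=0.1 is an assumed default, not WON's actual training setting.
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))) for one pair.

    Lower loss means the policy prefers the chosen response over the
    rejected one more strongly than the frozen reference model does."""
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; raising the chosen response's log-probability lowers the loss.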
Model Specifications
- Base Model: Qwen2.5-Math-7B-Instruct
- Language: Korean, English
- Parameters: 7B
₩ON is designed to output structured responses in a two-step reasoning format:
- Think Step: The model explicitly demonstrates its reasoning process within `<think>` and `</think>` tags, allowing for transparency and helping users understand how ₩ON arrives at its conclusions.
- Solution Step: After reasoning, the model succinctly summarizes its conclusions within `<solution>` and `</solution>` tags, providing clear and concise answers.
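A downstream application can split the two-step output with a small parser. This is a hypothetical helper, not part of the released model or tokenizer, assuming the tag format described above.

```python
# Hypothetical helper for consuming WON-style two-step responses; not
# part of the released model or tokenizer.
import re

def parse_won_output(text):
    """Return (reasoning, answer) from a two-step response; None for absent tags."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    solution = re.search(r"<solution>(.*?)</solution>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        solution.group(1).strip() if solution else None,
    )
```

This lets an application display only the `<solution>` content to end users while logging the `<think>` content for auditability.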
Benchmark Results
We evaluated ₩ON on the comprehensive benchmark employed in the competition. This benchmark consisted of rigorously designed multiple-choice questions (MCQA) and open-ended questions to thoroughly assess the practical and theoretical capabilities of financial language models. The benchmark is categorized into Finance & Accounting (F&A), Financial Market Analysis, and an Open-Ended Financial Question-Answering (FinQA) task:
- Finance & Accounting: Evaluates the model's knowledge and analytical skills in financial concepts, accounting principles, and econometric reasoning.
- Financial Market Analysis: Assesses the model's understanding of financial markets, systems, regulations, and domain-specific factual knowledge.
- Open-Ended FinQA: Comprises complex, detailed reasoning questions that simulate realistic financial problem-solving scenarios.
Results
₩ON emerged as the highest-performing model on average compared to the models awarded in the competition. Its performance, particularly on the Finance & Accounting and Open-Ended FinQA subsets, reflects strong reasoning capabilities. Despite placing less emphasis on purely domain-specific knowledge (Financial Markets), ₩ON's reasoning strength allowed it to notably outperform competing models.

Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("KRX-Data/WON-Reasoning", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("KRX-Data/WON-Reasoning")

messages = [
    {"role": "user", "content": "<your_prompt>"}  # Replace `<your_prompt>` with your query!
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
Citation
@article{son2025won,
title={Won: Establishing Best Practices for Korean Financial NLP},
author={Son, Guijin and Ko, Hyunwoo and Jung, Haneral and Hwang, Chami},
journal={arXiv preprint arXiv:2503.17963},
year={2025}
}
Contact
spthsrbwls123@yonsei.ac.kr, hcharm2ing@krx.co.kr