Abstract
With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.
Community
Recent developments in reasoning models have significantly advanced many applications, yet answer extraction and validation remain challenging, particularly for responses with long reasoning chains and complex mathematical expressions. To address these issues, the Shanghai Algorithm Innovation Institute introduces xVerify, an answer verification tool with the following highlights:
Optimized for Long Reasoning Chains
xVerify is trained on extensive datasets containing long reasoning responses, enabling it to effectively handle interference from intermediate steps and self-reflection segments.
Broad Applicability
The tool supports diverse question types, including mathematical, multiple-choice, classification, and short-answer formats, and handles both Chinese and English, ensuring robust performance across evaluation settings.
Advanced Answer Verification
xVerify excels in verifying answers by efficiently handling transformations such as case conversion and Greek letter substitutions (e.g., converting “alpha” to “α”). It accurately determines the equivalence of complex mathematical expressions presented in various formats—including LaTeX, fractions, scientific notation, and natural language—ensuring that even variably formatted inputs are reliably matched and validated.
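As a rough illustration of this kind of equivalence judgment, the sketch below queries an xVerify checkpoint through the Hugging Face transformers API. The repository ID, prompt wording, and expected output label here are assumptions made for illustration only; the exact prompt template and judgment labels are defined in the GitHub repository linked below.

```python
# Minimal sketch: asking an xVerify checkpoint whether a model's final answer
# matches the reference answer. Repo ID and prompt wording are assumptions;
# consult the official repository for the real prompt template and labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "IAAR-Shanghai/xVerify-0.5B-I"  # assumed repo ID (see the collection link below)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "What is the probability of rolling a sum of 7 with two fair dice?"
model_output = "... after simplifying, the answer is \\frac{1}{6}."  # long reasoning truncated
reference = "1/6"

# Illustrative prompt: the verifier must decide whether \frac{1}{6} and 1/6 are equivalent.
prompt = (
    f"Question: {question}\n"
    f"Model response: {model_output}\n"
    f"Reference answer: {reference}\n"
    "Is the final answer in the model response equivalent to the reference answer?"
)

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=16, do_sample=False)

judgment = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(judgment)  # expected to indicate correct/incorrect; the exact label format is model-specific
```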
Flexible Model Configurations
Incorporating multiple model architectures (e.g., Qwen 2.5, Gemma 2, LLaMA 3.1/3.2, GLM 4, Phi-4) with parameter sizes ranging from 0.5B to 32B, xVerify lets users mitigate the biases of any single base model and choose a configuration that matches their computational budget.
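For readers deciding between variants, the sketch below loads a checkpoint by parameter budget. Only the two checkpoints named in the abstract are listed, and the repository IDs are assumed from the paper's naming; check the Hugging Face collection linked below before use.

```python
# Hedged sketch: picking an xVerify variant to match a compute budget.
# Repo IDs are assumed; verify them against the Hugging Face collection.
from transformers import pipeline

XVERIFY_VARIANTS = {
    "0.5B": "IAAR-Shanghai/xVerify-0.5B-I",  # smallest; reported to beat all baselines except GPT-4o
    "3B": "IAAR-Shanghai/xVerify-3B-Ib",     # reported to surpass GPT-4o overall
}

def load_verifier(size: str = "0.5B"):
    """Load the chosen xVerify checkpoint as a text-generation pipeline."""
    return pipeline("text-generation", model=XVERIFY_VARIANTS[size], device_map="auto")

# Example: the 0.5B variant is a reasonable default on a single consumer GPU.
verifier = load_verifier("0.5B")
```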
In summary, xVerify directly tackles the challenge of verifying long reasoning chain responses and, by achieving evaluation accuracies exceeding 96% in most cases, sets a new benchmark in answer verification.
Relevant links:
Hugging Face [Paper]: https://huggingface.co/papers/2504.10481
Hugging Face [Model]: https://huggingface.co/collections/IAAR-Shanghai/xverify-67e0f6f94c2dc334727da802
arXiv: https://arxiv.org/abs/2504.10481
GitHub: https://github.com/IAAR-Shanghai/xVerify
We would greatly appreciate it if you could give us a like or share on Hugging Face!
The following papers were recommended by the Semantic Scholar API:
- Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation (2025)
- MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification (2025)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering (2025)
- Theorem Prover as a Judge for Synthetic Data Generation (2025)
- DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities (2025)
- Towards Reasoning Ability of Small Language Models (2025)
- Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models (2025)