arxiv:2504.10481

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Published on Apr 14
· Submitted by Hush-cd on Apr 16
#1 Paper of the day

Abstract

With the release of the o1 model by OpenAI, reasoning models adopting slow-thinking strategies have gradually emerged. Because the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate: they struggle to determine whether the LLM output is truly equivalent to the reference answer, and they also have difficulty identifying and extracting the final answer from long, complex responses. To address these issues, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and the generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.

Community

Paper author · Paper submitter

Recent developments in reasoning models have significantly advanced various applications, yet answer extraction and validation remain challenging, particularly for responses containing long reasoning chains and complex mathematical expressions. To address these issues, the Shanghai Algorithm Innovation Institute introduces xVerify, an answer verification tool with the following highlights:

Optimized for Long Reasoning Chains
xVerify is trained on extensive datasets containing long reasoning responses, enabling it to effectively handle interference from intermediate steps and self-reflection segments.

Broad Applicability
The tool supports diverse question types—including mathematical, multiple-choice, classification, and short-answer formats—and offers bilingual (Chinese and English) capabilities, ensuring robust performance in various evaluative contexts.

Advanced Answer Verification
xVerify excels in verifying answers by efficiently handling transformations such as case conversion and Greek letter substitutions (e.g., converting “alpha” to “α”). It accurately determines the equivalence of complex mathematical expressions presented in various formats—including LaTeX, fractions, scientific notation, and natural language—ensuring that even variably formatted inputs are reliably matched and validated.
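To make the equivalence problem concrete, here is a minimal, hypothetical Python sketch of the kinds of normalizations described above (case folding, spelled-out Greek letters, and numerically equivalent formats). xVerify itself is a trained LLM judge, not a rule-based matcher; the function names and mapping table below are invented purely for illustration.

```python
# Hypothetical illustration of the normalizations described above.
# xVerify is a trained LLM judge, not this kind of rule-based matcher;
# the names and the (partial) table here are invented for this sketch.
from fractions import Fraction

GREEK = {"alpha": "α", "beta": "β", "gamma": "γ", "pi": "π"}  # partial table

def normalize(answer: str) -> str:
    """Trim, lowercase, and map spelled-out Greek letters to symbols."""
    s = answer.strip().lower()
    for word, symbol in GREEK.items():
        s = s.replace(word, symbol)
    return s

def numerically_equal(a: str, b: str) -> bool:
    """Treat '1/2', '0.5', and '5e-1' as the same value when both parse."""
    try:
        return Fraction(a) == Fraction(b)
    except (ValueError, ZeroDivisionError):
        return False

assert normalize("Alpha ") == "α"
assert numerically_equal("1/2", "5e-1") and numerically_equal("0.5", "1/2")
```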

Flexible Model Configurations
Incorporating multiple model architectures (e.g., Qwen 2.5, Gemma 2, LLaMA 3.1/3.2, GLM 4, Phi-4) with parameter sizes ranging from 0.5B to 32B, xVerify allows users to mitigate inherent biases and select optimal configurations based on their computational needs.
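Since the checkpoints are released on the Hugging Face Hub, a typical verification call might look like the sketch below, assuming the models load as standard causal LMs via transformers. The repo id and the prompt wording are assumptions based on the paper's naming (xVerify-0.5B-I); consult the GitHub repository for the official inference code and prompt template.

```python
# A minimal usage sketch, assuming the released checkpoints load as
# standard causal LMs via transformers. The repo id and the prompt
# wording below are assumptions, not the official template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "IAAR-Shanghai/xVerify-0.5B-I"  # assumed id; see the HF collection

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Question: What is 2 + 2?\n"
    "Model response: Thinking step by step... so the answer is four.\n"
    "Reference answer: 4\n"
    "Is the model response equivalent to the reference answer?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```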

In summary, xVerify directly tackles the challenge of verifying long reasoning chain responses and, by achieving evaluation accuracies exceeding 96% in most cases, sets a new benchmark in answer verification.

Relevant links:

Hugging Face [Paper]: https://huggingface.co/papers/2504.10481
Hugging Face [Model]: https://huggingface.co/collections/IAAR-Shanghai/xverify-67e0f6f94c2dc334727da802
arXiv: https://arxiv.org/abs/2504.10481
GitHub: https://github.com/IAAR-Shanghai/xVerify

We would greatly appreciate it if you could give us a like or share on Hugging Face!
