Abstract
With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.
Community
Recent developments in reasoning models have significantly advanced many applications, yet answer extraction and validation remain challenging, particularly for responses with long reasoning chains and complex mathematical expressions. To address these issues, the Shanghai Algorithm Innovation Institute introduces xVerify, an answer verification tool with the following highlights:
Optimized for Long Reasoning Chains
xVerify is trained on extensive datasets containing long reasoning responses, enabling it to effectively handle interference from intermediate steps and self-reflection segments.
Broad Applicability
The tool supports diverse question types, including mathematical, multiple-choice, classification, and short-answer formats, and handles both Chinese and English, ensuring robust performance across evaluation settings.
Advanced Answer Verification
xVerify excels in verifying answers by efficiently handling transformations such as case conversion and Greek letter substitutions (e.g., converting “alpha” to “α”). It accurately determines the equivalence of complex mathematical expressions presented in various formats—including LaTeX, fractions, scientific notation, and natural language—ensuring that even variably formatted inputs are reliably matched and validated.
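As a rough illustration of this kind of equivalence judgment, the sketch below queries an xVerify checkpoint through the Hugging Face transformers API. The repository ID, prompt wording, and expected output label here are assumptions made for illustration only; the exact prompt template and judgment labels are defined in the GitHub repository linked below.

```python
# Minimal sketch: asking an xVerify checkpoint whether a model's final answer
# matches the reference answer. Repo ID and prompt wording are assumptions;
# consult the official repository for the real prompt template and labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "IAAR-Shanghai/xVerify-0.5B-I"  # assumed repo ID (see the collection link below)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "What is the probability of rolling a sum of 7 with two fair dice?"
model_output = "... after simplifying, the answer is \\frac{1}{6}."  # long reasoning truncated
reference = "1/6"

# Illustrative prompt: the verifier must decide whether \frac{1}{6} and 1/6 are equivalent.
prompt = (
    f"Question: {question}\n"
    f"Model response: {model_output}\n"
    f"Reference answer: {reference}\n"
    "Is the final answer in the model response equivalent to the reference answer?"
)

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=16, do_sample=False)

judgment = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(judgment)  # expected to indicate correct/incorrect; the exact label format is model-specific
```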
Flexible Model Configurations
Incorporating multiple model architectures (e.g., Qwen 2.5, Gemma 2, LLaMA 3.1/3.2, GLM 4, Phi-4) with parameter sizes ranging from 0.5B to 32B, xVerify lets users mitigate the biases of any single base model and choose a configuration that matches their computational budget.
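For readers deciding between variants, the sketch below loads a checkpoint by parameter budget. Only the two checkpoints named in the abstract are listed, and the repository IDs are assumed from the paper's naming; check the Hugging Face collection linked below before use.

```python
# Hedged sketch: picking an xVerify variant to match a compute budget.
# Repo IDs are assumed; verify them against the Hugging Face collection.
from transformers import pipeline

XVERIFY_VARIANTS = {
    "0.5B": "IAAR-Shanghai/xVerify-0.5B-I",  # smallest; reported to beat all baselines except GPT-4o
    "3B": "IAAR-Shanghai/xVerify-3B-Ib",     # reported to surpass GPT-4o overall
}

def load_verifier(size: str = "0.5B"):
    """Load the chosen xVerify checkpoint as a text-generation pipeline."""
    return pipeline("text-generation", model=XVERIFY_VARIANTS[size], device_map="auto")

# Example: the 0.5B variant is a reasonable default on a single consumer GPU.
verifier = load_verifier("0.5B")
```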
In summary, xVerify directly tackles the challenge of verifying long reasoning chain responses and, by achieving evaluation accuracies exceeding 96% in most cases, sets a new benchmark in answer verification.
Relevant links:
Hugging Face [Paper]: https://huggingface.co/papers/2504.10481
Hugging Face [Model]: https://huggingface.co/collections/IAAR-Shanghai/xverify-67e0f6f94c2dc334727da802
arXiv: https://arxiv.org/abs/2504.10481
GitHub: https://github.com/IAAR-Shanghai/xVerify
We would greatly appreciate it if you could give us a like or share on Hugging Face!
The following papers were recommended by the Semantic Scholar API:
- Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation (2025)
- MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification (2025)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering (2025)
- Theorem Prover as a Judge for Synthetic Data Generation (2025)
- DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities (2025)
- Towards Reasoning Ability of Small Language Models (2025)
- Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models (2025)