---
license: mit
datasets:
- GenPRM/GenPRM-MATH-Data
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
---

# Introduction

We propose GenPRM, a strong generative process reward model with the following features:

- reasoning with explicit CoT and code verification before providing the process judgment;
- improving Monte Carlo estimation and hard labels with Relative Progress Estimation (RPE);
- supporting GenPRM test-time scaling in a parallel manner with majority voting;
- supporting policy model test-time scaling with GenPRM as verifiers or critics.
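The parallel test-time scaling with majority voting mentioned above can be sketched as follows. This is a minimal illustration only: the `"+"`/`"-"` label format and the `majority_vote` helper are assumptions for exposition, not the repository's actual aggregation code.

```python
from collections import Counter

def majority_vote(judgments):
    """Aggregate per-step correctness judgments ("+" / "-") from N
    independently sampled GenPRM generations by majority voting.

    `judgments` is a list of N lists, one per sampled generation,
    each holding one label per solution step. Returns one voted
    label per step."""
    n_steps = len(judgments[0])
    voted = []
    for step in range(n_steps):
        labels = [sample[step] for sample in judgments]
        # Keep the most frequent label for this step.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Three sampled generations judging a four-step solution:
samples = [
    ["+", "+", "-", "-"],
    ["+", "-", "-", "-"],
    ["+", "+", "-", "+"],
]
print(majority_vote(samples))  # ['+', '+', '-', '-']
```

Because each generation is sampled independently, the N judgments can be produced in parallel and only this cheap aggregation runs afterwards.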

GenPRM achieves state-of-the-art performance across multiple benchmarks in two key roles:

- As a verifier: GenPRM-7B outperforms all classification-based PRMs of comparable size and even surpasses Qwen2.5-Math-PRM-72B via test-time scaling.
- As a critic: GenPRM-7B demonstrates superior critique capabilities, achieving 3.4× greater performance gains than DeepSeek-R1-Distill-Qwen-7B after 3 refinement iterations.
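As a rough illustration of the verifier role, Best-of-N selection with PRM step scores could look like the sketch below. The min-over-steps aggregation and the data layout are assumptions chosen for clarity (min, product, and last-step reductions all appear in the PRM literature), not GenPRM's actual pipeline.

```python
def solution_score(step_scores):
    # Assumed reduction: a solution is only as good as its weakest step.
    return min(step_scores)

def best_of_n(candidates):
    """Pick the candidate solution whose PRM step scores are best.

    `candidates` maps solution text -> list of per-step scores in [0, 1]
    produced by the process reward model."""
    return max(candidates, key=lambda sol: solution_score(candidates[sol]))

cands = {
    "solution A": [0.9, 0.8, 0.95],   # weakest step: 0.8
    "solution B": [0.99, 0.2, 0.9],   # weakest step: 0.2
}
print(best_of_n(cands))  # solution A
```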
24 |
|
25 |
+
- Project Page: [GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning](https://ryanliu112.github.io/GenPRM)
|
26 |
- Paper: [https://arxiv.org/abs/2504.00891](https://arxiv.org/abs/2504.00891)
|
27 |
- Code: [https://github.com/RyanLiu112/GenPRM](https://github.com/RyanLiu112/GenPRM)
|
28 |
+
- Awesome Process Reward Models: [Awesome Process Reward Models](https://github.com/RyanLiu112/Awesome-Process-Reward-Models)
|
29 |
+
- HF Paper Link:[GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning](https://hf.co/papers/2504.00891)
|
30 |
+
- HF Collection:[GenPRM](https://hf.co/collections/GenPRM/genprm-67ee4936234ba5dd16bb9943)

# Model details

For full training details, please refer to our [paper](https://arxiv.org/abs/2504.00891).

- Training data: the 23K SFT examples are released in [GenPRM-MATH-Data](https://huggingface.co/datasets/GenPRM/GenPRM-MATH-Data).
- Base model: we use the [DeepSeek-R1-Distill series](https://huggingface.co/deepseek-ai) (1.5B, 7B, and 32B) as our base models.

# How to use

The evaluation code of GenPRM is available in our GitHub repository: [https://github.com/RyanLiu112/GenPRM](https://github.com/RyanLiu112/GenPRM).

Here's a minimal example of using GenPRM for rationale generation and process supervision:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams