TIGER-Lab/AceCodeRM-32B

@@ -29,6 +29,29 @@ We introduce AceCoder, the first work to propose a fully automated pipeline for
 ![https://tiger-ai-lab.github.io/AceCoder/static/images/ac_overview.png](https://tiger-ai-lab.github.io/AceCoder/static/images/ac_overview.png)
 ## Performance on Best-of-N sampling
@@ -40,11 +63,11 @@ We introduce AceCoder, the first work to propose a fully automated pipeline for
 ```python
 """pip install git+https://github.com/TIGER-AI-Lab/AceCoder"""
-from acecoder import Qwen2ForCausalRM
 from transformers import AutoTokenizer
 model_path = "TIGER-Lab/AceCodeRM-7B"
-model = Qwen2ForCausalRM.from_pretrained(model_path, device_map="auto")
 tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
 question = """\
@@ -114,17 +137,12 @@ input_tokens = tokenizer.apply_chat_template(
     return_tensors="pt",
 ).to(model.device)
-_, _, values = model(
     **input_tokens,
     output_hidden_states=True,
     return_dict=True,
     use_cache=False,
 )
-masks = input_tokens["attention_mask"]
-rm_scores = values.gather(
-    dim=-1, index=(masks.sum(dim=-1, keepdim=True) - 1)
-) # find the last token (eos) in each sequence, a
-rm_scores = rm_scores.squeeze()
 print("RM Scores:", rm_scores)
 print("Score of program with 3 errors:", rm_scores[0].item())

 ![https://tiger-ai-lab.github.io/AceCoder/static/images/ac_overview.png](https://tiger-ai-lab.github.io/AceCoder/static/images/ac_overview.png)
+## Performance on RM Bench
+| Model                                | Code | Chat  | Math  | Safety | Easy  | Normal | Hard | Avg  |
+| ------------------------------------ | ---- | ----- | ----- | ------ | ----- | ------ | ---- | ---- |
+| Skywork/Skywork-Reward-Llama-3.1-8B  | 54.5 | 69.5  | 60.6  | 95.7   | **89**    | 74.7   | 46.6 | 70.1 |
+| LxzGordon/URM-LLaMa-3.1-8B           | 54.1 | 71.2  | 61.8  | 93.1   | 84    | 73.2   | 53   | 70   |
+| NVIDIA/Nemotron-340B-Reward          | 59.4 | 71.2  | 59.8  | 87.5   | 81    | 71.4   | 56.1 | 69.5 |
+| NCSOFT/Llama-3-OffsetBias-RM-8B      | 53.2 | 71.3  | 61.9  | 89.6   | 84.6  | 72.2   | 50.2 | 69   |
+| internlm/internlm2-20b-reward        | 56.7 | 63.1  | 66.8  | 86.5   | 82.6  | 71.6   | 50.7 | 68.3 |
+| Ray2333/GRM-llama3-8B-sftreg         | 57.8 | 62.7  | 62.5  | 90     | 83.5  | 72.7   | 48.6 | 68.2 |
+| Ray2333/GRM-llama3-8B-distill        | 56.9 | 62.4  | 62.1  | 88.1   | 82.2  | 71.5   | 48.4 | 67.4 |
+| Ray2333/GRM-Llama3-8B-rewardmodel-ft | 52.1 | 66.8  | 58.8  | 91.4   | 86.2  | 70.6   | 45.1 | 67.3 |
+| LxzGordon/URM-LLaMa-3-8B             | 52.3 | 68.5  | 57.6  | 90.3   | 80.2  | 69.9   | 51.5 | 67.2 |
+| internlm/internlm2-7b-reward*         | 49.7 | 61.7  | **71.4**  | 85.5   | 85.4  | 70.7   | 45.1 | 67.1 |
+| Skywork-Reward-Llama-3.1-8B-v0.2*     | 53.4 | 69.2  | 62.1  | **96**     | 88.5  | 74     | 47.9 | 70.1 |
+| Skywork-Reward-Gemma-2-27B-v0.2*      | 45.8 | 49.4  | 50.7  | 48.2   | 50.3  | 48.2   | 47   | 48.5 |
+| AceCoder-RM-7B                       | 66.9 | 66.7  | 65.3  | 89.9   | 79.9  | 74.4   | 62.2 | 72.2 |
+| AceCoder-RM-32B                      | **72.1** | **73.7**  | 70.5  | 88     | 84.5  | **78.3**   | **65.5** | **76.1** |
+| Delta (AceCoder 7B - best of others)  | 7.5  | \-4.6 | \-6.1 | \-6.1  | \-9.1 | \-0.3  | 6.1  | 2.1  |
+| Delta (AceCoder 32B - best of others) | 12.7 | 2.4   | \-0.9 | \-8    | \-4.5 | 3.6    | 9.4  | 6    |
+\* These models have no official results because they were released after the RM Bench paper, so we did our best to extend the original code base to evaluate them ourselves. Our implementation can be found here:
+[Modified Reward Bench / RM Bench Code](https://github.com/wyettzeng/reward-bench)
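
The two Delta rows appear to be each AceCoder score minus the best score among the other listed models in the same column. The sketch below recomputes them from the table values under that assumption. The names `OTHERS`, `ACECODER`, and `delta_vs_best_other` are illustrative and not part of the AceCoder package.

```python
# Scores copied from the RM Bench table, in column order:
# Code, Chat, Math, Safety, Easy, Normal, Hard, Avg
OTHERS = {
    "Skywork/Skywork-Reward-Llama-3.1-8B": [54.5, 69.5, 60.6, 95.7, 89, 74.7, 46.6, 70.1],
    "LxzGordon/URM-LLaMa-3.1-8B": [54.1, 71.2, 61.8, 93.1, 84, 73.2, 53, 70],
    "NVIDIA/Nemotron-340B-Reward": [59.4, 71.2, 59.8, 87.5, 81, 71.4, 56.1, 69.5],
    "NCSOFT/Llama-3-OffsetBias-RM-8B": [53.2, 71.3, 61.9, 89.6, 84.6, 72.2, 50.2, 69],
    "internlm/internlm2-20b-reward": [56.7, 63.1, 66.8, 86.5, 82.6, 71.6, 50.7, 68.3],
    "Ray2333/GRM-llama3-8B-sftreg": [57.8, 62.7, 62.5, 90, 83.5, 72.7, 48.6, 68.2],
    "Ray2333/GRM-llama3-8B-distill": [56.9, 62.4, 62.1, 88.1, 82.2, 71.5, 48.4, 67.4],
    "Ray2333/GRM-Llama3-8B-rewardmodel-ft": [52.1, 66.8, 58.8, 91.4, 86.2, 70.6, 45.1, 67.3],
    "LxzGordon/URM-LLaMa-3-8B": [52.3, 68.5, 57.6, 90.3, 80.2, 69.9, 51.5, 67.2],
    "internlm/internlm2-7b-reward": [49.7, 61.7, 71.4, 85.5, 85.4, 70.7, 45.1, 67.1],
    "Skywork-Reward-Llama-3.1-8B-v0.2": [53.4, 69.2, 62.1, 96, 88.5, 74, 47.9, 70.1],
    "Skywork-Reward-Gemma-2-27B-v0.2": [45.8, 49.4, 50.7, 48.2, 50.3, 48.2, 47, 48.5],
}

ACECODER = {
    "AceCoder-RM-7B": [66.9, 66.7, 65.3, 89.9, 79.9, 74.4, 62.2, 72.2],
    "AceCoder-RM-32B": [72.1, 73.7, 70.5, 88, 84.5, 78.3, 65.5, 76.1],
}


def delta_vs_best_other(scores):
    """Subtract the per-column maximum over OTHERS from the given scores."""
    best = [max(column) for column in zip(*OTHERS.values())]
    return [round(s - b, 1) for s, b in zip(scores, best)]


for name, scores in ACECODER.items():
    print(name, delta_vs_best_other(scores))
```

Read this way, a positive delta means the AceCoder model beats every other listed model in that column, and a negative delta gives the gap to the strongest competitor.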
 ## Performance on Best-of-N sampling
 ```python
 """pip install git+https://github.com/TIGER-AI-Lab/AceCoder"""
+from acecoder import AceCodeRM
 from transformers import AutoTokenizer
 model_path = "TIGER-Lab/AceCodeRM-7B"
+model = AceCodeRM.from_pretrained(model_path, device_map="auto")
 tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
 question = """\
     return_tensors="pt",
 ).to(model.device)
+rm_scores = model(
     **input_tokens,
     output_hidden_states=True,
     return_dict=True,
     use_cache=False,
 )
 print("RM Scores:", rm_scores)
 print("Score of program with 3 errors:", rm_scores[0].item())