m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models • Paper • 2504.00869 • Published 22 days ago • 10
ViLBench: A Suite for Vision-Language Process Reward Modeling • Paper • 2503.20271 • Published 28 days ago • 7
IHEval: Evaluating Language Models on Following the Instruction Hierarchy • Paper • 2502.08745 • Published Feb 12 • 19