Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models (arXiv:2504.20157, published Apr 28, 2025)
Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models (arXiv:2402.11532, published Feb 18, 2024)
Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation (arXiv:2404.09127, published Apr 14, 2024)