MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
Abstract
MaPPO, a framework for preference optimization, enhances alignment of large language models with human preferences by integrating prior reward knowledge into a Maximum a Posteriori objective, improving performance across various benchmarks.
As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. Moreover, MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. MaPPO can also be used as a plugin that yields consistent improvements over DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations across model sizes and model series on three standard benchmarks, MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
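For context, the sketch below contrasts the standard DPO objective, which the abstract characterizes as an MLE formulation of preference learning, with a generic MaP decomposition in which a prior informed by reward estimates augments the likelihood. The exact construction of MaPPO's prior term is given in the paper, not in this abstract, so the dependence on the reward estimate \hat{r} shown here is an illustrative assumption rather than the paper's definition.

```latex
% Standard DPO objective: maximum likelihood under the Bradley--Terry model,
% where y_w is the preferred and y_l the rejected response for prompt x.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
      \Big[\log \sigma\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \Big)\Big]

% Generic MaP view (illustrative only): the MLE term is augmented by a log-prior.
% How MaPPO derives the prior from reward estimates \hat{r}(x, y) is an assumption
% here; only the overall MLE-plus-prior structure is taken from the abstract.
\hat{\theta}_{\mathrm{MaP}}
  = \arg\max_\theta \;
    \underbrace{\log p(\mathcal{D} \mid \theta)}_{\text{likelihood (DPO-style term)}}
    \;+\;
    \underbrace{\log p\big(\theta;\, \hat{r}\big)}_{\text{prior from reward estimates}}
```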
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization (2025)
- Robust Preference Optimization via Dynamic Target Margins (2025)
- Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling (2025)
- Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization (2025)
- Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model (2025)
- Stable Preference Optimization for LLMs: A Bilevel Approach Beyond Direct Preference Optimization (2025)
- Adaptive Sample Scheduling for Direct Preference Optimization (2025)