Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
Abstract
A profile-aware framework uses Large Language Models to evaluate podcast recommendations: natural-language user profiles distilled from listening history give the LLM compact context for its judgments, improving efficiency and interpretability.
Evaluating personalized recommendations remains a central challenge, especially in long-form audio domains like podcasts, where traditional offline metrics suffer from exposure bias and online methods such as A/B testing are costly and operationally constrained. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations in a scalable and interpretable manner. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. These profiles summarize both topical interests and behavioral patterns, serving as compact, interpretable representations of user preferences. Rather than prompting the LLM with raw data, we use these profiles to provide high-level, semantically rich context, enabling the LLM to reason more effectively about alignment between a user's interests and recommended episodes. This reduces input complexity and improves interpretability. The LLM is then prompted to deliver fine-grained pointwise and pairwise judgments based on the profile-episode match. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories. The framework enables efficient, profile-aware evaluation for iterative testing and model selection in recommender systems.
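To make the two-stage pipeline concrete, here is a minimal Python sketch of profile construction (Stage 1) and pointwise judging (Stage 2). The prompts, the `gpt-4o-mini` model choice, the `call_llm` helper, and the event schema are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of the two-stage pipeline: the prompts, model name,
# and event schema are illustrative assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def call_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def build_profile(listening_history: list[dict]) -> str:
    """Stage 1: distill ~90 days of listening events into a natural-language profile."""
    events = "\n".join(
        f"- {e['show']} | {e['episode']} | {e['minutes']} min listened"
        for e in listening_history
    )
    prompt = (
        "Summarize this user's podcast listening history as a short profile "
        "covering topical interests and behavioral patterns (episode length, "
        "listening frequency, preferred formats):\n" + events
    )
    return call_llm(prompt)


def pointwise_judgment(profile: str, episode: dict) -> str:
    """Stage 2: rate how well a candidate episode matches the profile."""
    prompt = (
        f"User profile:\n{profile}\n\n"
        f"Candidate episode: {episode['title']} - {episode['description']}\n\n"
        "On a 1-5 scale, how well does this episode match the user's "
        "interests? Give the score, then a one-sentence justification."
    )
    return call_llm(prompt)
```

Because Stage 1 produces a short profile rather than raw event logs, the Stage 2 prompt stays small and the score can be parsed and aggregated across users for iterative model selection.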
Community
Thrilled to share our research on “Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge”, which was accepted to the RecSys ’25 LBR track.
📌 The challenge: Evaluating recommendations is tricky — offline metrics are biased, online tests are costly, and human evaluation doesn’t scale.
🧠 Our solution: We use LLMs as interpretable, profile-aware offline judges. We distill 90 days of user behavior into natural-language profiles summarizing interests, habits, and styles. Then we prompt the LLM to judge how well recommendations match each user's interests (see the sketch after this list).
✅ Key takeaway: Profile-aware LLM judges match or exceed raw-history approaches in aligning with human judgments.
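🧪 For the curious, a minimal Python sketch of the pairwise judgment step; the prompt wording, model choice, and function name are illustrative assumptions, not our exact template:

```python
# Hypothetical pairwise variant of the judge; self-contained, same caveats
# as the pointwise sketch above (prompt and model name are assumptions).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def pairwise_judgment(profile: str, episode_a: str, episode_b: str) -> str:
    """Ask the judge which of two candidate episodes better fits the profile."""
    prompt = (
        f"User profile:\n{profile}\n\n"
        f"Episode A: {episode_a}\n"
        f"Episode B: {episode_b}\n\n"
        "Which episode better matches this user's interests? "
        "Answer 'A' or 'B', then give a brief justification."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```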
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LettinGo: Explore User Profile Generation for Recommendation System (2025)
- End-to-End Personalization: Unifying Recommender Systems with Large Language Models (2025)
- Exploration on Demand: From Algorithmic Control to User Empowerment (2025)
- RecGPT Technical Report (2025)
- Dynamic Context-Aware Prompt Recommendation for Domain-Specific AI Applications (2025)
- Using LLMs to Capture Users' Temporal Context for Recommendation (2025)
- ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation (2025)