LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Abstract
LiveMCP-101 benchmarks AI agents' ability to use multiple tools in real-world scenarios, revealing challenges in tool orchestration and inefficiencies in token usage.
Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how effectively AI agents can solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
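To make the plan-based evaluation idea concrete, below is a minimal, hypothetical sketch (not the authors' code) of scoring an agent's tool-call trace against a ground-truth execution plan instead of against raw API outputs. The plan format, field names, and in-order matching rule are illustrative assumptions only; a real evaluator would also verify arguments and intermediate results, e.g. with an LLM judge.

```python
# Hypothetical sketch: step-level comparison of an agent's MCP tool-call trace
# against a ground-truth execution plan. All names and the matching rule are
# assumptions for illustration, not the benchmark's actual implementation.
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str   # e.g. "web_search", "file_read", "calculator"
    args: dict  # arguments passed to the MCP tool


def plan_match_score(plan: list[ToolCall], trace: list[ToolCall]) -> float:
    """Fraction of ground-truth plan steps matched, in order, by the agent trace."""
    matched, i = 0, 0
    for step in plan:
        # Scan forward through the trace for a call to the same tool.
        while i < len(trace) and trace[i].tool != step.tool:
            i += 1
        if i < len(trace):
            matched += 1
            i += 1
    return matched / len(plan) if plan else 1.0


if __name__ == "__main__":
    plan = [ToolCall("web_search", {"q": "conference deadline"}),
            ToolCall("calculator", {"expr": "days_until(deadline)"})]
    trace = [ToolCall("web_search", {"q": "conference submission deadline"}),
             ToolCall("file_read", {"path": "notes.txt"}),
             ToolCall("calculator", {"expr": "days_until(2025-02-15)"})]
    print(f"plan match: {plan_match_score(plan, trace):.2f}")  # -> 1.00
```

A score like this tolerates extra or re-ordered agent actions while still crediting completion of each required step, which is one way to keep evaluation stable when live tool outputs change over time.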
Community
This paper introduces LiveMCP-101, a benchmark of 101 curated real-world queries that require multi-step use of MCP tools, together with a novel evaluation based on ground-truth execution plans rather than raw outputs. Even frontier LLMs achieve success rates under 60%. We also provide detailed failure attribution and token-efficiency analysis.
awesome paper!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools? (2025)
- MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models (2025)
- Help or Hurdle? Rethinking Model Context Protocol-Augmented Large Language Models (2025)
- MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers (2025)
- MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations (2025)
- Agent WARPP: Workflow Adherence via Runtime Parallel Personalization (2025)
- Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark (2025)