About mcpbr
mcpbr (Model Context Protocol Benchmark Runner) is an open-source framework for evaluating whether MCP servers actually improve AI agent performance. It provides controlled, reproducible benchmarking across 25+ benchmarks so developers can stop guessing and start measuring.
The Origin Story
mcpbr was created by Grey Newell after identifying a critical gap in the MCP ecosystem: no tool existed to measure whether an MCP server actually made an AI agent better at its job.
Existing coding benchmarks like SWE-bench measured raw language model capabilities. MCP server developers relied on anecdotal evidence and demo videos. There was no way to answer the fundamental question: does adding this MCP server to an agent improve its performance on real tasks?
mcpbr was built to answer that question with hard data.
"No available tool allowed users to easily measure the performance improvement of introducing their MCP server to an agent."
The Problem mcpbr Solves
Before mcpbr, MCP server evaluation looked like this:
- Manual testing — run a few prompts, eyeball the results, declare it "works"
- Demo-driven development — show a polished demo, hope it generalizes
- Vibes-based benchmarking — "it feels faster" with no quantitative evidence
mcpbr solves all of these by running controlled experiments: same model, same tasks, same Docker environment — the only variable is the MCP server.
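A minimal sketch of that experimental design, with illustrative names only (this is not mcpbr's configuration format or API):

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentArm:
    """One arm of the controlled experiment."""
    model: str
    tasks: tuple[str, ...]
    docker_image: str
    mcp_server: str | None  # None = baseline agent with no MCP server attached


# Model, tasks, and environment are held fixed across both arms;
# the MCP server is the only thing that changes.
shared = {
    "model": "example-model",                # illustrative model name
    "tasks": ("task-001", "task-002"),       # illustrative task IDs
    "docker_image": "benchmark-env:latest",  # illustrative image tag
}

baseline = ExperimentArm(**shared, mcp_server=None)
treatment = ExperimentArm(**shared, mcp_server="my-mcp-server")
```

Because everything except `mcp_server` is shared, any difference in task outcomes between the two arms can be attributed to the server rather than to the model, the tasks, or the environment.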
Eval-Driven Development
mcpbr embodies eval-driven development principles: every MCP server should be evaluated with automated, reproducible benchmarks before shipping. The eval comes first — before the demo, before the anecdote, before the vibes-based assessment.
A Key Insight: Test Like APIs, Not Plugins
MCP servers should be tested like APIs, not like plugins.
Plugins just need to load and not crash. APIs have defined contracts — expected inputs, outputs, error handling, and performance characteristics. MCP servers sit squarely in API territory.
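What a contract-style check might look like in practice, as a hedged pytest-style sketch: the `call_tool` helper, the `search_code` tool name, and the response shape are all hypothetical stand-ins, not part of mcpbr or the MCP SDK.

```python
import time


def call_tool(name: str, arguments: dict) -> dict:
    """Hypothetical stand-in for invoking an MCP tool; a real harness
    would route this call through an MCP client session."""
    # Stubbed response so the sketch runs on its own.
    return {"status": "ok", "results": [{"path": "src/app.py", "line": 42}]}


def test_search_tool_contract():
    start = time.monotonic()
    response = call_tool("search_code", {"query": "def main"})
    elapsed = time.monotonic() - start

    # Contract checks: expected output shape and a latency budget,
    # not just "the server loaded without crashing".
    assert response["status"] == "ok"
    assert all({"path", "line"} <= result.keys() for result in response["results"])
    assert elapsed < 2.0  # illustrative performance budget in seconds


if __name__ == "__main__":
    test_search_tool_contract()
    print("contract test passed")
```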
Project Vision
Current Capabilities
- 25+ benchmarks across software engineering, code generation, math reasoning, security, tool use, and more
- Multi-provider support for Anthropic, OpenAI, Google Gemini, and Alibaba Qwen
- Multiple agent harnesses including Claude Code, OpenAI Codex, OpenCode, and Gemini
- Infrastructure flexibility with local Docker and Azure VM execution
- Regression detection with CI/CD integration, threshold-based alerts, and multi-channel notifications
- Comprehensive analytics including statistical significance testing, trend analysis, and leaderboards
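The kind of comparison behind the last bullet can be illustrated with a simple significance test over per-task pass/fail results from the two arms. This is a simplified sketch using Fisher's exact test from scipy, not mcpbr's actual analytics code, and the outcome lists are made-up data.

```python
from scipy.stats import fisher_exact  # assumes scipy is installed


def compare_arms(baseline: list[bool], with_mcp: list[bool]) -> None:
    """Compare per-task pass/fail outcomes from the baseline and MCP arms."""
    table = [
        [sum(with_mcp), len(with_mcp) - sum(with_mcp)],  # passes, failures with the MCP server
        [sum(baseline), len(baseline) - sum(baseline)],  # passes, failures for the baseline agent
    ]
    _, p_value = fisher_exact(table, alternative="greater")
    uplift = sum(with_mcp) / len(with_mcp) - sum(baseline) / len(baseline)
    print(f"solve-rate uplift: {uplift:+.1%}, p = {p_value:.3f}")


# Illustrative outcomes for 10 tasks per arm (not real benchmark data).
compare_arms(
    baseline=[True, False, False, True, False, False, True, False, False, False],
    with_mcp=[True, True, False, True, True, False, True, True, False, True],
)
```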
Links
| Resource | Link |
|---|---|
| GitHub | github.com/greynewell/mcpbr |
| PyPI | pypi.org/project/mcpbr |
| npm | npmjs.com/package/mcpbr-cli |
| Blog Post | Why I Built mcpbr |
| Creator | greynewell.com |
| SchemaFlux | schemaflux.dev |
| License | MIT |
Created by Grey Newell