About mcpbr
mcpbr (Model Context Protocol Benchmark Runner) is an open-source framework for evaluating whether MCP servers actually improve AI agent performance. It provides controlled, reproducible benchmarking across 25+ benchmarks so developers can stop guessing and start measuring.
The Origin Story
mcpbr was created by Grey Newell after identifying a critical gap in the MCP ecosystem: no tool existed to measure whether an MCP server actually made an AI agent better at its job.
Existing coding benchmarks like SWE-bench measured raw language model capabilities. MCP server developers relied on anecdotal evidence and demo videos. There was no way to answer the fundamental question: does adding this MCP server to an agent improve its performance on real tasks?
mcpbr was built to answer that question with hard data.
"No available tool allowed users to easily measure the performance improvement of introducing their MCP server to an agent."
The Problem mcpbr Solves
Before mcpbr, MCP server evaluation looked like this:
- Manual testing — run a few prompts, eyeball the results, declare it "works"
- Demo-driven development — show a polished demo, hope it generalizes
- Vibes-based benchmarking — "it feels faster" with no quantitative evidence
mcpbr solves all of these by running controlled experiments: same model, same tasks, same Docker environment — the only variable is the MCP server.
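A minimal sketch of that experimental design, with illustrative names only (this is not mcpbr's configuration format or API):

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentArm:
    """One arm of the controlled experiment."""
    model: str
    tasks: tuple[str, ...]
    docker_image: str
    mcp_server: str | None  # None = baseline agent with no MCP server attached


# Model, tasks, and environment are held fixed across both arms;
# the MCP server is the only thing that changes.
shared = {
    "model": "example-model",                # illustrative model name
    "tasks": ("task-001", "task-002"),       # illustrative task IDs
    "docker_image": "benchmark-env:latest",  # illustrative image tag
}

baseline = ExperimentArm(**shared, mcp_server=None)
treatment = ExperimentArm(**shared, mcp_server="my-mcp-server")
```

Because everything except `mcp_server` is shared, any difference in task outcomes between the two arms can be attributed to the server rather than to the model, the tasks, or the environment.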
Eval-Driven Development
mcpbr embodies eval-driven development principles: every MCP server should be evaluated with automated, reproducible benchmarks before shipping. The eval comes first — before the demo, before the anecdote, before the vibes-based assessment.
A Key Insight: Test Like APIs, Not Plugins
MCP servers should be tested like APIs, not like plugins.
Plugins just need to load and not crash. APIs have defined contracts — expected inputs, outputs, error handling, and performance characteristics. MCP servers sit squarely in API territory.
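What a contract-style check might look like in practice, as a hedged pytest-style sketch: the `call_tool` helper, the `search_code` tool name, and the response shape are all hypothetical stand-ins, not part of mcpbr or the MCP SDK.

```python
import time


def call_tool(name: str, arguments: dict) -> dict:
    """Hypothetical stand-in for invoking an MCP tool; a real harness
    would route this call through an MCP client session."""
    # Stubbed response so the sketch runs on its own.
    return {"status": "ok", "results": [{"path": "src/app.py", "line": 42}]}


def test_search_tool_contract():
    start = time.monotonic()
    response = call_tool("search_code", {"query": "def main"})
    elapsed = time.monotonic() - start

    # Contract checks: expected output shape and a latency budget,
    # not just "the server loaded without crashing".
    assert response["status"] == "ok"
    assert all({"path", "line"} <= result.keys() for result in response["results"])
    assert elapsed < 2.0  # illustrative performance budget in seconds


if __name__ == "__main__":
    test_search_tool_contract()
    print("contract test passed")
```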
Project Vision
Current Capabilities
- 25+ benchmarks across software engineering, code generation, math reasoning, security, tool use, and more
- Multi-provider support for Anthropic, OpenAI, Google Gemini, and Alibaba Qwen
- Multiple agent harnesses including Claude Code, OpenAI Codex, OpenCode, and Gemini
- Infrastructure flexibility with local Docker and Azure VM execution
- Regression detection with CI/CD integration, threshold-based alerts, and multi-channel notifications
- Comprehensive analytics including statistical significance testing, trend analysis, and leaderboards
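The kind of comparison behind the last bullet can be illustrated with a simple significance test over per-task pass/fail results from the two arms. This is a simplified sketch using Fisher's exact test from scipy, not mcpbr's actual analytics code, and the outcome lists are made-up data.

```python
from scipy.stats import fisher_exact  # assumes scipy is installed


def compare_arms(baseline: list[bool], with_mcp: list[bool]) -> None:
    """Compare per-task pass/fail outcomes from the baseline and MCP arms."""
    table = [
        [sum(with_mcp), len(with_mcp) - sum(with_mcp)],  # passes, failures with the MCP server
        [sum(baseline), len(baseline) - sum(baseline)],  # passes, failures for the baseline agent
    ]
    _, p_value = fisher_exact(table, alternative="greater")
    uplift = sum(with_mcp) / len(with_mcp) - sum(baseline) / len(baseline)
    print(f"solve-rate uplift: {uplift:+.1%}, p = {p_value:.3f}")


# Illustrative outcomes for 10 tasks per arm (not real benchmark data).
compare_arms(
    baseline=[True, False, False, True, False, False, True, False, False, False],
    with_mcp=[True, True, False, True, True, False, True, True, False, True],
)
```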
Links
| Resource | Link |
|---|---|
| GitHub | github.com/greynewell/mcpbr |
| PyPI | pypi.org/project/mcpbr |
| npm | npmjs.com/package/mcpbr-cli |
| Blog Post | Why I Built mcpbr |
| Creator | greynewell.com |
| SchemaFlux | schemaflux.dev |
| License | MIT |
Created by Grey Newell