# mcpbr

> Open-source benchmark runner for evaluating MCP servers and AI agents across 25+ benchmarks

## Documentation

- [Installation](https://mcpbr.org/installation.html)
- [Configuration](https://mcpbr.org/configuration.html)
- [CLI Reference](https://mcpbr.org/cli.html)
- [About](https://mcpbr.org/about.html)

## Benchmark Categories

### Code Generation

- [HumanEval: OpenAI Python Programming Benchmark (164 Problems)](https://mcpbr.org/humaneval.html): HumanEval evaluates AI agents on 164 Python programming problems from OpenAI, testing code generation from function signatures and docstrings with unit test verification.
- [MBPP: Mostly Basic Python Programming Problems Benchmark](https://mcpbr.org/mbpp.html): MBPP evaluates AI agents on ~1,000 crowd-sourced Python programming problems designed for entry-level programmers.

### Code Understanding

- [RepoQA: Long-Context Code Understanding & Function Search](https://mcpbr.org/repoqa.html): RepoQA evaluates long-context code understanding by testing whether agents can find and identify specific functions within large repository codebases.

### Knowledge & QA

- [ARC: AI2 Reasoning Challenge for Grade-School Science Questions](https://mcpbr.org/arc.html): ARC (AI2 Reasoning Challenge) evaluates grade-school science reasoning with multiple-choice questions.
- [GAIA: General AI Assistant Benchmark for Reasoning & Tool Use](https://mcpbr.org/gaia.html): GAIA evaluates general AI assistant capabilities, including multi-step reasoning, web browsing, tool use, and multi-modality, on real-world questions with unambiguous answers.
- [HellaSwag: Commonsense Reasoning Through Sentence Completion](https://mcpbr.org/hellaswag.html): HellaSwag evaluates commonsense reasoning through adversarially filtered sentence completion.
- [TruthfulQA: Evaluating AI Truthfulness & Misconception Resistance](https://mcpbr.org/truthfulqa.html): TruthfulQA evaluates truthfulness and resistance to common misconceptions across 38 categories.

### Math & Reasoning

- [BigBench-Hard (BBH): 27 Challenging Reasoning Tasks Beyond Human Baseline](https://mcpbr.org/bigbench-hard.html): BigBench-Hard evaluates AI agents on 27 challenging reasoning tasks from BIG-Bench on which language models score below average human performance.
- [GSM8K: Grade-School Math Reasoning Benchmark (1,319 Problems)](https://mcpbr.org/gsm8k.html): GSM8K evaluates mathematical reasoning on 1,319 grade-school math word problems, testing chain-of-thought reasoning and numeric answer extraction with tolerance-based comparison.
- [MATH: Competition Mathematics Benchmark (AMC, AIME, 12,500 Problems)](https://mcpbr.org/math.html): MATH evaluates AI agents on 12,500 competition mathematics problems from AMC, AIME, and other competitions across 7 subjects and 5 difficulty levels.

### ML Research

- [MLAgentBench: Real ML Research Tasks from Kaggle Competitions](https://mcpbr.org/mlagentbench.html): MLAgentBench evaluates AI agents on real ML research tasks based on Kaggle competitions, testing their ability to train models, improve performance metrics, and debug ML pipelines.

### Security

- [CyberGym: Cybersecurity Exploit Generation Benchmark for AI Agents](https://mcpbr.org/cybergym.html): CyberGym is a cybersecurity benchmark from UC Berkeley in which agents generate proof-of-concept exploits for real C/C++ vulnerabilities across four difficulty levels.
### Software Engineering

- [Aider Polyglot: Multi-Language Code Editing Benchmark](https://mcpbr.org/aider-polyglot.html): Aider Polyglot evaluates AI agents on code editing tasks across Python, JavaScript, Go, Rust, and Java, using Exercism exercises with language-specific test suites.
- [APPS: 10,000 Coding Problems from Introductory to Competition Level](https://mcpbr.org/apps.html): APPS evaluates AI agents on 10,000 coding problems collected from open-access programming platforms, spanning introductory, interview, and competition difficulty levels.
- [BigCodeBench: 1,140 Practical Python Tasks Across 139 Libraries](https://mcpbr.org/bigcodebench.html): BigCodeBench evaluates AI agents on 1,140 practical coding tasks requiring function composition from 139 libraries across 7 domains, testing real-world library usage skills.
- [CodeContests: Competitive Programming Benchmark from Codeforces & CodeChef](https://mcpbr.org/codecontests.html): CodeContests evaluates AI agents on competitive programming problems from Codeforces, CodeChef, and other platforms, featuring public and private test cases with time and memory constraints.
- [CoderEval: Real-World Code Generation from Open-Source Projects](https://mcpbr.org/codereval.html): CoderEval evaluates AI agents on pragmatic code generation from real-world open-source projects, testing the ability to implement functions with real project dependencies and context.
- [LeetCode: Algorithmic Coding Problems for AI Agent Evaluation](https://mcpbr.org/leetcode.html): LeetCode evaluates AI agents on algorithmic coding problems across easy, medium, and hard difficulty levels, covering data structures, algorithms, and common interview topics.
- [SWE-bench: Real GitHub Bug Fix Evaluation & Setup Guide](https://mcpbr.org/swe-bench.html): SWE-bench evaluates AI agents on real-world GitHub bug fixes, testing their ability to generate patches that resolve actual software issues from popular Python repositories.

### Tool Use & Agents

- [AgentBench: Autonomous Agent Evaluation Across OS, Database & Web](https://mcpbr.org/agentbench.html): AgentBench evaluates LLMs as autonomous agents across diverse environments, including operating systems, databases, and the web.
- [InterCode: Interactive Coding with Bash, SQL & Python Interpreters](https://mcpbr.org/intercode.html): InterCode evaluates agents on interactive coding tasks requiring multi-turn interaction with Bash, SQL, and Python interpreters through observation-action loops.
- [MCPToolBench++: MCP Tool Discovery, Selection & Invocation Benchmark](https://mcpbr.org/mcptoolbench.html): MCPToolBench++ evaluates AI agents on MCP tool discovery, selection, invocation, and result interpretation across 45+ categories with accuracy-threshold-based evaluation.
- [TerminalBench: Shell & Terminal Task Evaluation for AI Agents](https://mcpbr.org/terminalbench.html): TerminalBench evaluates AI agents on practical terminal and shell tasks, including file manipulation, system administration, scripting, and command-line tool usage, with validation-command-based evaluation.
- [ToolBench: Real-World API Tool Selection & Invocation Benchmark](https://mcpbr.org/toolbench.html): ToolBench evaluates real-world API tool selection and invocation with proper parameters.
- [WebArena: Autonomous Web Agent Evaluation in Realistic Environments](https://mcpbr.org/webarena.html): WebArena evaluates autonomous web agents in realistic web environments featuring functional e-commerce sites, forums, content management systems, and maps, with multi-step interaction tasks.

## Author

- [Grey Newell](https://greynewell.com)

## Source

- [GitHub](https://github.com/supermodeltools/mcpbr)