# mcpbr

> Open-source benchmark runner for evaluating MCP servers and AI agents across 25+ benchmarks

## Documentation

- [Installation](https://mcpbr.org/installation.html)
- [Configuration](https://mcpbr.org/configuration.html)
- [CLI Reference](https://mcpbr.org/cli.html)
- [About](https://mcpbr.org/about.html)

## Benchmark Categories

### Code Generation

- [HumanEval: OpenAI Python Programming Benchmark (164 Problems)](https://mcpbr.org/humaneval.html): HumanEval evaluates AI agents on 164 Python programming problems from OpenAI, testing code generation from function signatures and docstrings with unit test verification.
- [MBPP: Mostly Basic Python Programming Problems Benchmark](https://mcpbr.org/mbpp.html): MBPP evaluates AI agents on ~1,000 crowd-sourced Python programming problems designed for entry-level programmers.

### Code Understanding

- [RepoQA: Long-Context Code Understanding & Function Search](https://mcpbr.org/repoqa.html): RepoQA evaluates long-context code understanding by testing whether agents can find and identify specific functions within large repository codebases.

### Knowledge & QA

- [ARC: AI2 Reasoning Challenge for Grade-School Science Questions](https://mcpbr.org/arc.html): ARC (AI2 Reasoning Challenge) evaluates grade-school science reasoning with multiple-choice questions.
- [GAIA: General AI Assistant Benchmark for Reasoning & Tool Use](https://mcpbr.org/gaia.html): GAIA evaluates general AI assistant capabilities, including multi-step reasoning, web browsing, tool use, and multi-modality, on real-world questions with unambiguous answers.
- [HellaSwag: Commonsense Reasoning Through Sentence Completion](https://mcpbr.org/hellaswag.html): HellaSwag evaluates commonsense reasoning through adversarially filtered sentence completion.
- [TruthfulQA: Evaluating AI Truthfulness & Misconception Resistance](https://mcpbr.org/truthfulqa.html): TruthfulQA evaluates truthfulness and resistance to common misconceptions across 38 categories.

### Math & Reasoning

- [BigBench-Hard (BBH): 27 Challenging Reasoning Tasks Beyond Human Baseline](https://mcpbr.org/bigbench-hard.html): BigBench-Hard evaluates AI agents on 27 challenging reasoning tasks from BIG-Bench on which language models score below average human performance.
- [GSM8K: Grade-School Math Reasoning Benchmark (1,319 Problems)](https://mcpbr.org/gsm8k.html): GSM8K evaluates mathematical reasoning on 1,319 grade-school math word problems, testing chain-of-thought reasoning and numeric answer extraction with tolerance-based comparison.
- [MATH: Competition Mathematics Benchmark (AMC, AIME, 12,500 Problems)](https://mcpbr.org/math.html): MATH evaluates AI agents on 12,500 competition mathematics problems from AMC, AIME, and other competitions across 7 subjects and 5 difficulty levels.

### ML Research

- [MLAgentBench: Real ML Research Tasks from Kaggle Competitions](https://mcpbr.org/mlagentbench.html): MLAgentBench evaluates AI agents on real ML research tasks based on Kaggle competitions, testing their ability to train models, improve performance metrics, and debug ML pipelines.

### Security

- [CyberGym: Cybersecurity Exploit Generation Benchmark for AI Agents](https://mcpbr.org/cybergym.html): CyberGym is a cybersecurity benchmark from UC Berkeley in which agents generate proof-of-concept exploits for real C/C++ vulnerabilities across four difficulty levels.
### Software Engineering

- [Aider Polyglot: Multi-Language Code Editing Benchmark](https://mcpbr.org/aider-polyglot.html): Aider Polyglot evaluates AI agents on code editing tasks across Python, JavaScript, Go, Rust, and Java, using Exercism exercises with language-specific test suites.
- [APPS: 10,000 Coding Problems from Introductory to Competition Level](https://mcpbr.org/apps.html): APPS evaluates AI agents on 10,000 coding problems collected from open-access programming platforms, spanning introductory, interview, and competition difficulty levels.
- [BigCodeBench: 1,140 Practical Python Tasks Across 139 Libraries](https://mcpbr.org/bigcodebench.html): BigCodeBench evaluates AI agents on 1,140 practical coding tasks requiring function composition from 139 libraries across 7 domains, testing real-world library usage skills.
- [CodeContests: Competitive Programming Benchmark from Codeforces & CodeChef](https://mcpbr.org/codecontests.html): CodeContests evaluates AI agents on competitive programming problems from Codeforces, CodeChef, and other platforms, featuring public and private test cases with time and memory constraints.
- [CoderEval: Real-World Code Generation from Open-Source Projects](https://mcpbr.org/codereval.html): CoderEval evaluates AI agents on pragmatic code generation from real-world open-source projects, testing the ability to implement functions with real project dependencies and context.
- [LeetCode: Algorithmic Coding Problems for AI Agent Evaluation](https://mcpbr.org/leetcode.html): LeetCode evaluates AI agents on algorithmic coding problems across easy, medium, and hard difficulty levels, covering data structures, algorithms, and common interview topics.
- [SWE-bench: Real GitHub Bug Fix Evaluation & Setup Guide](https://mcpbr.org/swe-bench.html): SWE-bench evaluates AI agents on real-world GitHub bug fixes, testing their ability to generate patches that resolve actual software issues from popular Python repositories.

### Tool Use & Agents

- [AgentBench: Autonomous Agent Evaluation Across OS, Database & Web](https://mcpbr.org/agentbench.html): AgentBench evaluates LLMs as autonomous agents across diverse environments, including operating systems, databases, and the web.
- [InterCode: Interactive Coding with Bash, SQL & Python Interpreters](https://mcpbr.org/intercode.html): InterCode evaluates agents on interactive coding tasks requiring multi-turn interaction with Bash, SQL, and Python interpreters through observation-action loops.
- [MCPToolBench++: MCP Tool Discovery, Selection & Invocation Benchmark](https://mcpbr.org/mcptoolbench.html): MCPToolBench++ evaluates AI agents on MCP tool discovery, selection, invocation, and result interpretation across 45+ categories with accuracy-threshold-based evaluation.
- [TerminalBench: Shell & Terminal Task Evaluation for AI Agents](https://mcpbr.org/terminalbench.html): TerminalBench evaluates AI agents on practical terminal and shell tasks, including file manipulation, system administration, scripting, and command-line tool usage, with validation-command-based evaluation.
- [ToolBench: Real-World API Tool Selection & Invocation Benchmark](https://mcpbr.org/toolbench.html): ToolBench evaluates real-world API tool selection and invocation with proper parameters.
- [WebArena: Autonomous Web Agent Evaluation in Realistic Environments](https://mcpbr.org/webarena.html): WebArena evaluates autonomous web agents in realistic web environments featuring functional e-commerce sites, forums, content management systems, and maps, with multi-step interaction tasks.

## Author

- [Grey Newell](https://greynewell.com)

## Source

- [GitHub](https://github.com/supermodeltools/mcpbr)