Math & Reasoning Benchmarks
3 benchmarks in this category
- BigBench-Hard (BBH): 27 Challenging Reasoning Tasks Beyond Human Baseline
  BigBench-Hard benchmark for mcpbr - 27 challenging reasoning tasks from BIG-Bench where language models score below average human performance.
- GSM8K: Grade-School Math Reasoning Benchmark (1,319 Problems)
  GSM8K evaluates mathematical reasoning on 1,319 grade-school math word problems, testing chain-of-thought reasoning and numeric answer extraction with tolerance-based comparison.
- MATH: Competition Mathematics Benchmark (AMC, AIME, 12,500 Problems)
  MATH benchmark for mcpbr - 12,500 competition mathematics problems from AMC, AIME, and other competitions across 7 subjects and 5 difficulty levels.
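The GSM8K entry above mentions numeric answer extraction with tolerance-based comparison. A minimal sketch of what that could look like follows. It is not mcpbr's actual implementation: the function names, the regular expression, and the tolerance values are all hypothetical choices for illustration. The only assumption taken from the dataset itself is that GSM8K reference solutions end with a `#### <answer>` line.

```python
import math
import re

# Hypothetical pattern: optional sign, digits with optional thousands
# separators, optional decimal part.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    # Strip thousands separators (and any trailing comma) before parsing.
    return float(matches[-1].replace(",", ""))


def reference_answer(solution):
    """Parse the number after the final '####' marker of a GSM8K reference solution."""
    return extract_final_number(solution.split("####")[-1])


def is_correct(prediction, solution, rel_tol=1e-6, abs_tol=1e-6):
    """Compare the model's final number to the reference within a tolerance.

    The tolerance defaults are illustrative assumptions, not documented values.
    """
    predicted = extract_final_number(prediction)
    expected = reference_answer(solution)
    if predicted is None or expected is None:
        return False
    return math.isclose(predicted, expected, rel_tol=rel_tol, abs_tol=abs_tol)
```

Taking the last number in the response is a common heuristic for chain-of-thought output, where intermediate values appear before the final answer. It can misfire when the model appends extra figures after its answer, so a real harness may need a stricter answer format.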
Created by Grey Newell