Math & Reasoning Benchmarks

3 benchmarks in this category

BigBench-Hard (BBH): 27 Challenging Reasoning Tasks Beyond Human Baseline
BigBench-Hard (BBH) benchmark for mcpbr - 27 challenging reasoning tasks from BIG-Bench on which earlier language models scored below the average human rater.
GSM8K: Grade-School Math Reasoning Benchmark (1,319 Problems)
GSM8K evaluates mathematical reasoning on 1,319 grade-school math word problems, testing chain-of-thought reasoning and numeric answer extraction with tolerance-based comparison.
MATH: Competition Mathematics Benchmark (AMC, AIME, 12,500 Problems)
MATH benchmark for mcpbr - 12,500 competition mathematics problems from AMC, AIME, and other competitions across 7 subjects and 5 difficulty levels.
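
The GSM8K entry above mentions numeric answer extraction with tolerance-based comparison. As a rough illustration of that idea, here is a minimal Python sketch. It is not mcpbr's actual implementation: the helper names `extract_final_number` and `answers_match`, the "take the last number in the text" rule, and the default tolerances are all assumptions made for this example.

```python
import math
import re
from typing import Optional

# Hypothetical helpers for illustration; mcpbr's real logic may differ.
# Matches integers and decimals, with optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number that appears in text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Drop thousands separators (and any trailing comma) before parsing.
    return float(matches[-1].replace(",", ""))


def answers_match(
    predicted: str,
    expected: str,
    rel_tol: float = 1e-6,  # assumed default, not taken from mcpbr
    abs_tol: float = 1e-6,  # assumed default, not taken from mcpbr
) -> bool:
    """Compare the final numbers in two strings within a tolerance."""
    got = extract_final_number(predicted)
    want = extract_final_number(expected)
    if got is None or want is None:
        return False
    return math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol)
```

For example, `answers_match("The answer is 18.", "#### 18")` compares 18.0 against 18.0 and returns `True`, while a response containing no number at all is treated as a mismatch. Taking the last number is a simple heuristic for chain-of-thought output, where intermediate values usually precede the final answer.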
