APPS: 10,000 Coding Problems from Introductory to Competition Level


At a Glance

Property       Value
Benchmark ID   apps
Dataset        metr-evals/apps
Tasks          10,000 coding problems
Evaluation     stdin/stdout test case comparison
Output Type    Test pass rate
Timeout        180-300s recommended

Quick Start

mcpbr run -c config.yaml --benchmark apps

Overview

APPS (Automated Programming Progress Standard) is a large-scale coding benchmark containing 10,000 problems collected from open-access coding websites such as Codeforces, Kattis, and other competitive programming platforms. The benchmark tests a broad range of programming skills, from basic string manipulation and arithmetic to advanced algorithmic problem solving involving dynamic programming, graph theory, and complex data structures.

Each problem provides a natural language description along with input/output specifications and test cases. The agent must generate a Python program that reads from standard input, processes the data according to the problem requirements, and writes the correct output to standard output.
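For illustration, a solution to a hypothetical "sum of N integers" problem (not an actual APPS task) would follow this stdin/stdout pattern:

# Hypothetical problem: read N, then N integers, and print their sum.
# Real APPS solutions follow the same read-from-stdin, write-to-stdout pattern.
def main():
    n = int(input())
    numbers = [int(x) for x in input().split()]
    print(sum(numbers))

if __name__ == "__main__":
    main()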

APPS problems are categorized into three difficulty levels: introductory (basic programming concepts), interview (typical coding interview questions), and competition (competitive programming difficulty).

Task Structure

Each APPS task includes the following components: a natural language problem statement with input/output specifications, a difficulty label (introductory, interview, or competition), and a set of test cases pairing input strings with expected outputs.

The agent receives the problem statement with difficulty information and must produce a self-contained Python program that handles all specified test cases correctly.
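As a rough illustration, a task record might look like the sketch below; the field names are assumptions made for readability, not the actual metr-evals/apps dataset schema:

# Hypothetical shape of an APPS task record (field names are illustrative,
# not the actual metr-evals/apps dataset schema).
task = {
    "instance_id": "apps_42",
    "difficulty": "introductory",
    "problem_statement": "Given N integers, print their sum.",
    "test_cases": [
        {"input": "3\n1 2 3\n", "expected_output": "6\n"},
        {"input": "1\n-5\n", "expected_output": "-5\n"},
    ],
}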

Running the Benchmark

CLI

# Run APPS with default settings
mcpbr run -c config.yaml --benchmark apps

# Run a sample of 20 problems
mcpbr run -c config.yaml --benchmark apps -n 20

# Run a specific task
mcpbr run -c config.yaml --benchmark apps -t apps_42

# Filter by difficulty level
mcpbr run -c config.yaml --benchmark apps --filter-difficulty introductory

# Filter for interview and competition problems only
mcpbr run -c config.yaml --benchmark apps \
  --filter-difficulty interview --filter-difficulty competition

YAML

benchmark: "apps"
sample_size: 10
timeout_seconds: 300

# Optional: Filter by difficulty
filter_difficulty:
  - "introductory"

Configuration for harder problems:

benchmark: "apps"
sample_size: 20
timeout_seconds: 300

filter_difficulty:
  - "interview"
  - "competition"

Evaluation Methodology

APPS evaluation is based on direct input/output comparison:

  1. Solution Writing: The agent's generated code is written to solution.py inside the Docker container.
  2. Test Case Execution: For each test case, the input string is piped to the solution via stdin.
  3. Output Comparison: The program's stdout is captured and compared (after stripping whitespace) against the expected output string.
  4. Pass Rate Calculation: The evaluation counts the number of test cases passed out of the total. A task is marked as resolved only if all test cases pass.
  5. Result Reporting: The result includes the number of passed tests, total tests, and the overall pass rate.

Each test case execution has an individual timeout of 10 seconds to prevent infinite loops or excessively slow solutions from blocking the evaluation.
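A minimal sketch of this evaluation loop, assuming a solution.py in the working directory and test cases given as (input, expected output) pairs; this is illustrative, not the actual mcpbr implementation:

import subprocess

# Hypothetical test cases; in practice these come from the dataset.
test_cases = [("3\n1 2 3\n", "6"), ("1\n-5\n", "-5")]

passed = 0
for stdin_data, expected in test_cases:
    try:
        # Pipe the test input to the solution and capture its stdout,
        # with a per-test timeout to stop runaway solutions.
        result = subprocess.run(
            ["python3", "solution.py"],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=10,
        )
    except subprocess.TimeoutExpired:
        continue  # counts as a failed test case
    # Compare after stripping leading/trailing whitespace.
    if result.stdout.strip() == expected.strip():
        passed += 1

total = len(test_cases)
print({
    "passed": passed,
    "total": total,
    "pass_rate": passed / total,
    "resolved": passed == total,
})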

Example Output

Successful Resolution

{
  "instance_id": "apps_42",
  "resolved": true,
  "passed": 5,
  "total": 5,
  "pass_rate": 1.0
}

Partial Pass

{
  "instance_id": "apps_157",
  "resolved": false,
  "passed": 3,
  "total": 5,
  "pass_rate": 0.6
}

No Test Cases

{
  "instance_id": "apps_999",
  "resolved": false,
  "error": "No test cases provided"
}

Troubleshooting

Solution produces wrong output format
APPS problems require exact output matching. Ensure the agent does not include extra whitespace, trailing newlines, or debug print statements. The comparison strips leading and trailing whitespace, but any difference in the body of the output will cause a failure.

Solution times out on individual test cases
Each test case has a 10-second execution limit. Competition-level problems with large inputs may require optimized algorithms. If the agent produces a brute-force solution, it may pass small test cases but time out on larger ones. Encourage the agent to analyze time complexity constraints.
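As a generic illustration of the kind of optimization that matters (not a specific APPS problem), repeated range-sum queries answered with prefix sums run in O(n + q) instead of the brute-force O(n·q):

import sys

def main():
    data = sys.stdin.read().split()
    idx = 0
    n, q = int(data[idx]), int(data[idx + 1]); idx += 2
    a = [int(data[idx + i]) for i in range(n)]; idx += n

    # Precompute prefix sums once: prefix[i] = a[0] + ... + a[i-1].
    prefix = [0] * (n + 1)
    for i, v in enumerate(a):
        prefix[i + 1] = prefix[i] + v

    out = []
    for _ in range(q):
        l, r = int(data[idx]), int(data[idx + 1]); idx += 2  # 1-based, inclusive
        # O(1) per query instead of re-summing the slice for every query.
        out.append(str(prefix[r] - prefix[l - 1]))
    print("\n".join(out))

if __name__ == "__main__":
    main()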

Test case parsing fails
If the test cases in the dataset cannot be parsed as JSON, the evaluation returns an error. This is rare but can occur with malformed entries. Use -t/--task to run other tasks while skipping the problematic one, and report the issue.

Solution reads input incorrectly
APPS problems expect input to be read from stdin. The agent must use input() or sys.stdin in Python. Solutions that hardcode test values or read from files will fail. Verify that the prompt template instructs the agent to use stdin/stdout.
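A common, robust pattern is to read all of stdin at once and tokenize it, which also avoids the overhead of repeated input() calls on large inputs; the toy problem here (print the maximum of N integers) is hypothetical:

import sys

def main():
    # Read the entire input once and tokenize it, rather than
    # hardcoding test values or reading from a file.
    tokens = sys.stdin.read().split()
    n = int(tokens[0])
    values = list(map(int, tokens[1:1 + n]))
    print(max(values))

if __name__ == "__main__":
    main()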

Frequently Asked Questions

What difficulty levels does APPS support?

APPS has three difficulty levels: introductory (basic programming concepts), interview (typical coding interview questions), and competition (competitive programming difficulty). Use filter_difficulty to select specific levels.

How are APPS solutions evaluated?

Solutions are evaluated by running the generated code with provided test case inputs via stdin and comparing the stdout output against expected outputs. All test cases must pass for a task to be marked as resolved.

What format should APPS solutions use?

Solutions must be Python programs that read from stdin and write to stdout. The agent should save its solution to a file named 'solution.py' in the working directory.
