ARC: AI2 Reasoning Challenge for Grade-School Science Questions

At a Glance

Property       Value
Benchmark ID   arc
Dataset        allenai/ai2_arc
Tasks          7,787 questions (Challenge + Easy)
Evaluation     Letter match against answer key
Output Type    Single letter (A-E)
Timeout        60-180 seconds

Quick Start

mcpbr run -c config.yaml --benchmark arc -n 20

Overview

The AI2 Reasoning Challenge (ARC) is a benchmark consisting of 7,787 genuine grade-school level science questions assembled from standardized tests. The dataset is partitioned into two subsets: ARC-Challenge (2,590 questions), which contains harder questions that require reasoning beyond simple retrieval, and ARC-Easy (5,197 questions), which contains simpler questions answerable with retrieval or word co-occurrence.

Each question is multiple-choice with 3 to 5 answer options, labeled with letters (A through E) or occasionally numbers (1 through 5). The questions cover grade-school science topics such as life science, earth science, and physical science.

ARC is useful for evaluating scientific knowledge recall, multi-step reasoning over that knowledge, and the ability to commit to a single multiple-choice answer.

Task Structure

Each ARC task contains a question stem, a set of labeled answer options, and an answerKey identifying the correct option.

The agent receives the question with all labeled answer options and must respond with the letter of the correct answer.
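
For reference, the tasks are drawn from the Hugging Face allenai/ai2_arc dataset. The sketch below is not part of mcpbr and assumes the datasets library is installed; it simply shows how a single record and its fields can be inspected:

# Sketch: inspect one ARC-Challenge test record from the Hugging Face dataset.
# This is only for exploring the task structure; mcpbr loads the data itself.
from datasets import load_dataset

arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
example = arc[0]

print(example["question"])                    # question stem
for label, text in zip(example["choices"]["label"], example["choices"]["text"]):
    print(f"  ({label}) {text}")              # labeled answer options
print("Answer key:", example["answerKey"])    # correct option label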

Example Task

Answer the following science question:

A student is studying the properties of different materials. Which of the
following is the best conductor of electricity?

Options:
  (A) Glass rod
  (B) Copper wire
  (C) Rubber band
  (D) Wooden stick

Correct Answer: B

Running the Benchmark

CLI

# Run ARC-Challenge (default)
mcpbr run -c config.yaml --benchmark arc

# Run a small sample
mcpbr run -c config.yaml --benchmark arc -n 20

# Run ARC-Easy subset
mcpbr run -c config.yaml --benchmark arc --filter-difficulty easy

# Explicitly run ARC-Challenge
mcpbr run -c config.yaml --benchmark arc --filter-difficulty challenge

# Run with verbose output and save results
mcpbr run -c config.yaml --benchmark arc -n 50 -v -o results.json

YAML

benchmark: "arc"
sample_size: 10
timeout_seconds: 120

# Optional: select the Easy subset instead of Challenge
filter_difficulty:
  - "easy"

Difficulty Filtering

ARC uses filter_difficulty to select between the two subsets:

Filter Value   Subset          Questions   Description
challenge      ARC-Challenge   2,590       Hard questions requiring reasoning (default)
easy           ARC-Easy        5,197       Simpler questions answerable with retrieval

When no filter_difficulty is specified, ARC-Challenge is used by default. If multiple values are provided, the last recognized value takes precedence.

Evaluation Methodology

ARC evaluation uses letter-matching:

  1. Answer extraction: The evaluation uses a regex pattern \b([A-E1-5])\b to find all standalone letters (A-E) or digits (1-5) in the model's uppercase response.

  2. Last-match selection: The last matching letter or digit is used as the model's answer. This accommodates reasoning where the model discusses multiple options before settling on a final answer.

  3. Case-insensitive comparison: The extracted answer is compared to the answerKey field. Both are uppercased before comparison.
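
A minimal sketch of this extraction and comparison logic is shown below; the function and variable names are illustrative, not mcpbr's actual internals:

import re

# Illustrative sketch of the extraction and scoring described above;
# names are hypothetical, not taken from mcpbr's source.
ANSWER_PATTERN = re.compile(r"\b([A-E1-5])\b")

def extract_answer(response: str) -> str | None:
    """Return the last standalone letter (A-E) or digit (1-5) in the response."""
    matches = ANSWER_PATTERN.findall(response.upper())
    return matches[-1] if matches else None

def is_resolved(response: str, answer_key: str) -> bool:
    """True when the extracted answer matches the answer key, compared case-insensitively."""
    extracted_answer = extract_answer(response)
    return extracted_answer is not None and extracted_answer == answer_key.upper()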

Scoring

resolved = (extracted_answer.upper() == answer_key.upper())

Where extracted_answer is the letter or digit pulled from the model's response by the extraction step above, and answer_key is the answerKey field from the dataset.

Answer Format

The model should respond with a single letter. The evaluation tolerates several response formats, illustrated below: a bare letter ("B"), a letter embedded in a sentence ("The answer is B"), a parenthesized letter ("(B)"), and a numeric label ("2") when the question uses numeric options.
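
For instance, with the extract_answer sketch from the evaluation section above, the first four calls below each yield a usable answer, and the last one demonstrates the last-match pitfall covered under Troubleshooting:

# Behavior of the illustrative extract_answer sketch on common response shapes.
print(extract_answer("B"))                                # bare letter -> "B"
print(extract_answer("The answer is (B)."))               # parenthesized letter -> "B"
print(extract_answer("I think the correct option is B"))  # letter inside a sentence -> "B"
print(extract_answer("2"))                                # numeric label -> "2"

# Last match wins: trailing discussion can override the intended answer.
print(extract_answer("The answer is B. Options A and D are clearly wrong."))  # -> "D"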

Example Output

Successful Evaluation

{
  "resolved": true,
  "agent_answer": "B",
  "correct_answer": "B"
}

Failed Evaluation (Wrong Answer)

{
  "resolved": false,
  "agent_answer": "A",
  "correct_answer": "B"
}

Failed Evaluation (No Answer Extracted)

{
  "resolved": false,
  "error": "Could not extract answer from solution"
}

This occurs when the model's response does not contain any standalone letter in A-E or digit in 1-5.
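
Continuing with the illustrative extract_answer sketch, a response with no standalone answer token extracts nothing:

# No standalone A-E letter or 1-5 digit, so no answer is extracted.
print(extract_answer("I am not sure which material conducts best."))  # -> None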

Troubleshooting

Model discusses all options without clear final answer

If the model provides analysis of each option without stating a final letter, the regex extraction uses the last letter mentioned, which may be incorrect. Configure your prompt to enforce a clear final answer:

agent_prompt: |
  {problem_statement}

  Think through the science carefully, then respond with ONLY the letter of the correct answer (A, B, C, D, or E).

Evaluation extracts wrong letter from verbose response

Since the evaluation takes the last letter match, a response like "The answer is B. Options A and D are clearly wrong." would extract D instead of B. Instruct the model to place its final answer at the end of the response, or to respond with only the letter.

Switching between Challenge and Easy returns unexpected counts

When using filter_difficulty, pass exactly one value. If both "easy" and "challenge" are specified, the implementation checks for "easy" first and then for "challenge", so the last matching condition determines the subset (see the sketch below).
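
A minimal sketch of that selection order, under the behavior described above (illustrative only; the actual implementation may differ):

# Illustrative sketch of the subset-selection order described above; not mcpbr's code.
def select_subset(filter_difficulty: list[str]) -> str:
    subset = "ARC-Challenge"              # default when no filter is given
    if "easy" in filter_difficulty:
        subset = "ARC-Easy"               # "easy" is checked first
    if "challenge" in filter_difficulty:
        subset = "ARC-Challenge"          # checked second; wins when both values are present
    return subset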

Questions with numeric answer labels (1-5)

Some ARC questions use numeric labels (1, 2, 3, 4, 5) instead of letters (A, B, C, D, E). The evaluation handles both formats automatically through the regex pattern.

Best Practices

Instruct the model to respond with only the final answer letter and to place it at the very end of any reasoning, so the last-match extraction picks it up. Validate your prompt and configuration on a small sample first (for example, -n 20) before running the full benchmark, and save results with -o for later comparison.

Related Links

Frequently Asked Questions

What is the difference between ARC-Challenge and ARC-Easy?

ARC-Challenge contains questions that require multi-step reasoning and cannot be answered by simple retrieval or word co-occurrence. ARC-Easy contains the remaining questions. By default, mcpbr uses ARC-Challenge. Use filter_difficulty with 'easy' or 'challenge' to switch subsets.

How many answer options does each ARC question have?

ARC questions have between 3 and 5 answer options, typically labeled A through E. The model must respond with the letter of the correct answer.

How does ARC evaluation extract the model's answer?

The evaluation uses regex to find standalone letters (A-E) or digits (1-5) in the model's response, takes the last match, and compares it case-insensitively against the answer key.

mcpbr — Open-Source Benchmark Runner for MCP Servers

mcpbr is developed by Grey Newell, a Computer Science researcher at Georgia Institute of Technology specializing in AI systems and the Model Context Protocol.
