HellaSwag: Commonsense Reasoning Through Sentence Completion


At a Glance

Property       Value
Benchmark ID   hellaswag
Dataset        Rowan/hellaswag
Tasks          ~10,000 validation examples
Evaluation     Exact match of selected option (0-3) against correct label
Output Type    Single digit (0-3)
Timeout        60-180 seconds

Quick Start

mcpbr run -c config.yaml --benchmark hellaswag -n 20

Overview

HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) is a commonsense reasoning benchmark where the model must choose the most plausible continuation of a scenario from four options. The dataset was created through adversarial filtering, meaning the incorrect options are specifically designed to fool language models while remaining easy for humans to distinguish.

Human accuracy on HellaSwag is approximately 95%, making it a strong test of whether language models truly understand commonsense physical and social situations rather than relying on surface-level statistical patterns.

Key characteristics of HellaSwag:

- Each task presents a short scenario followed by four candidate continuations, exactly one of which is correct.
- Incorrect continuations were produced by adversarial filtering, so they read as statistically plausible to language models while remaining implausible to humans.
- Scenarios are drawn from ActivityNet video captions and WikiHow articles.
- Human accuracy is around 95%, making the benchmark easy for people but historically difficult for language models.

Task Structure

Each HellaSwag task contains the following fields:

- ctx: the scenario text that the continuation must follow
- endings: a list of four candidate continuations
- label: the index (0-3) of the correct continuation
- activity_label: the activity category, taken from WikiHow article titles or ActivityNet captions

The agent receives the context and all four options, then must select the most plausible continuation by responding with the corresponding number.
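
As a rough illustration of how these fields map onto a prompt, here is a minimal Python sketch. The exact template mcpbr uses is not shown here and may differ; this only demonstrates the dataset fields involved.

from datasets import load_dataset

# Load the validation split used by the benchmark.
ds = load_dataset("Rowan/hellaswag", split="validation")
task = ds[0]

# Assemble a prompt from the context and the four candidate endings.
options = "\n".join(f"({i}) {ending}" for i, ending in enumerate(task["endings"]))
prompt = (
    f"Scenario: {task['ctx']}\n\n"
    f"Options:\n{options}\n\n"
    "Respond with ONLY the number of the most plausible continuation (0-3)."
)
print(prompt)
print("Correct answer:", task["label"])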

Example Task

Scenario: A woman is seen sitting at a table with a bowl in front of her.
She picks up a whisk and begins stirring the contents of the bowl.

Options:
  (0) She then places the bowl in the oven and waits.
  (1) She adds flour to the bowl and continues mixing until smooth.
  (2) She throws the bowl across the room and walks away.
  (3) She picks up a phone and makes a call while stirring.

Correct Answer: 1

Running the Benchmark

CLI

# Run HellaSwag with default settings
mcpbr run -c config.yaml --benchmark hellaswag

# Run a small sample
mcpbr run -c config.yaml --benchmark hellaswag -n 20

# Filter by activity type
mcpbr run -c config.yaml --benchmark hellaswag --filter-category "baking cookies"

# Run with verbose output and save results
mcpbr run -c config.yaml --benchmark hellaswag -n 100 -v -o results.json

YAML

benchmark: "hellaswag"
sample_size: 10
timeout_seconds: 120

# Optional: filter to specific activity types
filter_category:
  - "baking cookies"
  - "playing basketball"

Activity Label Filtering

HellaSwag supports filtering by activity label using filter_category. Activity labels describe the type of scenario and come from WikiHow article titles and ActivityNet captions. Examples include "baking cookies" and "playing basketball"; the inspection script under Troubleshooting can be used to list the exact labels present in the dataset.

Activity label matching is case-insensitive.
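
For reference, case-insensitive matching against the dataset's activity_label field can be sketched as follows. The variable names are illustrative, not mcpbr internals.

from datasets import load_dataset

ds = load_dataset("Rowan/hellaswag", split="validation")

# Illustrative filter list; comparison is done on lowercased labels.
wanted = {"baking cookies", "playing basketball"}
subset = [item for item in ds if item["activity_label"].lower() in wanted]
print(f"{len(subset)} tasks match the requested activity labels")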

Evaluation Methodology

HellaSwag evaluation follows a straightforward approach:

  1. Answer extraction: The evaluation uses a regex pattern \b([0-3])\b to find all standalone digits in the range 0-3 within the model's response.

  2. Last-match selection: The last matching digit in the response is used as the model's answer. This handles cases where the model discusses multiple options before stating its final choice.

  3. Exact comparison: The extracted digit is compared to the ground truth label field. The task is resolved if and only if they match exactly.

Scoring

resolved = (extracted_answer == correct_label)

Where:

- extracted_answer is the last standalone digit in the range 0-3 found in the model's response
- correct_label is the ground-truth label field from the dataset
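
A minimal sketch of this extraction and comparison logic, illustrating the rules above rather than mcpbr's actual implementation:

import re

def evaluate(response: str, correct_label: str) -> bool:
    # Collect every standalone digit 0-3 and keep the last one,
    # mirroring the last-match rule described above.
    matches = re.findall(r"\b([0-3])\b", response)
    if not matches:
        return False  # no answer could be extracted
    return matches[-1] == correct_label

print(evaluate("Option 0 is unlikely; the answer is 1.", "1"))  # True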

Answer Format

The model should respond with a single digit. The evaluation is forgiving about surrounding text -- it will extract the answer from responses like:

"1"
"The answer is 1."
"Option 1 is the most plausible continuation."

Example Output

Successful Evaluation

{
  "resolved": true,
  "agent_answer": "1",
  "correct_label": "1"
}

Failed Evaluation (Wrong Answer)

{
  "resolved": false,
  "agent_answer": "3",
  "correct_label": "1"
}

Failed Evaluation (No Answer Extracted)

{
  "resolved": false,
  "error": "Could not extract option number from solution"
}

This occurs when the model's response does not contain any standalone digit in the range 0-3.

Troubleshooting

Model provides reasoning but no clear answer

If the model gives a long explanation without a clear numeric answer, the regex extraction may fail or pick the wrong number. Configure your agent prompt to explicitly request a single digit:

agent_prompt: |
  {problem_statement}

  Respond with ONLY the number of the correct option (0, 1, 2, or 3). Do not include any explanation.

Evaluation picks the wrong number from the response

The evaluation uses the last digit (0-3) found in the response. If the model discusses multiple options (e.g., "Option 0 is unlikely, option 2 is too extreme, so 1 is best"), it correctly selects 1. However, if the model says "I choose 1, as options 0 and 3 are wrong", it would incorrectly select 3. Instruct the model to state its final answer last.
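
A quick check of that pitfall using the same regex as the evaluation (illustrative only):

import re

good = "Option 0 is unlikely, option 2 is too extreme, so 1 is best"
bad = "I choose 1, as options 0 and 3 are wrong"

# The last standalone 0-3 digit wins, so the order of mentions matters.
print(re.findall(r"\b([0-3])\b", good)[-1])  # "1" -- extracted correctly
print(re.findall(r"\b([0-3])\b", bad)[-1])   # "3" -- wrong answer extracted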

Low accuracy despite seemingly reasonable answers

HellaSwag was adversarially filtered specifically to trick language models. The incorrect options are designed to be statistically plausible even though they are wrong from a commonsense perspective. This is by design and measures genuine commonsense reasoning rather than pattern matching.

Filter category returns no results

Activity labels in HellaSwag can be quite specific (e.g., "Baking cookies" rather than just "cooking"). Inspect the dataset to find exact labels:

uv run python -c "
from datasets import load_dataset
ds = load_dataset('Rowan/hellaswag', split='validation')
labels = sorted(set(item['activity_label'] for item in ds))
for label in labels[:20]:
    print(label)
print(f'... ({len(labels)} total labels)')
"

Best Practices

- Instruct the agent to answer with a single digit (0-3) and to state its final answer last.
- Start with a small sample (e.g., -n 20) to verify answer extraction before running the full validation set.
- When using filter_category, inspect the dataset's activity labels first, since labels are specific phrases rather than broad topics.

Related Links

Frequently Asked Questions

What does HellaSwag measure?

HellaSwag measures commonsense reasoning ability by presenting a scenario and four possible continuations. The model must select the most plausible option. The dataset is adversarially filtered so that incorrect options are deceptively plausible to language models while easy for humans (~95% accuracy).

How should the model format its answer for HellaSwag?

The model should respond with a single digit (0, 1, 2, or 3) corresponding to the chosen option. The evaluation extracts the last occurrence of a digit in the 0-3 range from the response using regex matching.

Can I filter HellaSwag tasks by topic?

Yes, use the filter_category parameter with activity labels such as "baking cookies" or "playing basketball". Labels are specific phrases from WikiHow and ActivityNet rather than broad topics like "cooking", and they are matched case-insensitively.
