SWE-bench: Real GitHub Bug Fix Evaluation & Setup Guide


Overview

Property Value
Benchmark ID swe-bench-verified, swe-bench-lite, swe-bench-full
Dataset SWE-bench/SWE-bench_Verified, SWE-bench/SWE-bench_Lite, SWE-bench/SWE-bench
Tasks 500 (Verified), 300 (Lite), 2,294 (Full)
Evaluation Apply unified diff patch, run FAIL_TO_PASS and PASS_TO_PASS test suites
Output Type Patch (unified diff)
Timeout 300-600s recommended

Quick Start

mcpbr run -c config.yaml --benchmark swe-bench-verified

Overview

SWE-bench is the gold-standard benchmark for evaluating AI agents on real-world software engineering tasks. Each task is a genuine GitHub issue from a popular open-source Python repository, and the agent must produce a unified diff patch that resolves the bug. The evaluation verifies the fix by running the repository's test suite, checking that previously failing tests now pass while existing passing tests remain unbroken.
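For reference, the only artifact the agent hands back is a standard unified diff. The file, function, and one-line change below are purely illustrative rather than taken from a real task:

--- a/utils/validators.py
+++ b/utils/validators.py
@@ -12,2 +12,2 @@ def normalize_username(value):
     # Normalize user-supplied usernames before lookup.
-    return value.strip()
+    return value.strip().lower()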

SWE-bench is widely used by the research community and industry to measure progress in automated software engineering. mcpbr supports all three official variants: Verified (500 manually validated tasks), Lite (a curated 300-task subset), and Full (the complete 2,294-task dataset).

Pre-built Docker images from Epoch AI are available for most tasks. These images include the repository checked out at the correct commit with all dependencies pre-installed and validated, providing faster and more reproducible evaluations.

Task Structure

Each SWE-bench task includes the following components: a problem statement (the text of the GitHub issue), the repository and base commit to work from, a gold reference patch (withheld from the agent), a test patch that adds or updates tests, and the FAIL_TO_PASS and PASS_TO_PASS test lists used for scoring.

The agent receives the problem statement and access to the repository at the base commit. It must analyze the codebase, identify the root cause, and generate a minimal patch that resolves the issue.
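An abridged task record from the Hugging Face dataset looks roughly like the following. The values are placeholders, and the two test-list fields are stored as JSON-encoded strings in the raw dataset (shown here as plain arrays for readability):

{
  "instance_id": "django__django-11099",
  "repo": "django/django",
  "base_commit": "<40-character commit SHA>",
  "problem_statement": "<GitHub issue text describing the bug>",
  "patch": "<gold reference fix, withheld from the agent>",
  "test_patch": "<diff adding the tests listed in FAIL_TO_PASS>",
  "FAIL_TO_PASS": ["<tests that must flip from failing to passing>"],
  "PASS_TO_PASS": ["<tests that must keep passing>"]
}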

Running the Benchmark

CLI

# Run SWE-bench Verified (default, manually validated tests)
mcpbr run -c config.yaml --benchmark swe-bench-verified

# Run SWE-bench Lite (300 tasks, quick testing)
mcpbr run -c config.yaml --benchmark swe-bench-lite

# Run SWE-bench Full (2,294 tasks, comprehensive)
mcpbr run -c config.yaml --benchmark swe-bench-full

# Run a sample of 20 tasks
mcpbr run -c config.yaml --benchmark swe-bench-verified -n 20

# Run specific tasks by instance ID
mcpbr run -c config.yaml --benchmark swe-bench-verified -t django__django-11099

# Filter by repository
mcpbr run -c config.yaml --benchmark swe-bench-verified --filter-category django

# Filter by multiple repositories
mcpbr run -c config.yaml --benchmark swe-bench-verified \
  --filter-category django --filter-category scikit-learn

YAML

# SWE-bench Verified (recommended)
benchmark: "swe-bench-verified"
sample_size: 10
timeout_seconds: 300
use_prebuilt_images: true

# Optional: Filter by repository
filter_category:
  - "django"
  - "scikit-learn"

Alternative variant configurations:

# SWE-bench Lite for quick testing
benchmark: "swe-bench-lite"
sample_size: 20
timeout_seconds: 300

# SWE-bench Full for comprehensive evaluation
benchmark: "swe-bench-full"
sample_size: 50
timeout_seconds: 600

Evaluation Methodology

SWE-bench evaluation follows a rigorous multi-step process:

  1. Patch Generation: The agent analyzes the repository and produces a unified diff patch targeting the bug described in the problem statement.
  2. Patch Application: The generated patch is applied to the repository at the base commit using standard git apply or patch utilities.
  3. Test Patch Application: If the task includes a test patch (additional tests that verify the fix), it is applied on top of the agent's changes.
  4. FAIL_TO_PASS Verification: The tests listed in FAIL_TO_PASS are executed. All of these tests must now pass, confirming the bug has been fixed.
  5. PASS_TO_PASS Verification: The tests listed in PASS_TO_PASS are executed. All of these tests must continue to pass, confirming no regressions were introduced.
  6. Resolution: A task is marked as resolved only if the patch applies cleanly, all FAIL_TO_PASS tests pass, and all PASS_TO_PASS tests remain passing.

The evaluation uses pre-built Docker images when available (use_prebuilt_images: true), which include the repository at the correct commit with all Python dependencies installed and validated. This eliminates environment setup variability and produces more reliable results.
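Conceptually, the per-task check reduces to the shell steps below. The paths and test IDs are placeholders, and the actual harness drives each repository's own test runner inside the task container rather than calling pytest directly:

cd /testbed                            # repository checked out at the task's base commit
git apply /tmp/model.patch             # step 2: apply the agent's patch
git apply /tmp/test.patch              # step 3: apply the task's test patch, if present
# step 4: every FAIL_TO_PASS test must now pass
python -m pytest tests/test_validators.py::test_trailing_newline_rejected
# step 5: every PASS_TO_PASS test must still pass
python -m pytest tests/test_validators.py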

Example Output

Successful Resolution

{
  "instance_id": "django__django-11099",
  "resolved": true,
  "patch_applied": true,
  "fail_to_pass": {
    "passed": 3,
    "total": 3
  },
  "pass_to_pass": {
    "passed": 47,
    "total": 47
  }
}

Failed Resolution

{
  "instance_id": "scikit-learn__scikit-learn-13779",
  "resolved": false,
  "patch_applied": true,
  "fail_to_pass": {
    "passed": 1,
    "total": 2
  },
  "pass_to_pass": {
    "passed": 45,
    "total": 47
  }
}

Patch Application Failure

{
  "instance_id": "sympy__sympy-18199",
  "resolved": false,
  "patch_applied": false,
  "eval_error": "Patch failed to apply: hunks FAILED -- saving rejects to file"
}

Troubleshooting

Patch fails to apply cleanly
The agent's patch may target incorrect line numbers or file paths. Ensure the agent is working with the correct version of the repository. Pre-built images guarantee the repository is at the exact base commit. If building from scratch, verify the checkout succeeded.
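When debugging a failed application locally, git can dry-run the patch against the checked-out repository and report which hunks fail (the patch path below is a placeholder):

git apply --check /tmp/model.patch   # verify the patch applies without modifying any files
git apply --3way /tmp/model.patch    # optionally fall back to a 3-way merge if context has drifted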

PASS_TO_PASS tests fail after patch
The agent introduced a regression. This often happens when the fix is too broad or modifies shared utility functions. Encourage the agent to make minimal, targeted changes by using a focused prompt template.

Evaluation times out
SWE-bench tasks involving large repositories or complex test suites may need longer timeouts. Increase timeout_seconds to 600 or higher for repositories like Django or Matplotlib. Tasks from smaller repositories like Flask typically complete within 300 seconds.
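For heavyweight repositories, raise the per-task limit in the YAML config:

benchmark: "swe-bench-verified"
timeout_seconds: 600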

Docker image pull fails
Pre-built images from Epoch AI may not be available for all tasks. If an image pull fails, mcpbr falls back to building the environment from scratch. Set use_prebuilt_images: false to always build from scratch, though this is slower and less reliable.
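To force every environment to be built from source instead of pulling images:

benchmark: "swe-bench-verified"
use_prebuilt_images: false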


Frequently Asked Questions

What is the difference between SWE-bench Verified, Lite, and Full?

SWE-bench Verified (500 tasks) contains manually validated test cases for the most reliable evaluation. Lite (300 tasks) is a curated subset ideal for quick testing. Full (2,294 tasks) is the complete dataset for comprehensive research evaluations.

Do I need pre-built Docker images for SWE-bench?

Pre-built images from Epoch AI are recommended but not required. They include the repository at the correct commit with all dependencies pre-installed, which provides faster and more reliable evaluation. Without them, mcpbr will build environments from scratch.

How do I filter SWE-bench tasks by repository?

Use the filter_category option with repository name patterns. For example, '--filter-category django' will select only Django-related tasks. You can specify multiple categories to include tasks from several repositories.

How do I set up the SWE-bench repository environment and install dependencies for Django tasks?

mcpbr handles repository environment setup automatically when you use pre-built Docker images (use_prebuilt_images: true). Each image contains the repository checked out at the correct commit with all Python dependencies pre-installed. For Django tasks specifically, this includes Django itself and its test dependencies. Without pre-built images, mcpbr clones the repository and runs the installation steps from scratch, which can be slower and less reliable for complex projects like Django.

What Python repositories are included in SWE-bench?

SWE-bench includes tasks from 12 popular Python repositories: django/django, scikit-learn/scikit-learn, matplotlib/matplotlib, sympy/sympy, pytest-dev/pytest, astropy/astropy, psf/requests, sphinx-doc/sphinx, pydata/xarray, pylint-dev/pylint, mwaskom/seaborn, and pallets/flask.

How long does a SWE-bench evaluation take to run?

Individual tasks typically complete in 60-300 seconds. With pre-built Docker images, a 10-task sample usually finishes in 10-20 minutes. Running the full SWE-bench Verified (500 tasks) can take several hours depending on your hardware and the MCP server's response time. Use the -n flag to set sample size for faster iteration.
