Software Engineering Benchmarks
7 benchmarks in this category
-
Aider Polyglot: Multi-Language Code Editing Benchmark
Aider Polyglot evaluates AI agents on code editing tasks across C++, Go, Java, JavaScript, Python, and Rust, using Exercism exercises with language-specific test suites.
-
APPS: 10,000 Coding Problems from Introductory to Competition Level
APPS evaluates AI agents on 10,000 coding problems collected from open-access programming platforms, spanning introductory, interview, and competition difficulty levels.
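As a rough sketch of how an APPS problem might be pulled for inspection, assuming the commonly used codeparrot/apps dataset on Hugging Face and its field names (both are assumptions, not part of this page):

```python
# Sketch: load one APPS problem and look at its test cases.
# Assumes the "codeparrot/apps" Hugging Face dataset and its field names
# ("question", "input_output", "difficulty"); depending on your `datasets`
# version you may also need trust_remote_code=True for script-based datasets.
import json
from datasets import load_dataset

apps = load_dataset("codeparrot/apps", split="test")

problem = apps[0]
print(problem["difficulty"])        # e.g. "introductory", "interview", "competition"
print(problem["question"][:300])    # natural-language problem statement

# Test cases are stored as a JSON string of inputs and expected outputs;
# some entries leave this field empty, so guard before parsing.
tests = json.loads(problem["input_output"]) if problem["input_output"] else {}
print(len(tests.get("inputs", [])), "test cases")
```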
-
BigCodeBench: 1,140 Practical Python Tasks Across 139 Libraries
BigCodeBench evaluates AI agents on 1,140 practical coding tasks requiring function composition from 139 libraries across 7 domains, testing real-world library usage skills.
-
CodeContests: Competitive Programming Benchmark from Codeforces & CodeChef
CodeContests evaluates AI agents on competitive programming problems from Codeforces, CodeChef, and other platforms, featuring public and private test cases with time and memory constraints.
-
CoderEval: Real-World Code Generation from Open-Source Projects
CoderEval evaluates AI agents on pragmatic code generation from real-world open-source projects, testing the ability to implement functions with real project dependencies and context.
-
LeetCode: Algorithmic Coding Problems for AI Agent Evaluation
LeetCode evaluates AI agents on algorithmic coding problems across easy, medium, and hard difficulty levels, covering data structures, algorithms, and common interview topics.
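Most benchmarks in this category grade a submission the same way: run the generated function against a set of input/output test cases and count the problem as solved only if every case passes. A minimal, self-contained sketch of that pass/fail harness (the task, test cases, and function names below are illustrative, not taken from any specific benchmark):

```python
# Minimal sketch of unit-test style grading used by algorithmic coding
# benchmarks: a submission passes only if it matches every expected output.
from typing import Callable, List, Tuple


def grade(candidate: Callable, cases: List[Tuple[tuple, object]]) -> bool:
    """Return True if the candidate produces the expected output on every case."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # runtime errors count as failures
    return True


# Hypothetical "two sum"-style task with a few test cases.
def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []


cases = [
    (([2, 7, 11, 15], 9), [0, 1]),
    (([3, 2, 4], 6), [1, 2]),
    (([3, 3], 6), [0, 1]),
]
print(grade(two_sum, cases))  # True only if all cases pass
```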
-
SWE-bench: Real GitHub Bug Fix Evaluation & Setup Guide
SWE-bench evaluates AI agents on real-world GitHub bug fixes, testing their ability to generate patches that resolve actual software issues from popular Python repositories.
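A rough sketch of what a single SWE-bench task looks like when loaded from the Hugging Face hub; the dataset id princeton-nlp/SWE-bench and the field names are assumptions based on the public release, not part of this page, and the official SWE-bench harness handles the actual patch application and test runs:

```python
# Sketch: inspect one SWE-bench task instance.
# Assumes the "princeton-nlp/SWE-bench" Hugging Face dataset and its
# published field names; use the official SWE-bench harness for real
# patch application and test execution.
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

task = swebench[0]
print(task["repo"])                      # source repository for this issue
print(task["instance_id"])               # unique id for the issue/fix pair
print(task["problem_statement"][:300])   # the GitHub issue text the agent sees

# An agent is scored on whether its generated patch makes the failing
# tests pass without breaking the ones that already passed.
print(task["FAIL_TO_PASS"])
print(task["PASS_TO_PASS"])
```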
Benchmark Your MCP Server
Get hard numbers comparing tool-assisted vs. baseline agent performance on real tasks.