Software Engineering Benchmarks
7 benchmarks in this category
-
Aider Polyglot: Multi-Language Code Editing Benchmark
Aider Polyglot evaluates AI agents on code editing tasks across C++, Go, Java, JavaScript, Python, and Rust, using Exercism exercises with language-specific test suites.
-
APPS: 10,000 Coding Problems from Introductory to Competition Level
APPS evaluates AI agents on 10,000 coding problems collected from open-access programming platforms, spanning introductory, interview, and competition difficulty levels.
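As a rough sketch of how an APPS problem might be pulled for inspection, assuming the commonly used codeparrot/apps dataset on Hugging Face and its field names (both are assumptions, not part of this page):

```python
# Sketch: load one APPS problem and look at its test cases.
# Assumes the "codeparrot/apps" Hugging Face dataset and its field names
# ("question", "input_output", "difficulty"); depending on your `datasets`
# version you may also need trust_remote_code=True for script-based datasets.
import json
from datasets import load_dataset

apps = load_dataset("codeparrot/apps", split="test")

problem = apps[0]
print(problem["difficulty"])        # e.g. "introductory", "interview", "competition"
print(problem["question"][:300])    # natural-language problem statement

# Test cases are stored as a JSON string of inputs and expected outputs;
# some entries leave this field empty, so guard before parsing.
tests = json.loads(problem["input_output"]) if problem["input_output"] else {}
print(len(tests.get("inputs", [])), "test cases")
```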
-
BigCodeBench: 1,140 Practical Python Tasks Across 139 Libraries
BigCodeBench evaluates AI agents on 1,140 practical coding tasks requiring function composition from 139 libraries across 7 domains, testing real-world library usage skills.
-
CodeContests: Competitive Programming Benchmark from Codeforces & CodeChef
CodeContests evaluates AI agents on competitive programming problems from Codeforces, CodeChef, and other platforms, featuring public and private test cases with time and memory constraints.
-
CoderEval: Real-World Code Generation from Open-Source Projects
CoderEval evaluates AI agents on pragmatic code generation from real-world open-source projects, testing the ability to implement functions with real project dependencies and context.
-
LeetCode: Algorithmic Coding Problems for AI Agent Evaluation
LeetCode evaluates AI agents on algorithmic coding problems across easy, medium, and hard difficulty levels, covering data structures, algorithms, and common interview topics.
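Most benchmarks in this category grade a submission the same way: run the generated function against a set of input/output test cases and count the problem as solved only if every case passes. A minimal, self-contained sketch of that pass/fail harness (the task, test cases, and function names below are illustrative, not taken from any specific benchmark):

```python
# Minimal sketch of unit-test style grading used by algorithmic coding
# benchmarks: a submission passes only if it matches every expected output.
from typing import Callable, List, Tuple


def grade(candidate: Callable, cases: List[Tuple[tuple, object]]) -> bool:
    """Return True if the candidate produces the expected output on every case."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # runtime errors count as failures
    return True


# Hypothetical "two sum"-style task with a few test cases.
def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []


cases = [
    (([2, 7, 11, 15], 9), [0, 1]),
    (([3, 2, 4], 6), [1, 2]),
    (([3, 3], 6), [0, 1]),
]
print(grade(two_sum, cases))  # True only if all cases pass
```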
-
SWE-bench: Real GitHub Bug Fix Evaluation & Setup Guide
SWE-bench evaluates AI agents on real-world GitHub bug fixes, testing their ability to generate patches that resolve actual software issues from popular Python repositories.
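A rough sketch of what a single SWE-bench task looks like when loaded from the Hugging Face hub; the dataset id princeton-nlp/SWE-bench and the field names are assumptions based on the public release, not part of this page, and the official SWE-bench harness handles the actual patch application and test runs:

```python
# Sketch: inspect one SWE-bench task instance.
# Assumes the "princeton-nlp/SWE-bench" Hugging Face dataset and its
# published field names; use the official SWE-bench harness for real
# patch application and test execution.
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

task = swebench[0]
print(task["repo"])                      # source repository for this issue
print(task["instance_id"])               # unique id for the issue/fix pair
print(task["problem_statement"][:300])   # the GitHub issue text the agent sees

# An agent is scored on whether its generated patch makes the failing
# tests pass without breaking the ones that already passed.
print(task["FAIL_TO_PASS"])
print(task["PASS_TO_PASS"])
```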
Benchmark Your MCP Server
Get hard numbers comparing tool-assisted vs. baseline agent performance on real tasks.