AI Coding Benchmarks Have Been Marking One in Four Correct Solutions Wrong

In May 2026, a small company called Datacurve published a coding agent benchmark called DeepSWE. The announcement was modest — 113 tasks, 91 repositories, a new harness. What made it unusual was the prelude: before presenting their own results, Wenqi Huang, Charley Lee, Leonard Tng, and Serena Ge spent several pages explaining what is wrong with the benchmark their work is trying to replace. Specifically, they audited 789 rollouts on 30 randomly selected SWE-bench Pro tasks and found that the verifiers — the automated graders that decide whether a coding agent’s solution is correct — reject correct implementations at a rate of 24.0%. That is one in four right answers marked wrong. The false positive rate, where wrong implementations get through, was 8.5%.
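
To make the two error rates concrete, here is a minimal sketch of how they could be computed from audit labels. It assumes each rate is conditional on the true status of the solution (false negatives among correct solutions, false positives among incorrect ones). The `Rollout` fields and the toy data are hypothetical and are not taken from Datacurve's audit.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    truly_correct: bool    # hypothetical label assigned by a human auditor
    verifier_passed: bool  # what the automated grader reported

def verifier_error_rates(rollouts):
    """Return (false_negative_rate, false_positive_rate) for a set of rollouts."""
    correct = [r for r in rollouts if r.truly_correct]
    incorrect = [r for r in rollouts if not r.truly_correct]
    # False negative: a correct solution the verifier rejected.
    fn = sum(1 for r in correct if not r.verifier_passed)
    # False positive: an incorrect solution the verifier accepted.
    fp = sum(1 for r in incorrect if r.verifier_passed)
    fn_rate = fn / len(correct) if correct else 0.0
    fp_rate = fp / len(incorrect) if incorrect else 0.0
    return fn_rate, fp_rate

# Toy data: 4 correct solutions (1 rejected), 10 incorrect ones (1 accepted).
toy = (
    [Rollout(True, True)] * 3 + [Rollout(True, False)]
    + [Rollout(False, False)] * 9 + [Rollout(False, True)]
)
fn_rate, fp_rate = verifier_error_rates(toy)
print(f"false negative rate: {fn_rate:.2f}, false positive rate: {fp_rate:.2f}")
```

In the toy data, one rejected solution out of four correct ones reproduces the "one in four" pattern the audit describes.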

SWE-bench Pro is the dominant yardstick for frontier coding agents. When a lab announces that its model leads on software engineering benchmarks, SWE-bench Pro is usually what they mean. The original SWE-bench, introduced at ICLR 2024, sourced tasks from real GitHub issues and PRs — a clever design that meant real engineers had already validated the problem was solvable. SWE-bench Pro extended that approach with harder, longer tasks and a broader repository pool. It became the field’s shared standard because it was the hardest credible benchmark available and because the Princeton NLP team that built swebench.com made it freely accessible. The leaderboard it generates is cited in investor memos, used to justify pricing tiers, and quoted in the same breath as model releases. For the past year, every percentage point on that leaderboard has been a racing stripe. The Datacurve audit says roughly a quarter of those stripes were painted on wrong.

Why the Verifiers Break

The structural problem is in how SWE-bench Pro tasks are constructed. Most SWE-bench tasks are sourced from real GitHub pull requests: the test suite that merged with the PR becomes the verifier. This is elegant in principle — you inherit human-validated correctness tests for free. The problem is that those tests were written to verify the merged implementation’s behavior, not to evaluate arbitrary future submissions against the task’s behavioral specification. A correct alternative solution that doesn’t match the specific code paths the original tests probe will fail the verifier even if it solves the problem. That is the false negative. An incorrect submission that happens to satisfy the tests will pass even if the implementation is broken in ways the tests don’t cover. That is the false positive.
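
A toy illustration of the false negative mechanism, under invented assumptions: the task, both implementations, and both verifiers below are hypothetical and do not come from SWE-bench Pro. The inherited-style check probes a private attribute that only the merged implementation happens to have, so an equally correct alternative fails it.

```python
# Hypothetical task: "unique(items) returns the items in first-seen order".

class MergedImpl:
    """Stand-in for the implementation that landed with the original PR."""
    def unique(self, items):
        self._seen = set()  # internal detail, not part of the task
        out = []
        for x in items:
            if x not in self._seen:
                self._seen.add(x)
                out.append(x)
        return out

class AlternativeImpl:
    """A different but equally correct solution."""
    def unique(self, items):
        return list(dict.fromkeys(items))  # dicts preserve insertion order

def coupled_verifier(impl):
    # Checks the output AND a private attribute of the merged implementation.
    result = impl.unique([3, 1, 3, 2, 1])
    return result == [3, 1, 2] and getattr(impl, "_seen", None) == {1, 2, 3}

def behavioral_verifier(impl):
    # Checks only what a caller can observe through the public method.
    return impl.unique([3, 1, 3, 2, 1]) == [3, 1, 2] and impl.unique([]) == []

for impl in (MergedImpl(), AlternativeImpl()):
    name = type(impl).__name__
    print(name, "coupled:", coupled_verifier(impl),
          "behavioral:", behavioral_verifier(impl))
```

The alternative is rejected only by the check that depends on how the merged code was written, which is the failure mode described above.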

Datacurve’s verifiers work differently. Rather than inheriting test suites from merged PRs, each verifier is purpose-written by a human after reading the task description. It tests through public APIs and observable outputs — the system’s behavior rather than its implementation details. Any solution that makes the system behave as specified passes, regardless of how it was written. In their own quality audit, conducted on 735 reviewed rollouts, the DeepSWE verifiers produced a 0.3% false positive rate and a 1.1% false negative rate. The gap between these numbers and SWE-bench Pro’s is not marginal. It is the difference between a grader that makes a handful of errors and one that systematically fails a quarter of correct work.

There is also a contamination issue. DeepSWE’s tasks are written from scratch, not adapted from existing commits or PRs, and they are never merged back into upstream repositories. The point is to keep solutions out of future pre-training data and to avoid the leakage documented in the SWE-bench+ audit, published in October 2024 at arxiv.org/abs/2410.06992, which found that roughly a third of “successful” patches on the original SWE-bench had the answer sitting in the issue report or its comments. The Datacurve team estimates that contamination and false positives affected roughly 8% of the rollouts in their own SWE-bench Pro audit. The 24.0% false negative rate is a separate problem that sits on top of that.

The Benchmark Mismatch Problem

Contamination and verifier errors are the quantifiable problems. There is a quieter structural mismatch underneath them. SWE-bench Pro’s mean task requires an agent to add 120 lines of code across 5 files. The mean DeepSWE task requires 668 lines across 7 files, about 5.6 times as much code. DeepSWE prompts are deliberately kept short (2,158 characters on average) to reflect how developers actually talk to agents: behavioral, underspecified, exploratory. The model has to figure out where the change needs to happen. SWE-bench Verified, the older standard, averages just 10 lines and 1 file. SWE-bench Pro, the benchmark the industry graduated to, is larger than its predecessor but still a fraction of what a real long-horizon engineering task looks like.
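
The size comparison is simple arithmetic, and a short sketch makes it easy to recheck. The figures are the mean lines and files quoted above. The per-file averages are a derived illustration, not a statistic reported by Datacurve.

```python
# Mean task sizes as quoted in the article: (lines added, files touched).
TASK_SIZE = {
    "SWE-bench Verified": (10, 1),
    "SWE-bench Pro": (120, 5),
    "DeepSWE": (668, 7),
}

def line_ratio(bigger, smaller):
    """How many times more lines the mean task in `bigger` adds than in `smaller`."""
    return TASK_SIZE[bigger][0] / TASK_SIZE[smaller][0]

def lines_per_file(name):
    """Derived figure: mean lines divided by mean files (illustrative only)."""
    lines, files = TASK_SIZE[name]
    return lines / files

for name in TASK_SIZE:
    print(f"{name}: {lines_per_file(name):.1f} lines per file")
print(f"DeepSWE vs SWE-bench Pro: {line_ratio('DeepSWE', 'SWE-bench Pro'):.2f}x")
print(f"SWE-bench Pro vs Verified: {line_ratio('SWE-bench Pro', 'SWE-bench Verified'):.1f}x")
```

The DeepSWE to SWE-bench Pro ratio comes out at a little over five and a half, and SWE-bench Pro is itself twelve times the size of SWE-bench Verified by the same measure.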

This is not a criticism unique to SWE-bench. Any benchmark that sources tasks from existing pull requests has a structural selection bias toward changes that humans found tractable enough to actually merge. Hard, multi-file architectural changes tend not to land as clean PRs with tidy test suites. They get squashed, rebased, or replaced by a different approach. The benchmark ends up reflecting a distribution of software tasks that skews toward the well-specified and self-contained, because those are the ones that make clean commits.

What the Corrected Leaderboard Shows

On DeepSWE’s 113 tasks, run through a standardized harness called mini-swe-agent, the leaderboard looks different from what SWE-bench Pro produces. As of May 2026, GPT-5.5 at extended compute leads with 70% ±4%, followed by GPT-5.4 at 56% ±5% and Claude Opus 4.7 at 54% ±5%. Claude Sonnet 4.6 scores 32% ±4%. Gemini 3.5-Flash at 28% ±4% and GPT-5.4-mini at 24% ±4% are effectively tied. Eleven places below GPT-5.5, Gemini 3-Flash scores 5% ±2%.

The wide error bars matter. At ±5%, Claude Opus 4.7 and GPT-5.4 are statistically tied — their reported scores of 54% and 56% sit within each other’s confidence intervals. What the leaderboard does show clearly is the gap structure: a large tier separation between GPT-5.5 and everyone else, a tight cluster of competitive models in the mid-range, and a long tail where much-hyped open-weight models like DeepSeek-v4-pro (8% ±2%) are nowhere near the frontier. The GLM-5.1 score of 18% ±3% and Gemini 3.1-pro’s 10% ±3% underscore that international competition on truly novel long-horizon tasks remains uneven. These gaps are not visible on SWE-bench Pro because the shorter, better-specified tasks compress the distribution. When the problems get harder and the verifiers get honest, the performance spread widens. This distribution is plausible. The question is whether it reflects the models’ actual capability or the harness.
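
One rough way to read the error bars is to check whether two models' intervals overlap. The sketch below does that with the scores quoted above. It assumes the ± figures are symmetric half-widths, and it treats overlap as a quick heuristic, not a formal significance test. The function names are illustrative.

```python
# (score %, half-width %) as reported in the article.
SCORES = {
    "GPT-5.5": (70, 4),
    "GPT-5.4": (56, 5),
    "Claude Opus 4.7": (54, 5),
    "Gemini 3.5-Flash": (28, 4),
    "GPT-5.4-mini": (24, 4),
}

def interval(name):
    """Return (low, high) for a model's reported score and half-width."""
    score, half = SCORES[name]
    return score - half, score + half

def intervals_overlap(a, b):
    """True if the reported intervals for models a and b share any value."""
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    return lo_a <= hi_b and lo_b <= hi_a

pairs = [
    ("GPT-5.4", "Claude Opus 4.7"),
    ("GPT-5.5", "GPT-5.4"),
    ("Gemini 3.5-Flash", "GPT-5.4-mini"),
]
for a, b in pairs:
    verdict = "overlap" if intervals_overlap(a, b) else "separated"
    print(f"{a} {interval(a)} vs {b} {interval(b)}: {verdict}")
```

On these numbers GPT-5.5 is the only model in the list whose interval stands clear of the one below it.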

The Critique That Needs to Land

Datacurve’s audit methodology is genuinely valuable. A 24% false negative rate on a benchmark the field uses to rank frontier models is an important finding, and DeepSWE’s own verifier error rates (0.3% false positive, 1.1% false negative over 735 rollouts) set a substantially higher bar. But there are legitimate questions about DeepSWE itself that the paper acknowledges and the coverage will probably miss.

The standardized harness is the most consequential of them. Every model in DeepSWE runs through mini-swe-agent, which exposes a single bash tool and a fixed system prompt. The authors explicitly note that “routing every edit through bash may hold them below their native ceiling.” GPT-5.5 and GPT-5.4, the top two models, both run at the benchmark’s extended compute tier. The paper’s comparison between mini-swe-agent and native harnesses rests on a 10-task pilot, which is not a sample you can build strong conclusions from. The standardization prevents harness effects from contaminating comparisons between models, but it also compresses the top end of performance. Any model that performs better with proprietary tooling, and several of the front-runners do, gets measured at a floor, not a ceiling.

The sample size is another constraint. At 113 tasks, error bars of roughly ±2% to ±5% are unavoidable. The qualitative per-language and per-repository analysis the paper includes is described by its own authors as “illustrative” rather than statistically robust. A benchmark this size is sufficient for ordering the top tier and identifying catastrophic failures. It is not sufficient for distinguishing close pairs with confidence.
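
A back-of-the-envelope check shows why error bars of this size come with a 113-task benchmark. The sketch assumes each task is an independent pass or fail with the same success probability, so the score follows a binomial distribution. That is a simplifying assumption made here. The article does not say how Datacurve computed its intervals.

```python
import math

N_TASKS = 113  # number of DeepSWE tasks, as stated in the article

def binomial_standard_error(pass_rate, n_tasks=N_TASKS):
    """Standard error of a pass rate estimated from n independent pass/fail tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)

for label, p in [("70%", 0.70), ("54%", 0.54), ("5%", 0.05)]:
    se_points = 100 * binomial_standard_error(p)
    print(f"score {label}: standard error of about {se_points:.1f} points")
```

Under this assumption the standard error is largest for scores near 50% and shrinks toward the extremes, the same pattern the reported ±5% and ±2% figures follow. Because the standard error scales with one over the square root of the task count, halving it would take roughly four times as many tasks.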

The bigger structural question is about Datacurve itself. This is a company benchmark, not an academic one. The authors have no apparent financial interest in any of the labs whose models they test, and the methodology is reproducible — the tasks and verifiers will be released. But the AI field has a history of benchmark providers whose independence is compromised by the same labs they rate. Datacurve is not currently in that category. The community should keep watching to ensure it stays that way.

None of this neutralizes the audit finding. A 24% false negative rate on SWE-bench Pro is damaging regardless of how good or imperfect DeepSWE is. The two claims are independent. You can believe that existing benchmarks have been systematically misfiling correct solutions and also believe that DeepSWE’s own leaderboard should be read with appropriate uncertainty about the harness. Both are true. The field has been running a race where the finish line was in the wrong place. DeepSWE has moved the line. The more durable contribution is not its own scores but the precedent it sets: purpose-written behavioral verifiers, contamination-free task construction, and a published audit of the benchmark being replaced. Every benchmark that aspires to replace SWE-bench Pro should be required to produce the same. The labs benchmarking their flagship models on internal evaluations should be asking whether their own verifiers have this problem too. The honest answer is that most of them haven’t checked.

AI-generated editorial illustration · TemperatureZero · May 27, 2026
