Benchmark fixture

Best AI for Code Review

A benchmark fixture page for measuring the precision of code review findings, the number of missed bugs, and reviewer effort.

Status: Fixture ready; no public ranking yet. No winner is published until seeded PR fixtures are reviewed.

Last tested: Not tested. Rankings stay blocked until the run log includes raw outputs or notes, failures, reviewer notes, and a retest date.

Download benchmark run log

Frozen benchmark fixtures
Fixture | Task | Expected evidence
REVIEW-001 | Review a PR with one obvious bug and one subtle edge-case bug. | Finds both with line references.
REVIEW-002 | Review a security-sensitive input path. | Flags injection, validation, or escaping risks with evidence.
REVIEW-003 | Review a clean PR with no seeded bug. | Avoids noisy false positives.
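
Because each fixture seeds known bugs (or none), a review's findings could be compared against them to get per-fixture precision and recall. The sketch below is only an illustration under assumptions: the page does not publish a matching rule, so `score_fixture`, `FixtureResult`, matching by exact line number, and the sample line numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FixtureResult:
    """Counts for one fixture (hypothetical structure, not the benchmark's own)."""
    true_positives: int   # flagged lines that match a seeded bug
    false_positives: int  # flagged lines with no seeded bug
    missed: int           # seeded bugs the review did not flag

    @property
    def precision(self) -> Optional[float]:
        flagged = self.true_positives + self.false_positives
        # Undefined when the review flagged nothing at all.
        return self.true_positives / flagged if flagged else None

    @property
    def recall(self) -> Optional[float]:
        seeded = self.true_positives + self.missed
        # Undefined on a clean fixture such as REVIEW-003 (no seeded bug).
        return self.true_positives / seeded if seeded else None


def score_fixture(seeded_lines: set, flagged_lines: set) -> FixtureResult:
    """Assumed matching rule: a finding counts only if it cites a seeded line."""
    return FixtureResult(
        true_positives=len(seeded_lines & flagged_lines),
        false_positives=len(flagged_lines - seeded_lines),
        missed=len(seeded_lines - flagged_lines),
    )


# REVIEW-001 style case: two seeded bugs (made-up line numbers), three findings.
two_bugs = score_fixture({42, 87}, {42, 87, 120})
print(two_bugs.precision, two_bugs.recall)

# REVIEW-003 style case: clean PR, one noisy finding.
clean = score_fixture(set(), {10})
print(clean.precision, clean.recall)
```

Under this rule a clean fixture has no recall to report, so it can only penalize a review through false positives, which matches the REVIEW-003 expectation of avoiding noisy comments.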

Scoring weights (sum to 100)
Finding precision: 35
Bug recall: 30
Evidence quality: 20
Reviewer effort: 15
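
The four weights add up to 100, so one plausible way to combine them is a weighted sum of per-criterion scores. This is a minimal sketch, assuming each criterion is first normalized to the range 0 to 1; the page does not state how the criteria are scaled or combined, and `WEIGHTS` and `weighted_score` are hypothetical names.

```python
# Weights taken from the rubric; the 0-to-1 normalization is an assumption.
WEIGHTS = {
    "finding_precision": 35,
    "bug_recall": 30,
    "evidence_quality": 20,
    "reviewer_effort": 15,
}


def weighted_score(components: dict) -> float:
    """Return a total out of 100 from per-criterion scores in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


example = {
    "finding_precision": 1.0,  # every finding matched a seeded bug
    "bug_recall": 0.5,         # found one of two seeded bugs
    "evidence_quality": 0.5,
    "reviewer_effort": 1.0,
}
print(weighted_score(example))  # 35 + 15 + 10 + 15 = 75.0
```

Rejecting incomplete input, rather than treating a missing criterion as zero, keeps a partially scored run from looking like a low-scoring one.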

The core question is whether the assistant reduces reviewer burden. Generic review comments score poorly even when they sound plausible.

Run log requirements

This page can move from "rubric ready" to "tested" only after all of the following are published: seeded pull request fixtures, raw review outputs, missed-bug notes, false-positive notes, reviewer decisions, and a retest date.
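
The publication gate above could be expressed as a simple completeness check. The sketch below assumes the run log is available as a dictionary; the field names, `missing_run_log_fields`, and `page_status` are hypothetical, since the page does not specify a run log format.

```python
# Hypothetical field names, one per requirement listed above.
REQUIRED_FIELDS = (
    "seeded_pr_fixtures",
    "raw_review_outputs",
    "missed_bug_notes",
    "false_positive_notes",
    "reviewer_decisions",
    "retest_date",
)


def missing_run_log_fields(run_log: dict) -> list:
    """List required fields that are absent or empty.

    Empty values count as missing, so a run with no missed bugs should
    record that explicitly (for example "none") rather than leave a blank.
    """
    return [name for name in REQUIRED_FIELDS if not run_log.get(name)]


def page_status(run_log: dict) -> str:
    """Return "tested" only when every required field is present."""
    return "rubric ready" if missing_run_log_fields(run_log) else "tested"


partial_log = {
    "seeded_pr_fixtures": ["REVIEW-001", "REVIEW-002", "REVIEW-003"],
    "raw_review_outputs": ["placeholder output"],
}
print(page_status(partial_log))
print(missing_run_log_fields(partial_log))
```

Returning the list of missing fields, not just a boolean, makes it clear what still blocks a ranking from being published.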

Recommendation segments

When evidence exists, recommendations should be segmented for solo maintainers, team reviewers, security-sensitive projects, and teams optimizing for fewer noisy comments.