Eval Rubric Design

A practical guide to designing AI evaluation rubrics with clear scoring dimensions, weights, failure labels, and decision thresholds for teams.

A useful eval rubric turns subjective review into repeatable scoring dimensions that map to a decision. It should make tradeoffs visible: correctness versus speed, source faithfulness versus fluency, review effort versus automation, and safety versus task completion.

Rubrics fail when they collapse everything into one vague score. A model can be correct but hard to review, fluent but unsupported, fast but unsafe, or safe but too incomplete for the task. The rubric should show those differences before anyone argues about a winner.

Use this guide with the LLM Evaluation Framework for the full eval loop and Benchmark Methodology for public-facing benchmark discipline.

Start from the decision

Design the rubric after naming the decision. Are you choosing a prompt variant, approving a workflow for internal use, comparing candidate tools, or deciding whether a model change caused a regression?

The decision determines what matters. A support bot needs faithfulness, no-answer behavior, and escalation. A code review assistant needs finding precision, bug recall, evidence quality, and reviewer effort. A documentation workflow needs source faithfulness, example correctness, clarity, and update burden.

If the decision is unclear, the rubric will collect interesting numbers that nobody uses.

Write the decision at the top of the rubric. A small team may need to choose whether a prompt can ship. A platform team may need to decide whether a model update regressed a workflow. Those decisions should not share the same threshold by default.
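
One way to keep the decision attached to the rubric is to record it as data next to its own threshold. The following is a minimal sketch, not a prescribed format: the `RubricDecision` class, its field names, and the threshold values are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricDecision:
    """The decision a rubric exists to support (hypothetical structure)."""
    question: str          # the decision, stated as a question
    owner: str             # who acts on the result
    pass_threshold: float  # minimum weighted score (1-5 scale) needed to act


# Illustrative values only; each team should set its own thresholds.
ship_prompt = RubricDecision(
    question="Can this prompt variant ship to internal users?",
    owner="small product team",
    pass_threshold=3.5,
)
model_regression = RubricDecision(
    question="Did the model update regress the support workflow?",
    owner="platform team",
    pass_threshold=4.2,
)


def meets_threshold(decision: RubricDecision, weighted_score: float) -> bool:
    """Return True when the score clears the bar for this specific decision."""
    return weighted_score >= decision.pass_threshold


print(meets_threshold(ship_prompt, 3.8))       # clears the shipping bar
print(meets_threshold(model_regression, 3.8))  # same score, stricter bar
```

The same score of 3.8 passes one decision and fails the other, which is the reason to avoid a shared default threshold.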

Choose scoring dimensions

Pick dimensions that reflect real user or operator risk. Common dimensions include:

Correctness: the output solves the task.
Faithfulness: claims are supported by sources.
Completeness: important requirements are not omitted.
Format compliance: output follows the required schema.
Safety: unsafe or unsupported requests are refused.
Review effort: humans can verify the output quickly.
Latency: response time fits the workflow.
Cost: usage is acceptable for the expected volume.

Do not include a dimension just because it is measurable. A metric that does not affect a decision creates noise.
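
The "does it affect the decision" filter can be made explicit when drafting a rubric. This sketch assumes a hypothetical `Dimension` record with a `decision_impact` note; the example impacts are invented for illustration, and any dimension without a stated impact is dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    definition: str
    decision_impact: str  # how a low score would change the decision; "" if none


# Definitions come from the list above; the impact notes are hypothetical.
candidates = [
    Dimension("faithfulness", "claims are supported by sources",
              "unsupported claims block a customer-facing launch"),
    Dimension("format_compliance", "output follows the required schema",
              "schema breaks fail downstream parsing"),
    Dimension("latency", "response time fits the workflow", ""),
]


def select_dimensions(dims):
    """Keep only dimensions whose score could change the decision."""
    return [d for d in dims if d.decision_impact.strip()]


selected = select_dimensions(candidates)
print([d.name for d in selected])
```

Here latency is measurable but has no stated effect on the decision, so it is left out of the rubric rather than scored and ignored.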

Define score levels

Each dimension needs clear score levels. Avoid labels such as “good” or “bad” without evidence. A better rubric says what a reviewer should see.

For source faithfulness:

5: all material claims are directly supported by cited sources.
3: main answer is supported, but caveats or minor claims need review.
1: answer includes unsupported or contradicted material claims.

For review effort:

5: reviewer can approve or reject quickly from visible evidence.
3: reviewer must inspect sources or run checks manually.
1: output requires substantial rewrite or independent investigation.

This makes scoring more consistent across reviewers.
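
The score levels above can be stored as data so that a scoring sheet or script shows reviewers the anchor text alongside the number. This is a sketch under assumptions: the `SCORE_LEVELS` name and the choice to reject undefined levels such as 2 or 4 are illustrative, and a team could define intermediate anchors instead.

```python
# Observable anchors per dimension, copied from the levels above.
SCORE_LEVELS = {
    "source_faithfulness": {
        5: "all material claims are directly supported by cited sources",
        3: "main answer is supported, but caveats or minor claims need review",
        1: "answer includes unsupported or contradicted material claims",
    },
    "review_effort": {
        5: "reviewer can approve or reject quickly from visible evidence",
        3: "reviewer must inspect sources or run checks manually",
        1: "output requires substantial rewrite or independent investigation",
    },
}


def describe_score(dimension: str, score: int) -> str:
    """Return the anchor text for a score, or raise if the level is undefined."""
    levels = SCORE_LEVELS[dimension]
    if score not in levels:
        raise ValueError(
            f"{dimension} has no anchor for {score}; use one of {sorted(levels)}"
        )
    return levels[score]


print(describe_score("review_effort", 3))
```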

When possible, score two sample outputs together before running the full eval. That calibration pass exposes vague language in the rubric and prevents reviewers from using different standards for the same score.
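
A calibration pass can be summarized with a simple agreement check. The sketch below assumes two reviewers record scores keyed by sample and dimension; the function name, the `tolerance` parameter, and the scores are hypothetical. Exact-match agreement is one simple measure among several a team could use.

```python
def calibration_report(scores_a: dict, scores_b: dict, tolerance: int = 0) -> dict:
    """Compare two reviewers' scores keyed by (sample_id, dimension).

    Returns the agreement rate and the keys where the gap exceeds `tolerance`.
    """
    shared = sorted(scores_a.keys() & scores_b.keys())
    if not shared:
        raise ValueError("reviewers have no scored items in common")
    disagreements = [k for k in shared if abs(scores_a[k] - scores_b[k]) > tolerance]
    return {
        "agreement_rate": 1 - len(disagreements) / len(shared),
        "disagreements": disagreements,
    }


# Hypothetical scores for two sample outputs on two dimensions.
reviewer_a = {("s1", "faithfulness"): 5, ("s1", "review_effort"): 3,
              ("s2", "faithfulness"): 3, ("s2", "review_effort"): 3}
reviewer_b = {("s1", "faithfulness"): 5, ("s1", "review_effort"): 1,
              ("s2", "faithfulness"): 3, ("s2", "review_effort"): 3}

report = calibration_report(reviewer_a, reviewer_b)
print(report["agreement_rate"])  # 0.75
print(report["disagreements"])   # [('s1', 'review_effort')]
```

The list of disagreements is the useful part: each entry points at a score level whose wording the reviewers read differently.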

Weight only after dimensions are clear

Weights should reflect consequences. In a RAG answer, faithfulness may matter more than style. In a prompt-format workflow, schema compliance may be mandatory. In a code review workflow, false positives may deserve extra weight because reviewer trust is fragile.

Avoid one universal score when user needs differ. Segment the result when necessary. A tool or prompt can be appropriate for low-risk drafting and inappropriate for customer-facing claims.
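
As a worked example of weighting and segmentation, the sketch below scores the same output for two uses. Everything here is an assumption for illustration: the `weighted_score` function, the idea of expressing a mandatory dimension as a minimum-score gate, and the specific weights and scores.

```python
from typing import Optional


def weighted_score(scores: dict, weights: dict, gates: dict) -> Optional[float]:
    """Weighted mean on the 1-5 scale, or None if a mandatory gate fails.

    `gates` maps a dimension to the minimum score it must reach.
    """
    for dim, minimum in gates.items():
        if scores[dim] < minimum:
            return None  # a gate failure is not averaged away
    total_weight = sum(weights.values())
    return sum(scores[d] * w for d, w in weights.items()) / total_weight


# Hypothetical weights per segment: the same output, judged for two uses.
segments = {
    "low_risk_drafting": {"weights": {"faithfulness": 1, "style": 1}, "gates": {}},
    "customer_facing": {"weights": {"faithfulness": 3, "style": 1},
                        "gates": {"faithfulness": 5}},
}
scores = {"faithfulness": 3, "style": 5}

results = {name: weighted_score(scores, cfg["weights"], cfg["gates"])
           for name, cfg in segments.items()}
print(results)  # {'low_risk_drafting': 4.0, 'customer_facing': None}
```

For drafting, the output averages (3 + 5) / 2 = 4.0. For customer-facing use, the faithfulness gate fails, so no score is reported at all. Treating mandatory dimensions as gates rather than large weights keeps a strong style score from masking an unsupported answer.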

The RAG Evaluation Checklist shows how dimensions shift when retrieval and citation quality become part of the product.

Add failure labels

Scores tell you how serious a result is. Failure labels tell you what to fix. Add labels such as unsupported claim, wrong citation, missed edge case, false positive, unsafe action, format break, or no-answer failure.

Failure labels should feed the next iteration. A prompt failure suggests a prompt change. A retrieval failure suggests corpus or chunking work. A review-effort failure may require a different output format.
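
Failure labels become actionable once they are counted by fix path. In this sketch the assignment of each label to a fix path is an assumption made for illustration (the guide names the labels and the fix paths but does not pair them), and unmapped labels fall into a hypothetical "triage" bucket.

```python
from collections import Counter

# Hypothetical mapping from failure label to the area that owns the fix.
FIX_PATHS = {
    "unsupported claim": "prompt",
    "wrong citation": "retrieval",
    "missed edge case": "prompt",
    "format break": "output format",
    "no-answer failure": "prompt",
}


def fix_queue(labeled_failures):
    """Count failures per fix path, most frequent first."""
    counts = Counter(FIX_PATHS.get(label, "triage") for label in labeled_failures)
    return counts.most_common()


# Labels a reviewer might attach during one eval run (invented data).
observed = ["wrong citation", "unsupported claim", "wrong citation",
            "format break", "wrong citation"]
print(fix_queue(observed))
```

In this invented run, retrieval accounts for three of five failures, which suggests corpus or chunking work before another round of prompt edits.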

Verification checklist

Before using a rubric, confirm:

The decision is named.
Each dimension affects that decision.
Score levels are observable.
Weights reflect user impact.
Failure labels map to fix paths.
Reviewers can score the same fixture consistently.
The rubric includes a threshold for action.
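
Most of this checklist can be checked mechanically before a run. The sketch below assumes a rubric stored as a plain dictionary; the key names (`decision`, `threshold`, `dimensions`, `fix_paths`) and the draft values are hypothetical. Reviewer consistency is not covered here because it requires scored fixtures rather than a structural check.

```python
def rubric_problems(rubric: dict) -> list:
    """Return checklist items the rubric fails; an empty list means ready to use."""
    problems = []
    if not rubric.get("decision"):
        problems.append("decision is not named")
    if rubric.get("threshold") is None:
        problems.append("no threshold for action")
    for dim in rubric.get("dimensions", []):
        name = dim.get("name", "<unnamed>")
        if not dim.get("decision_impact"):
            problems.append(f"{name}: no stated effect on the decision")
        if len(dim.get("levels", {})) < 2:
            problems.append(f"{name}: score levels are not defined")
        if dim.get("weight") is None:
            problems.append(f"{name}: no weight")
    if not rubric.get("fix_paths"):
        problems.append("failure labels do not map to fix paths")
    return problems


# A draft rubric with deliberate gaps (invented for illustration).
draft = {
    "decision": "Can this prompt variant ship?",
    "threshold": None,
    "dimensions": [
        {"name": "faithfulness", "decision_impact": "blocks launch",
         "levels": {5: "all claims supported", 1: "unsupported claims"},
         "weight": 3},
        {"name": "latency", "levels": {}, "weight": 1},
    ],
    "fix_paths": {"unsupported claim": "prompt"},
}
for problem in rubric_problems(draft):
    print(problem)
```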

If the rubric will support public claims, keep the run log and failure examples. Do not publish a recommendation based on rubric scores alone.

FAQ

What makes an eval rubric useful?

A useful eval rubric turns subjective review into repeatable scoring dimensions that map to a decision.

Should every eval use the same rubric?

No. Each workflow has different risks, users, and review costs, so the rubric should be built around the decision it supports.
