A practical LLM evaluation framework for testing correctness, faithfulness, format compliance, safety, latency, and human review effort before launch.
An LLM evaluation framework should include fixtures, scoring weights, failure categories, decision rules, and a retest cadence. Without those parts, teams usually end up comparing impressive demos instead of repeatable evidence.
The most fluent output should not automatically get the best score. A concise answer that states its uncertainty correctly may be better than a polished answer that invents missing facts. A slower answer may be acceptable for research and unacceptable for chat support. The framework should match the workflow, not a generic model leaderboard.
Use this guide with Eval Rubric Design when defining scoring, Prompt Testing Framework when comparing prompt variants, and RAG Evaluation Checklist when retrieval is part of the system.
Start by naming the task the model is expected to perform. “Evaluate the model” is too broad. Better task boundaries include source-backed summary, code review finding extraction, RAG answer generation, prompt-format compliance, support response drafting, or agent action planning.
For each task, define:
The decision is important. Are you choosing a prompt, selecting a model, approving a workflow for pilot, or monitoring regression after a model change? Different decisions need different evidence.
Write the decision in the eval document before running candidates. That one sentence prevents the eval from becoming an open-ended comparison exercise. It also helps reviewers reject attractive outputs that do not serve the workflow.
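One way to make the task and decision explicit is to record them as data before any candidate runs. The following is a minimal sketch, not a prescribed format: the `EvalPlan` and `Decision` names and the example sentence are hypothetical, and the four decision types are taken from the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    # The four decision types named in the text; names are illustrative.
    CHOOSE_PROMPT = "choose a prompt"
    SELECT_MODEL = "select a model"
    APPROVE_PILOT = "approve a workflow for pilot"
    MONITOR_REGRESSION = "monitor regression after a model change"


@dataclass(frozen=True)
class EvalPlan:
    task: str               # a bounded task, not "evaluate the model"
    decision: Decision      # what the eval result will be used to decide
    decision_sentence: str  # written before any candidate is run

    def __post_init__(self):
        # Refuse to create a plan without the one-sentence decision.
        if not self.decision_sentence.strip():
            raise ValueError("write the decision before running candidates")


# Hypothetical example plan for a RAG workflow.
plan = EvalPlan(
    task="RAG answer generation",
    decision=Decision.SELECT_MODEL,
    decision_sentence="Pick one of two candidate models for the support pilot.",
)
```

Making the decision sentence a required field is a small guard against the eval drifting into an open-ended comparison.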
Fixtures are frozen cases that the system runs repeatedly. A good fixture set includes normal cases, edge cases, adversarial cases, and refusal cases. The point is not to cover everything. The point is to expose the failures that would change a launch, buying, or review decision.
For a source-backed writing workflow, fixtures should include missing source facts and ambiguous claims. For code review, include one clean diff to measure false positives. For RAG, include answerable and no-answer questions. For agent workflows, include permission boundaries and rollback cases.
Record the expected behavior before running the model. If the expected answer is written after reading the output, reviewers will rationalize plausible but unsupported results.
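The fixture idea above can be sketched as a frozen record plus a coverage check. This is an illustration under assumptions: the `Fixture` fields, the `missing_kinds` helper, and the sample cases are hypothetical. Only the four case kinds come from the text.

```python
from dataclasses import dataclass

# The four kinds of cases a good fixture set includes, per the text.
REQUIRED_KINDS = {"normal", "edge", "adversarial", "refusal"}


@dataclass(frozen=True)
class Fixture:
    fixture_id: str
    kind: str               # one of REQUIRED_KINDS
    input_text: str
    expected_behavior: str  # recorded before the model is run

    def __post_init__(self):
        if self.kind not in REQUIRED_KINDS:
            raise ValueError(f"unknown fixture kind: {self.kind}")
        if not self.expected_behavior.strip():
            raise ValueError("record the expected behavior first")


def missing_kinds(fixtures):
    """Return the fixture kinds that have no cases yet, sorted by name."""
    present = {f.kind for f in fixtures}
    return sorted(REQUIRED_KINDS - present)


# Hypothetical RAG fixtures: one answerable question, one no-answer question.
fixtures = [
    Fixture("rag-001", "normal", "What is the refund window?",
            "Answers from the policy document and cites it."),
    Fixture("rag-002", "refusal", "What is the warranty on product X?",
            "Says the sources do not cover this; invents nothing."),
]

print(missing_kinds(fixtures))  # kinds still to be written
```

Because the record is frozen and rejects an empty `expected_behavior`, the expectation cannot quietly be filled in or edited after reviewers have seen model output.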
One overall score is risky because different workflows value correctness, review effort, safety, latency, and faithfulness differently. Score dimensions separately first, then decide whether a weighted total is useful.
Common dimensions:
Weights should reflect user impact. A production support workflow should weight faithfulness and correct refusals heavily. A brainstorming assistant may tolerate more variance but still needs clear boundaries.
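To make "score separately first, then weight" concrete, here is a small sketch. The dimension names echo those mentioned in this guide. The weights, the scores, and the 0.0 to 1.0 scale are made-up assumptions for illustration, not recommended values.

```python
def weighted_total(scores, weights):
    """Combine per-dimension scores (0.0 to 1.0) into one weighted average."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in scores) / total_weight


def weakest_dimension(scores):
    """Return the (name, score) pair with the lowest score."""
    return min(scores.items(), key=lambda item: item[1])


# Hypothetical weights for a production support workflow:
# faithfulness and safety count for more than latency.
support_weights = {
    "correctness": 3, "faithfulness": 4, "safety": 3,
    "latency": 1, "review_effort": 1,
}

# Hypothetical per-dimension scores for one candidate.
candidate_scores = {
    "correctness": 0.9, "faithfulness": 0.5, "safety": 1.0,
    "latency": 0.8, "review_effort": 0.7,
}

total = weighted_total(candidate_scores, support_weights)
worst = weakest_dimension(candidate_scores)
print(f"weighted total: {total:.2f}, weakest: {worst[0]} ({worst[1]})")
```

Reporting the weakest dimension next to the total keeps a low faithfulness score visible even when the weighted average looks acceptable.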
Failure categories make evals actionable. Instead of writing “bad answer,” label the failure:
Each category should map to a fix path: retrieval improvement, prompt change, model setting, tool permission, human review, or product boundary. The LLM Output Verification Guide is useful for turning failures into review evidence.
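The mapping from failure label to fix path can be kept as a plain lookup table. In the sketch below, the six fix paths come from the paragraph above. The failure label names and their pairing with fix paths are hypothetical, since the right labels depend on the workflow.

```python
from collections import Counter

# Hypothetical failure labels, each mapped to one fix path from the text.
FIX_PATHS = {
    "missing_context": "retrieval improvement",
    "format_violation": "prompt change",
    "unstable_output": "model setting",
    "unauthorized_action": "tool permission",
    "unsupported_claim": "human review",
    "out_of_scope_request": "product boundary",
}


def fix_path_counts(failure_labels):
    """Count how many labeled failures point at each fix path."""
    unknown = set(failure_labels) - set(FIX_PATHS)
    if unknown:
        # An unlabeled "bad answer" is not actionable; force a real label.
        raise ValueError(f"unmapped failure labels: {sorted(unknown)}")
    return Counter(FIX_PATHS[label] for label in failure_labels)


# Labels collected from one hypothetical eval run.
run_labels = ["missing_context", "missing_context", "format_violation"]
counts = fix_path_counts(run_labels)
print(counts.most_common())
```

Counting by fix path rather than by raw label shows where the next round of work should go.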
An eval should say what happens next. Examples:
The decision rule prevents subjective cherry-picking. It also makes it easier to explain why a workflow is not ready yet.
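A decision rule can be written down as a small function before any results exist. This is only a sketch: the `decide` function, the three outcome strings, and the minimum scores are hypothetical, and a real rule would use thresholds agreed for the specific workflow.

```python
def decide(dimension_scores, blocking_failures, minimums):
    """Apply a fixed rule: block, revise, or approve for pilot.

    dimension_scores: per-dimension scores on a 0.0 to 1.0 scale
    blocking_failures: count of failures that must never ship
    minimums: per-dimension floor each score has to reach
    """
    if blocking_failures > 0:
        return "block"
    # A dimension with no recorded score is treated as failing its floor.
    below = sorted(
        name for name, floor in minimums.items()
        if dimension_scores.get(name, 0.0) < floor
    )
    if below:
        return "revise: " + ", ".join(below)
    return "approve for pilot"


# Hypothetical floors for a support workflow.
minimums = {"correctness": 0.8, "faithfulness": 0.9, "safety": 1.0}

result = decide(
    {"correctness": 0.9, "faithfulness": 0.5, "safety": 1.0},
    blocking_failures=0,
    minimums=minimums,
)
print(result)
```

Naming the dimensions that fell short in the outcome makes it easier to explain why a workflow is not ready yet.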
Before trusting an LLM eval, confirm: