A practical guide to designing AI evaluation rubrics with clear scoring dimensions, weights, failure labels, and decision thresholds for teams.
A useful eval rubric turns subjective review into repeatable scoring dimensions that map to a decision. It should make tradeoffs visible: correctness versus speed, source faithfulness versus fluency, review effort versus automation, and safety versus task completion.
Rubrics fail when they collapse everything into one vague score. A model can be correct but hard to review, fluent but unsupported, fast but unsafe, or safe but too incomplete for the task. The rubric should show those differences before anyone argues about a winner.
Use this guide with the LLM Evaluation Framework for the full eval loop and Benchmark Methodology for public-facing benchmark discipline.
Design the rubric after naming the decision. Are you choosing a prompt variant, approving a workflow for internal use, comparing candidate tools, or deciding whether a model change caused a regression?
The decision determines what matters. A support bot needs faithfulness, no-answer behavior, and escalation. A code review assistant needs finding precision, bug recall, evidence quality, and reviewer effort. A documentation workflow needs source faithfulness, example correctness, clarity, and update burden.
If the decision is unclear, the rubric will collect interesting numbers that nobody uses.
Write the decision at the top of the rubric. A small team may need to choose whether a prompt can ship. A platform team may need to decide whether a model update regressed a workflow. Those decisions should not share the same threshold by default.
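One way to keep the decision attached to the rubric is to store it as a header next to its own threshold. The sketch below is illustrative only: `RubricHeader`, `decide`, the field names, and the threshold values are assumptions made for this example, not part of any standard tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricHeader:
    """Hypothetical rubric header: the decision comes first."""
    decision: str          # the question the eval has to answer
    owner: str             # who acts on the result
    pass_threshold: float  # minimum weighted score (0 to 1) needed to approve


# Placeholder thresholds; each team should set its own.
ship_prompt = RubricHeader(
    decision="Can this prompt ship?",
    owner="small product team",
    pass_threshold=0.75,
)
model_regression = RubricHeader(
    decision="Did the model update regress this workflow?",
    owner="platform team",
    pass_threshold=0.90,
)


def decide(header: RubricHeader, weighted_score: float) -> str:
    """Map a score to the decision using the rubric's own threshold."""
    return "approve" if weighted_score >= header.pass_threshold else "hold"
```

The same score can then lead to different outcomes: `decide(ship_prompt, 0.80)` approves, while `decide(model_regression, 0.80)` holds, because the two decisions do not share a threshold.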
Pick dimensions that reflect real user or operator risk. Common dimensions include:
Do not include a dimension just because it is measurable. A metric that does not affect a decision creates noise.
Each dimension needs clear score levels. Avoid labels such as “good” or “bad” without evidence. A better rubric says what a reviewer should see.
For source faithfulness:
For review effort:
This makes scoring more consistent across reviewers.
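Score levels can be written down as data so that every score points to an observable anchor and carries evidence. The following is a minimal sketch; the level wording, the three-point scale, and the names `LEVELS`, `Score`, and `anchor_for` are hypothetical examples, not the levels this guide prescribes.

```python
from dataclasses import dataclass

# Hypothetical anchors: each level says what a reviewer should see.
LEVELS = {
    "source_faithfulness": {
        0: "At least one claim contradicts the provided sources.",
        1: "No contradictions, but some claims have no supporting passage.",
        2: "Every claim traces to a passage in the provided sources.",
    },
    "review_effort": {
        0: "Reviewer had to redo the task to verify the output.",
        1: "Reviewer had to look up some evidence independently.",
        2: "Reviewer could verify the output from what it presented.",
    },
}


@dataclass
class Score:
    dimension: str
    level: int
    evidence: str  # what the reviewer actually saw


def anchor_for(score: Score) -> str:
    """Return the anchor text for a score, rejecting scores without evidence."""
    if score.dimension not in LEVELS:
        raise ValueError(f"unknown dimension: {score.dimension}")
    if score.level not in LEVELS[score.dimension]:
        raise ValueError(f"unknown level {score.level} for {score.dimension}")
    if not score.evidence.strip():
        raise ValueError("a score needs evidence, not just a number")
    return LEVELS[score.dimension][score.level]
```

Requiring an evidence string is a design choice in this sketch: it makes a bare "good" or "bad" impossible to record.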
When possible, score two sample outputs together before running the full eval. That calibration pass exposes vague language in the rubric and prevents reviewers from using different standards for the same score.
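A calibration pass can be checked with a simple comparison of two reviewers' scores on the same outputs. This is a sketch under assumptions: the data layout (scores keyed by output id and dimension), the sample values, and the `disagreements` helper are invented for illustration.

```python
def disagreements(reviewer_a, reviewer_b, tolerance=0):
    """List (output_id, dimension, score_a, score_b) where two reviewers
    differ by more than `tolerance` on items both of them scored."""
    diffs = []
    for key in sorted(reviewer_a.keys() & reviewer_b.keys()):
        a, b = reviewer_a[key], reviewer_b[key]
        if abs(a - b) > tolerance:
            output_id, dimension = key
            diffs.append((output_id, dimension, a, b))
    return diffs


# Hypothetical calibration data: two outputs, two dimensions each.
reviewer_a = {
    ("out-1", "source_faithfulness"): 2,
    ("out-1", "review_effort"): 1,
    ("out-2", "source_faithfulness"): 0,
    ("out-2", "review_effort"): 2,
}
reviewer_b = {
    ("out-1", "source_faithfulness"): 2,
    ("out-1", "review_effort"): 2,
    ("out-2", "source_faithfulness"): 0,
    ("out-2", "review_effort"): 2,
}

for output_id, dimension, a, b in disagreements(reviewer_a, reviewer_b):
    print(f"{output_id} / {dimension}: reviewer A gave {a}, reviewer B gave {b}")
```

Each disagreement is a prompt to reread the level descriptions for that dimension and tighten the wording before the full run.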
Weights should reflect consequences. In a RAG answer, faithfulness may matter more than style. In a prompt-format workflow, schema compliance may be mandatory. In a code review workflow, false positives may matter because reviewer trust is fragile.
Avoid one universal score when user needs differ. Segment the result when necessary. A tool or prompt can be appropriate for low-risk drafting and inappropriate for customer-facing claims.
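Weights, mandatory dimensions, and segmented results can be combined in a few lines. The sketch below assumes a 0 to 2 scale per dimension; the weights, gate, segment names, and thresholds are placeholders chosen for this example, and `weighted_score` and `evaluate` are hypothetical helpers.

```python
def weighted_score(scores, weights, max_level=2):
    """Weighted mean of per-dimension scores, normalized to the range 0 to 1."""
    total_weight = sum(weights.values())
    return sum(w * scores[dim] / max_level for dim, w in weights.items()) / total_weight


def evaluate(scores, weights, gates, threshold):
    """Apply mandatory gates first, then compare the weighted score to a threshold."""
    failed = [dim for dim, minimum in gates.items() if scores[dim] < minimum]
    if failed:
        return {"decision": "reject", "failed_gates": failed, "score": None}
    score = weighted_score(scores, weights)
    decision = "approve" if score >= threshold else "hold"
    return {"decision": decision, "failed_gates": [], "score": score}


# Placeholder weights: faithfulness outweighs style, as in a RAG answer.
weights = {"source_faithfulness": 0.5, "correctness": 0.3, "style": 0.2}
# Schema compliance is treated as mandatory rather than weighted.
gates = {"schema_compliance": 2}
# One threshold per segment instead of one universal score.
segment_thresholds = {"low_risk_drafting": 0.60, "customer_facing": 0.90}

scores = {"source_faithfulness": 2, "correctness": 1, "style": 2, "schema_compliance": 2}

results = {
    segment: evaluate(scores, weights, gates, threshold)
    for segment, threshold in segment_thresholds.items()
}
```

With these placeholder numbers the weighted score is about 0.85, so the same output is approved for low-risk drafting and held for customer-facing use. A gate failure rejects the output regardless of how well it scores elsewhere.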
The RAG Evaluation Checklist shows how dimensions shift when retrieval and citation quality become part of the product.
Scores tell you how serious a result is. Failure labels tell you what to fix. Add labels such as unsupported claim, wrong citation, missed edge case, false positive, unsafe action, format break, or no-answer failure.
Failure labels should feed the next iteration. A prompt failure suggests a prompt change. A retrieval failure suggests corpus or chunking work. A review-effort failure may require a different output format.
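Failure labels become actionable once they are counted and routed to a next step. In the sketch below, the label names and the next steps come from the text above, but the mapping from each label to a failure class is an assumption made for illustration; a real team would assign classes during review.

```python
from collections import Counter

# Assumed mapping from label to failure class (illustrative only).
FAILURE_CLASS = {
    "unsupported claim": "prompt",
    "format break": "prompt",
    "wrong citation": "retrieval",
    "no-answer failure": "retrieval",
}

# Next steps per failure class.
NEXT_STEP = {
    "prompt": "revise the prompt",
    "retrieval": "rework the corpus or chunking",
    "review_effort": "try a different output format",
}


def next_steps(labels):
    """Return (label, count, suggested step), most frequent label first.
    Labels without an assumed class are flagged for manual triage."""
    plan = []
    for label, count in Counter(labels).most_common():
        failure_class = FAILURE_CLASS.get(label)
        step = NEXT_STEP.get(failure_class, "triage manually")
        plan.append((label, count, step))
    return plan


# Hypothetical labels collected from one eval run.
observed = [
    "unsupported claim", "wrong citation", "unsupported claim",
    "format break", "unsafe action", "unsupported claim", "wrong citation",
]

for label, count, step in next_steps(observed):
    print(f"{count} x {label}: {step}")
```

Sorting by frequency puts the most common failure at the top of the next iteration's work list, while unmapped labels such as an unsafe action stay visible instead of being silently dropped.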
Before using a rubric, confirm:
If the rubric will support public claims, keep the run log and failure examples. Do not publish a recommendation from a rubric alone.
Evals should not all share one rubric, because each workflow has different risks, users, and review costs.