Using an LLM as a judge can grade thousands of outputs in minutes — but only once you calibrate it against humans and neutralize four predictable biases. Here is the production playbook.
- What “LLM as a judge” means — and when to use it
- Pointwise, pairwise, or reference-based: pick the right scoring mode
- How to write a judge prompt that actually works
- Calibrate the judge against humans before you trust one score
- The four failure modes — and how to neutralize each
- Running LLM as a judge in production
- Builder’s take
- Frequently asked questions
  - Is using an LLM as a judge reliable enough for production?
  - Should I use pointwise or pairwise scoring?
  - Can a model judge its own outputs?
  - How many human labels do I need to calibrate an LLM judge?
  - What temperature should the judge run at?
  - Does asking the judge to reason first actually help?
- Primary sources
What “LLM as a judge” means — and when to use it
Using an LLM as a judge means handing a strong language model the output of another model and asking it to score or rank that output against a rubric — instead of relying only on human reviewers or rigid string-matching metrics. The approach went mainstream after Zheng and colleagues showed in 2023 that a capable judge agrees with human preferences more than 80% of the time, roughly the same rate at which two humans agree with each other.
Reach for an LLM judge when the thing you are measuring is open-ended and subjective: summaries, chat answers, retrieval-augmented (RAG) responses, agent trajectories, helpfulness, tone. These are exactly the cases where BLEU, ROUGE, or a regex fall apart because there is no single correct string.
Do not reach for a judge when you can check the answer deterministically. Did the JSON parse? Did the API return 200? Did the generated SQL run without error? Did the citation point to a real document? Use code for those — it is cheaper, faster, and never drifts. A judge is a measurement instrument with systematic error, and the rest of this playbook is about measuring that error and keeping it small enough to make decisions on.
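As a rough illustration of the deterministic checks listed above, the sketch below covers two of them using only the Python standard library. The function names `json_parses` and `sql_runs` are hypothetical, and the in-memory SQLite database is an assumption standing in for whatever engine the generated SQL actually targets.

```python
import json
import sqlite3


def json_parses(text: str) -> bool:
    """Return True if text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def sql_runs(sql: str, schema: str = "") -> bool:
    """Return True if a single SQL statement executes without error.

    Assumption: a throwaway in-memory SQLite database is an acceptable
    stand-in for the real target database.
    """
    conn = sqlite3.connect(":memory:")
    try:
        if schema:
            conn.executescript(schema)  # create the tables the query expects
        conn.execute(sql)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()
```

Checks like these return the same answer every time, which is why they belong in code and not in a judge prompt.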

An LLM judge takes (input, output, optional reference, rubric) and returns a score or a preference. It is not ground truth — it is a fast, cheap approximation of a human rater that you must calibrate before trusting.
Pointwise, pairwise, or reference-based: pick the right scoring mode
There are three ways to score with an LLM as a judge, and choosing the wrong one is the most common reason an eval pipeline produces numbers nobody trusts.
Pointwise grading asks the judge to assign one output a score — say 1 to 5 against a rubric. It is ideal for an always-on dashboard and longitudinal monitoring. Its weakness: scores compress when quality differences are small, and they drift slightly between runs even at temperature 0.
Pairwise grading shows the judge two outputs, A and B, and asks which is better. It is far more reliable for the real question behind a release — “is the new version better than the baseline?” — but you must run both orderings (A,B and B,A) to cancel position bias, which roughly doubles the call count.
Reference-based grading gives the judge a known-good answer and asks how close the output comes. When you have gold data, it turns a fuzzy judgment into a far higher-signal comparison.
Mature pipelines use all three at once: pointwise for the overnight dashboard, pairwise for the release-gate decision, and reference-based wherever a gold answer exists.
| Mode | Best for | Judge returns | Main weakness | Relative cost |
|---|---|---|---|---|
| Pointwise | Dashboards, monitoring over time | A score (e.g. 1–5) | Compresses on small gaps; drifts | 1× (cheapest) |
| Pairwise | Release gates, A/B model selection | A winner (A, B, or tie) | Needs both orderings; position bias | 2×+ |
| Reference-based | Anything with a gold answer | Closeness to reference | Requires curated gold data | 1–2× |
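The both-orderings rule for pairwise grading can be sketched as follows. This is a minimal illustration, not a specific library's API: `PairJudge` is a hypothetical callable that receives the two answers in display order and returns "first", "second", or "tie", and treating any disagreement between the two orderings as a tie is one conservative choice among several.

```python
from typing import Callable

# Hypothetical judge signature: (answer_in_slot_1, answer_in_slot_2) -> "first" | "second" | "tie"
PairJudge = Callable[[str, str], str]


def pairwise_both_orders(judge: PairJudge, a: str, b: str) -> str:
    """Return "A", "B", or "tie"; a win counts only if it holds in both orderings."""
    forward = judge(a, b)   # A shown in the first slot
    backward = judge(b, a)  # B shown in the first slot

    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    # Either an explicit tie or a verdict that flipped with position.
    return "tie"
```

A judge that always prefers the first slot yields "tie" for every pair under this rule, so pure position bias cannot produce a win.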
How to write a judge prompt that actually works
Three things separate an LLM-as-a-judge prompt that reaches 96% agreement with humans from one stuck at 67%. Evidently AI’s code-review case study walked exactly that path by changing only the prompt.
An explicit rubric. Define every score level concretely. “Rate the helpfulness 1–5” invites noise; spelling out what a 5 versus a 3 looks like removes it.
A constrained, structured output. Force the judge to answer in JSON with a fixed scale. Free-form prose verdicts are unparseable and inconsistent across runs.
Reasoning before the verdict. Ask the judge to think through its decision first. In the same case study, simply requiring the judge to explain its reasoning moved accuracy from 96% to 98%.
The template below puts all three together. Note two production details baked in: the judge runs at temperature=0 to keep run-to-run variation in scores as small as possible, and it is deliberately from a different model family than whatever generated the answer (more on why in the failure-modes section).
Run the judge at temperature 0, or your scores wobble run-to-run for no reason. And always parse a structured field — if you regex a number out of prose, a chatty judge will eventually break your pipeline.
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are a strict evaluator. Score ASSISTANT_ANSWER against the RUBRIC.

RUBRIC: faithfulness to the SOURCE
5 Every claim is directly supported by the SOURCE.
4 All major claims supported; one minor detail unstated but plausible.
3 Mostly supported, but contains one unsupported claim.
2 Several unsupported or contradicted claims.
1 Largely fabricated or contradicts the SOURCE.

SOURCE:
{source}

ASSISTANT_ANSWER:
{answer}

First reason step by step: list each claim and whether the SOURCE supports it.
Then give the verdict. Respond ONLY with JSON: {{"reasoning": "...", "score": <1-5>}}"""


def judge(source: str, answer: str) -> dict:
    """Score one answer for faithfulness; returns {"reasoning": ..., "score": ...}."""
    msg = client.messages.create(
        model="claude-sonnet-4-6",  # judge from a DIFFERENT family than the generator
        max_tokens=600,
        temperature=0,  # minimizes run-to-run variation in scores
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(source=source, answer=answer)}],
    )
    # Raises json.JSONDecodeError if the judge wraps the JSON in extra prose.
    return json.loads(msg.content[0].text)
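Parsing a structured field is safer still when the parsed value is validated before it reaches a dashboard. The sketch below assumes the JSON shape requested by the judge prompt (an object with an integer "score" from 1 to 5); `parse_verdict` is a hypothetical helper name, and rejecting anything malformed with `ValueError` is one possible policy.

```python
import json


def parse_verdict(raw: str, lo: int = 1, hi: int = 5) -> dict:
    """Parse the judge's JSON reply and validate its score field.

    Raises ValueError unless raw is a JSON object whose "score"
    is an integer inside [lo, hi].
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside {lo}-{hi}")
    return data
```

Failing loudly on a malformed reply lets the pipeline retry or flag the item, which is usually preferable to recording a silently wrong score.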
Calibrate the judge against humans before you trust one score
- 80%+ judge–human agreement: a strong judge matches human preference about as often as two humans agree (Zheng et al., 2023).
- 30–50 labels to start: representative hand-labeled examples to begin calibration; more is better.
- 0.6+ Cohen’s kappa: the threshold generally treated as substantial agreement between judge and humans.
An uncalibrated judge is a number generator. Calibration is the step that turns an LLM as a judge into a measurement you can defend in a launch review.
Step 1 — hand-label a gold set. Label 30–50 representative examples yourself (more is better). This is your ground truth, and building it forces you to make your own criteria explicit.
Step 2 — score the same set with the judge and compare. For a pass/fail label, track accuracy, precision, and recall. For a graded scale, use Cohen’s kappa, which measures agreement while correcting for the agreement you would get by chance — a kappa above 0.6 is generally considered substantial.
Step 3 — iterate the prompt, not the model. Evidently AI’s case study climbed from 67% accuracy (one-line prompt) to 96% (detailed rubric) to 98% (rubric plus required reasoning) without changing the judge model at all.
Know the ceiling you cannot beat: human labels are noisy, so judge–human agreement cannot be meaningfully pushed past the rate at which humans agree with each other on the same task. If two experts agree only 65% of the time on “is this answer helpful,” a judge tuned to 85% agreement with one expert’s labels is mostly fitting that expert’s noise. Aim for consistent, repeatable verdicts, not perfect agreement, because humans disagree too.
When a judge and a deterministic check disagree, that gap is the most valuable data you have. Read those cases by hand first — they tell you whether the judge is wrong or your rubric is.
from sklearn.metrics import cohen_kappa_score, classification_report

# Your hand-labeled gold set vs the judge's scores on the SAME items
human = [5, 4, 2, 5, 1, 3, 4, 2, 5, 3]
judged = [5, 4, 3, 5, 1, 3, 4, 2, 4, 3]

# Cohen's kappa corrects for agreement that would happen by chance.
print("kappa:", round(cohen_kappa_score(human, judged), 3))  # >= 0.6 = substantial

# If you actually decide on a pass/fail gate (score >= 4), check that bucket:
print(classification_report(
    [h >= 4 for h in human],
    [j >= 4 for j in judged],
    target_names=["fail", "pass"]))
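To make the chance correction concrete, here is a from-scratch sketch of unweighted Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The function name `cohen_kappa` is illustrative, and returning 1.0 for the degenerate case p_e = 1 is a choice made here for simplicity.

```python
from collections import Counter


def cohen_kappa(rater_a: list, rater_b: list) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the two raters' frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


human = [5, 4, 2, 5, 1, 3, 4, 2, 5, 3]
judged = [5, 4, 3, 5, 1, 3, 4, 2, 4, 3]
print(round(cohen_kappa(human, judged), 3))  # p_o = 0.8, p_e = 0.21
```

On these ten labels the raters agree on 8 of 10 items, chance alone would give 0.21, and kappa works out to 0.59 / 0.79, or about 0.747.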
The four failure modes — and how to neutralize each
Four biases show up so reliably when you run an LLM as a judge that you should assume each is present until you have measured otherwise.
“Agreement is not alignment. A judge that matches your labels on style can still be blind to whether the answer is true.”
The core lesson of LLM-judge research
1. Position bias — the judge favors whichever answer it sees first
In pairwise comparisons the answer in the first slot can win 10–15 percentage points more often regardless of quality. Fix: run both orderings (A,B and B,A) and only count a win if it holds in both directions; or randomize order across the dataset and aggregate. Never trust a single-ordering pairwise result.
2. Verbosity bias — longer answers score higher at equal quality
Judges treat length as a proxy for thoroughness, so a padded answer beats a tight one. Fix: add an explicit “do not reward length; penalize padding” clause to the rubric, normalize for token count, or supply a concise reference answer so the judge has a length anchor.
3. Self-preference & family bias — models rate their own family 10–25% higher
A GPT judge tends to prefer GPT outputs; a Claude judge prefers Claude. Fix: never let a model judge an A-vs-B comparison if its own family is on the ballot. Use a judge from a different provider, or for high-stakes calls run a three-judge cross-family ensemble (for example Claude, GPT, and Gemini) and take a majority vote.
4. Style over substance — fluent, confident prose beats correct-but-plain answers
Research on judge failure modes shows judges over-weight tone and under-weight factuality and safety — their scores can fail to correlate with world knowledge and instruction-following. Fix: use reference-guided grading, decompose into separate factuality / safety / style sub-judges, and always spot-check the items where the judge and a code-based check disagree.
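Since each bias should be assumed present until measured, one rough way to measure verbosity bias is to check how strongly judge scores track answer length on a set of answers that humans rated as roughly equal in quality. The sketch below is an assumption-laden diagnostic, not a method taken from the cited sources: `length_score_correlation` is a hypothetical name, and whitespace-separated word count is used as a crude proxy for token count.

```python
import math


def length_score_correlation(answers: list[str], scores: list[float]) -> float:
    """Pearson correlation between answer length (in words) and judge score."""
    if len(answers) != len(scores) or len(answers) < 2:
        raise ValueError("need at least two (answer, score) pairs")

    lengths = [len(a.split()) for a in answers]  # word count as a rough token proxy
    mean_l = sum(lengths) / len(lengths)
    mean_s = sum(scores) / len(scores)

    cov = sum((l - mean_l) * (s - mean_s) for l, s in zip(lengths, scores))
    var_l = sum((l - mean_l) ** 2 for l in lengths)
    var_s = sum((s - mean_s) ** 2 for s in scores)

    if var_l == 0 or var_s == 0:
        return 0.0  # correlation is undefined without variation; report 0
    return cov / math.sqrt(var_l * var_s)
```

A clearly positive value on equal-quality answers suggests the judge is rewarding length; a value near zero is consistent with the rubric's length clause doing its job.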
Running LLM as a judge in production
Ensemble the high-stakes calls. For a launch or release gate, run three judges from three different families and aggregate by majority or weighted vote. As of 2026 a defensible default trio is Claude Sonnet 4.5, GPT-5.1, and Gemini 2.5 Pro — three families, so no single self-preference bias dominates.
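The majority-vote aggregation can be sketched in a few lines. This assumes each judge has already been reduced to a single verdict string such as "A", "B", or "tie"; the name `majority_vote` and the rule of returning "tie" when no verdict has a strict majority are illustrative choices.

```python
from collections import Counter


def majority_vote(verdicts: list[str]) -> str:
    """Return the verdict chosen by a strict majority of judges, else "tie"."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner if count > len(verdicts) / 2 else "tie"
```

With three judges, two matching verdicts decide the outcome, and a three-way split falls back to "tie" so that no single judge can carry a release gate alone.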
Keep humans in the loop, cheaply. Sample 5–10% of live judge verdicts and have a person re-grade them. That sample is your drift alarm: if agreement on it falls, your model, your prompt, or your traffic has shifted and it is time to recalibrate.
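A minimal sketch of that drift alarm is shown below, under stated assumptions: the function names are hypothetical, the 5% default rate is the low end of the range above, and the `tolerance` of 0.05 below the calibration baseline is an arbitrary placeholder to tune for your own data.

```python
import random


def sample_for_regrade(item_ids: list[str], rate: float = 0.05, seed: int = 0) -> list[str]:
    """Pick a reproducible random subset of judged items for human re-grading."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    if not item_ids:
        return []
    k = max(1, round(len(item_ids) * rate))
    return random.Random(seed).sample(item_ids, k)


def drift_alarm(judge_labels: list, human_labels: list,
                baseline: float, tolerance: float = 0.05) -> bool:
    """True if judge-human agreement on the sample fell more than tolerance below baseline."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    agreement = matches / len(judge_labels)
    return agreement < baseline - tolerance
```

Here `baseline` would be the agreement measured during calibration; when the alarm fires, the re-graded sample doubles as fresh labels for recalibrating.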
Budget the cost honestly. Pairwise with both orderings, times a three-judge ensemble, is roughly six times the calls of a single pointwise score. Reserve that setup for release gates; run one cheap pointwise judge for the always-on dashboard.
Version everything. The judge prompt, the judge model, and the rubric are all part of the instrument. Pin them. When you change one, re-baseline your history — otherwise your quality trendline is comparing two different rulers.
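One lightweight way to pin the instrument is to store a fingerprint of the judge configuration next to every score, so a change in any component is visible in the history. The sketch below is an assumption about how that might look: `judge_fingerprint` is a hypothetical helper, and truncating the SHA-256 digest to 12 hex characters is a readability choice, not a requirement.

```python
import hashlib
import json


def judge_fingerprint(prompt: str, model: str, rubric: str) -> str:
    """Short stable hash identifying one exact judge configuration."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "rubric": rubric},
        sort_keys=True,       # stable key order, so the hash is reproducible
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Two scores with different fingerprints were taken with different rulers, which is the signal to re-baseline before comparing them on one trendline.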
Never let the model you are shipping grade itself, and never change the judge prompt and the candidate model in the same experiment. Hold the instrument fixed while you measure the thing.
Builder’s take
I run the evaluation layer for Cyntr’s agent pipeline, and the LLM-as-a-judge setups that survived contact with production all shared one habit: we calibrated them against a human-labeled set before we trusted a single dashboard number. The ones where we skipped that step drifted quietly until a “green” release shipped a regression nobody caught.
- Calibrate first, automate second — a judge you have not checked against 30–50 of your own labels is a random number generator with good marketing.
- Cross-family by default: we never let the model we are shipping grade its own outputs, because the self-preference tax is real and it always flatters the incumbent.
- Keep a 5–10% human re-grade sample running forever — it is cheap insurance and the only thing that reliably catches drift before users do.
- Treat the judge prompt and judge model like production code: pin and version them, and re-baseline the moment either one changes.
Frequently asked questions
Is using an LLM as a judge reliable enough for production?
Yes for subjective, open-ended quality once you have calibrated it — strong judges reach 80%+ agreement with humans. It is never appropriate for anything you can verify deterministically (valid JSON, HTTP status, SQL that runs); use code-based checks there.
Should I use pointwise or pairwise scoring?
Pointwise for monitoring dashboards over time, pairwise for release decisions and A/B model selection, and reference-based whenever you have gold answers. Most production pipelines run all three.
Can a model judge its own outputs?
Avoid it. Self-preference and family bias inflate a judge’s scores for its own family by roughly 10–25%. Use a judge from a different provider, or a cross-family ensemble for high-stakes calls.
How many human labels do I need to calibrate an LLM judge?
Start with 30–50 representative hand-labeled examples — more is better. Track Cohen’s kappa against the judge; above 0.6 is generally substantial agreement.
What temperature should the judge run at?
Zero. You want identical inputs to produce identical scores; any randomness in the judge becomes noise in your metric.
Does asking the judge to reason first actually help?
Yes. Requiring the judge to explain its reasoning before emitting a score measurably raises agreement with humans — in one case study it moved accuracy from 96% to 98%.
Primary sources
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., NeurIPS 2023) — arXiv
- Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking — arXiv
- Mitigating the Bias of Large Language Model Evaluation — arXiv
- How to align an LLM judge with human labels (hands-on tutorial) — Evidently AI
- LLM-as-a-Judge metrics & G-Eval — Confident AI / DeepEval
- Calibrating an LLM-as-Judge with human corrections — LangChain
Last updated: May 30, 2026.