The reliability of LLM judges in evaluations refers to how consistently and accurately large language models assess tasks or responses. Calibration measures how closely the model’s confidence matches actual correctness, while agreement evaluates how often different LLMs or human evaluators reach the same conclusions. High calibration and agreement indicate trustworthy, reproducible judgments, making LLMs valuable tools for evaluating AI outputs, benchmarking models, and automating assessment processes in research and industry.
What is an LLM judge?
An LLM judge is a language model used to evaluate or score answers, outputs, or behaviors in a task, typically by prompting the model to rate quality, correctness, or alignment with criteria.
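In its simplest form, a judge is just a prompt plus a parser. Below is a minimal Python sketch; `call_llm` is a hypothetical placeholder for whatever chat-completion client is actually in use, and the 1-5 scale is an arbitrary choice.

```python
# Minimal LLM-judge sketch. `call_llm` is a hypothetical stand-in for a
# real chat-completion client: it takes a prompt string and returns text.
JUDGE_PROMPT = """You are an impartial judge. Rate the response below for
correctness and quality on a 1-5 scale. Reply with only the number.

Question: {question}
Response: {response}"""

def judge_score(call_llm, question: str, response: str) -> int:
    """Ask the judge model for a 1-5 rating and parse it."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    return int(reply.strip())
```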
What does calibration mean in the context of LLM judges?
Calibration refers to how well the judge's scores, read as probabilities, reflect actual outcomes: a well-calibrated judge that assigns a score of 0.7 should be correct on about 70% of such cases.
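One common way to quantify this is expected calibration error (ECE), which bins items by judge score and compares each bin's average score to its empirical accuracy. A minimal NumPy sketch, assuming scores have already been normalized to [0, 1]:

```python
import numpy as np

def expected_calibration_error(scores, correct, n_bins=10):
    """Average |confidence - accuracy| gap, weighted by bin size."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each score to a confidence bin: [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]
    bin_ids = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(scores[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of items in the bin
    return ece

# A well-calibrated judge: items scored ~0.7 are correct ~70% of the time.
print(expected_calibration_error([0.7, 0.7, 0.7, 0.9], [1, 1, 0, 1]))  # ~0.05
```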
What is agreement when evaluating LLM judgments?
Agreement measures how consistently different judges rate the same item. High agreement means judges give similar scores; low agreement points to subjective criteria, ambiguous items, or unreliable judging.
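Raw percent agreement can look high by chance alone, so chance-corrected statistics such as Cohen's kappa are standard. A small from-scratch sketch for two judges giving categorical ratings:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two judges on the same items."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if each judge rated randomly per their own label frequencies
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(ratings_a) | set(ratings_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two judges rating the same five answers on a 1-3 scale
print(cohens_kappa([3, 2, 3, 1, 2], [3, 2, 2, 1, 2]))  # 0.6875: substantial agreement
```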
How is the reliability of LLM judges assessed?
Reliability is evaluated with metrics like inter-judge agreement (e.g., Cohen's or Fleiss' kappa), calibration measures, and correlations with gold standards, plus analyses of prompts and settings that affect scores.
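In practice these pieces can be combined with standard libraries. A sketch using scikit-learn and SciPy; the judge scores and gold labels below are hypothetical placeholders:

```python
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: two LLM judges and a human gold standard on six items (1-5 scale)
judge_a = [5, 4, 2, 5, 3, 1]
judge_b = [5, 3, 2, 4, 3, 1]
gold    = [5, 4, 2, 5, 2, 1]

# Inter-judge agreement, chance-corrected
print("kappa:", cohen_kappa_score(judge_a, judge_b))

# Rank correlation between one judge and the gold standard
rho, p = spearmanr(judge_a, gold)
print("spearman rho:", rho, "p-value:", p)
```

Rerunning such a report while varying the judge prompt, temperature, or model version shows which settings the scores are sensitive to.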