The reliability of LLM judges in evaluations refers to how consistently and accurately large language models assess tasks or responses. Calibration measures how closely the model’s confidence matches actual correctness, while agreement evaluates how often different LLMs or human evaluators reach the same conclusions. High calibration and agreement indicate trustworthy, reproducible judgments, making LLMs valuable tools for evaluating AI outputs, benchmarking models, and automating assessment processes in research and industry.
What is an LLM judge?
An LLM judge is a language model used to evaluate or score answers, outputs, or behaviors in a task, typically by prompting the model to rate quality, correctness, or alignment with criteria.
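In its simplest form, a judge is just a prompt plus a parser. Below is a minimal Python sketch; `call_llm` is a hypothetical placeholder for whatever chat-completion client is actually in use, and the 1-5 scale is an arbitrary choice.

```python
# Minimal LLM-judge sketch. `call_llm` is a hypothetical stand-in for a
# real chat-completion client: it takes a prompt string and returns text.
JUDGE_PROMPT = """You are an impartial judge. Rate the response below for
correctness and quality on a 1-5 scale. Reply with only the number.

Question: {question}
Response: {response}"""

def judge_score(call_llm, question: str, response: str) -> int:
    """Ask the judge model for a 1-5 rating and parse it."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    return int(reply.strip())
```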
What does calibration mean in the context of LLM judges?
Calibration refers to how well the judge's scores, read as probabilities, reflect actual outcomes: a well-calibrated judge that assigns a score of 0.7 should be correct on about 70% of such cases.
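One common way to quantify this is expected calibration error (ECE), which bins items by judge score and compares each bin's average score to its empirical accuracy. A minimal NumPy sketch, assuming scores have already been normalized to [0, 1]:

```python
import numpy as np

def expected_calibration_error(scores, correct, n_bins=10):
    """Average |confidence - accuracy| gap, weighted by bin size."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each score to a confidence bin: [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]
    bin_ids = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(scores[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of items in the bin
    return ece

# A well-calibrated judge: items scored ~0.7 are correct ~70% of the time.
print(expected_calibration_error([0.7, 0.7, 0.7, 0.9], [1, 1, 0, 1]))  # ~0.05
```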
What is agreement when evaluating LLM judgments?
Agreement measures how consistently different judges rate the same item. High agreement means judges give similar scores; low agreement points to subjective criteria, ambiguous items, or unreliable judging.
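Raw percent agreement can look high by chance alone, so chance-corrected statistics such as Cohen's kappa are standard. A small from-scratch sketch for two judges giving categorical ratings:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two judges on the same items."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if each judge rated randomly per their own label frequencies
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(ratings_a) | set(ratings_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two judges rating the same five answers on a 1-3 scale
print(cohens_kappa([3, 2, 3, 1, 2], [3, 2, 2, 1, 2]))  # 0.6875: substantial agreement
```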
How is the reliability of LLM judges assessed?
Reliability is evaluated with metrics like inter-judge agreement (e.g., Cohen's or Fleiss' kappa), calibration measures, and correlations with gold standards, plus analyses of prompts and settings that affect scores.
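In practice these pieces can be combined with standard libraries. A sketch using scikit-learn and SciPy; the judge scores and gold labels below are hypothetical placeholders:

```python
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: two LLM judges and a human gold standard on six items (1-5 scale)
judge_a = [5, 4, 2, 5, 3, 1]
judge_b = [5, 3, 2, 4, 3, 1]
gold    = [5, 4, 2, 5, 2, 1]

# Inter-judge agreement, chance-corrected
print("kappa:", cohen_kappa_score(judge_a, judge_b))

# Rank correlation between one judge and the gold standard
rho, p = spearmanr(judge_a, gold)
print("spearman rho:", rho, "p-value:", p)
```

Rerunning such a report while varying the judge prompt, temperature, or model version shows which settings the scores are sensitive to.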