

What does 'LLM-as-a-Judge' mean?
It means using a large language model to assess or score the outputs of another model, acting as an automated judge in an evaluation pipeline; a minimal sketch of such a judge follows below.
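
To make this concrete, here is a minimal sketch of a single-score judge. The prompt wording, the 1-5 scale, and the call_model callable are all illustrative assumptions rather than a standard API; call_model stands in for whichever client you use to query the judge model (OpenAI, Anthropic, a local model, and so on).

from typing import Callable

# The rubric and 1-5 scale below are illustrative choices, not a fixed standard.
JUDGE_PROMPT = """You are an impartial evaluator.
Rate the answer to the question on a 1-5 scale for accuracy,
where 1 is entirely wrong and 5 is fully correct.
Question: {question}
Answer: {answer}
Respond with only the integer score."""

def judge_accuracy(call_model: Callable[[str], str],
                   question: str, answer: str) -> int:
    # One judge call; the prompt constrains the reply to a bare integer.
    reply = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())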
Why would you use a model to evaluate another model’s outputs?
LLMs can process many samples quickly, deliver nuanced judgments on quality dimensions such as accuracy, relevance, and safety, and make evaluation scalable in a way that complements, rather than replaces, human review.
What evaluation methods can an LLM-based judge employ?
It can rate outputs against predefined criteria, compare them to reference answers, express pairwise preferences between candidate outputs, generate explanations for its scores, and be combined with human spot checks for reliability; a pairwise sketch is shown below.
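
As a sketch of the pairwise method: the judge sees the question and two candidate answers and names the better one. The prompt and the "A"/"B" labels are illustrative, and call_model is the same hypothetical stand-in as above.

from typing import Callable

PAIRWISE_PROMPT = """You are an impartial evaluator.
Given the question and two candidate answers, decide which answer is
better overall, considering accuracy, relevance, and clarity.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Respond with only the letter A or B."""

def pairwise_judge(call_model: Callable[[str], str],
                   question: str, answer_a: str, answer_b: str) -> str:
    # Returns "A" or "B" for the judge's preferred answer.
    reply = call_model(PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    verdict = reply.strip().upper()
    if verdict not in ("A", "B"):
        raise ValueError(f"Unexpected judge reply: {reply!r}")
    return verdict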
What are common challenges with LLM-based evaluation and how can you mitigate them?
Risks include judge bias (for example, position or verbosity bias), sensitivity to prompt wording, inconsistent scores across runs, and data leakage when the judge has seen the test data. Mitigations include diversifying prompts, using multiple judges, ensemble scoring, human validation of a sample, and reporting uncertainty alongside scores; see the sketch after this answer.
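
Two of these mitigations are easy to sketch in code: ensemble scoring (sample the judge several times and take the median, which assumes a nonzero sampling temperature so repeated calls can differ) and position swapping for pairwise comparisons (run A/B in both orders and record a tie when the verdicts disagree, which counters position bias). judge_accuracy and pairwise_judge are the sketches from the earlier answers.

import statistics
from typing import Callable

def ensemble_score(call_model: Callable[[str], str],
                   question: str, answer: str, n: int = 5) -> float:
    # Median of n independent judge scores; a wide spread across the
    # samples is itself a useful signal of judge inconsistency.
    scores = [judge_accuracy(call_model, question, answer) for _ in range(n)]
    return statistics.median(scores)

def debiased_pairwise(call_model: Callable[[str], str],
                      question: str, answer_a: str, answer_b: str) -> str:
    # Ask twice with the candidates in opposite positions.
    first = pairwise_judge(call_model, question, answer_a, answer_b)
    second = pairwise_judge(call_model, question, answer_b, answer_a)
    # The second run had swapped labels, so flip its verdict back.
    second_flipped = "A" if second == "B" else "B"
    return first if first == second_flipped else "tie"

In practice the ensemble is stronger when the n calls go to different judge models or differently worded prompts rather than one model sampled repeatedly.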