"Building Baselines and Oracles for Evals (LLM Evaluations)" refers to establishing standard reference points (baselines) and ideal performance measures (oracles) to assess large language models (LLMs). Baselines provide a minimum expected level of performance, often using simple methods or previous models, while oracles represent optimal or expert-level outputs. These benchmarks are essential for objectively comparing and improving LLMs, ensuring evaluations are meaningful and progress is measurable.
What is a baseline in evals?
A simple reference point or model used to judge improvements, such as a naive heuristic or a previous system.
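As a minimal sketch of such a naive heuristic, the snippet below implements a majority-class baseline for a classification-style eval. The function names (`majority_baseline`, `accuracy`) and the toy labels are illustrative, not from any particular eval framework:

```python
from collections import Counter

def majority_baseline(train_labels):
    """Always predict the most common label seen in a reference set."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _example: most_common

def accuracy(predict, examples, labels):
    """Fraction of examples where the prediction matches the label."""
    return sum(predict(x) == y for x, y in zip(examples, labels)) / len(labels)

# Toy data: the baseline learns "pos" is most frequent, then predicts it always.
baseline = majority_baseline(["pos", "pos", "neg", "pos"])
print(round(accuracy(baseline, ["a", "b", "c"], ["pos", "neg", "pos"]), 3))  # 0.667
```

Any model that cannot beat this trivial predictor on the same metric has not demonstrated real task skill.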
What is an oracle in evaluations?
A source of truth that defines correct outputs—often the gold standard or a hypothetical perfect decision-maker used to bound performance.
Why use baselines and oracles in evaluation?
They provide context, help detect overfitting, and show how much improvement is gained beyond simple methods or ideal accuracy.
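One hedged way to make "improvement beyond simple methods" concrete is to report the fraction of the baseline-to-oracle gap a model closes. The function name `normalized_gain` is a hypothetical label for this common normalization, not a standard API:

```python
def normalized_gain(model_score, baseline_score, oracle_score):
    """Fraction of the baseline-to-oracle headroom closed by the model."""
    if oracle_score == baseline_score:
        raise ValueError("no headroom between baseline and oracle")
    return (model_score - baseline_score) / (oracle_score - baseline_score)

# Baseline scores 0.60, oracle (perfect labels) scores 1.00, model scores 0.80:
print(round(normalized_gain(0.80, 0.60, 1.00), 3))  # 0.5 -> half the gap closed
```

A raw score of 0.80 looks strong in isolation; framed between the 0.60 floor and the 1.00 ceiling, it reads as closing only half the available headroom.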
How should you choose and implement baselines?
Pick simple, task-relevant methods, include more than one baseline, and report evaluation metrics consistently.
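The steps above can be sketched as running more than one simple baseline through the same metric and reporting them side by side. The baseline names, toy labels, and the `accuracy` helper are illustrative assumptions:

```python
from collections import Counter
import random

def accuracy(predict, examples, labels):
    """One metric, applied identically to every baseline."""
    return sum(predict(x) == y for x, y in zip(examples, labels)) / len(labels)

train_labels = ["pos", "neg", "pos", "pos"]
test_x = ["e1", "e2", "e3", "e4"]
test_y = ["pos", "neg", "pos", "neg"]

majority = Counter(train_labels).most_common(1)[0][0]
classes = sorted(set(train_labels))
rng = random.Random(0)  # fixed seed so the random baseline is reproducible

baselines = {
    "majority": lambda _x: majority,
    "random":   lambda _x: rng.choice(classes),
}
for name, predict in baselines.items():
    print(f"{name}: accuracy={accuracy(predict, test_x, test_y):.2f}")
```

Reporting both a majority-class and a random baseline under the identical metric makes it obvious when a candidate model's gains are real rather than an artifact of metric choice.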
How do you validate an oracle or ground-truth data?
Use high-quality labels, assess inter-annotator agreement, follow clear guidelines, and adjudicate disagreements.
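A standard way to assess inter-annotator agreement between two labelers is Cohen's kappa, which discounts agreement expected by chance. This is a from-scratch sketch (no external dependencies); the annotator arrays are made-up toy data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n          # raw agreement rate
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    return (observed - expected) / (1 - expected)

ann1 = ["yes", "yes", "no", "yes", "no", "no"]
ann2 = ["yes", "no", "no", "yes", "no", "yes"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.333
```

Low kappa signals that the labeling guidelines are ambiguous and that disagreements should be adjudicated before the labels are trusted as an oracle.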