Pairwise preference testing is a method in which evaluators are shown two model outputs and select the one they prefer. The Bradley-Terry model is a statistical model for such pairwise comparisons: it assigns each option a strength score, and the probability that one option is preferred over another depends on the ratio of their scores. In LLM evaluations, these methods quantify and rank model performance by aggregating many human or automated preferences between outputs into a single set of scores, rather than relying on any one comparison.
What is pairwise preference testing?
A method where participants compare two items at a time and indicate which they prefer, producing data to infer overall preferences or rankings.
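As a minimal sketch of the data this produces, assuming each judgment is recorded as a (winner, loser) pair, the comparisons can be tallied into win counts. The item names and the `tally_wins` helper are hypothetical, chosen only for illustration.

```python
from collections import Counter


def tally_wins(judgments):
    """Count how often each item was preferred over each other item.

    judgments: iterable of (winner, loser) pairs, one per comparison.
    Returns a Counter keyed by (winner, loser).
    """
    wins = Counter()
    for winner, loser in judgments:
        if winner == loser:
            raise ValueError("an item cannot be compared with itself")
        wins[(winner, loser)] += 1
    return wins


# Hypothetical judgments: each tuple is (preferred output, other output).
judgments = [
    ("model_a", "model_b"),
    ("model_a", "model_b"),
    ("model_b", "model_a"),
    ("model_a", "model_c"),
    ("model_c", "model_b"),
]
wins = tally_wins(judgments)
```

Here `wins[("model_a", "model_b")]` is the number of times model_a was preferred over model_b. Pairs that were never observed simply count as zero.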
What is the Bradley-Terry model?
A probabilistic model for pairwise choices. Each item i has a positive weight w_i, and the probability that i is preferred over j is w_i/(w_i + w_j). Equivalently, the log-odds that i is preferred over j are log(w_i) − log(w_j).
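The formula can be checked numerically with a short sketch. The function names below are illustrative and do not come from any particular library.

```python
import math


def preference_probability(w_i, w_j):
    """P(i preferred over j) under Bradley-Terry, for positive weights."""
    if w_i <= 0 or w_j <= 0:
        raise ValueError("weights must be positive")
    return w_i / (w_i + w_j)


def log_odds(w_i, w_j):
    """Log-odds that i is preferred over j: log(w_i) - log(w_j)."""
    return math.log(w_i) - math.log(w_j)


p = preference_probability(3.0, 1.0)  # 3 / (3 + 1) = 0.75

# The log-odds computed from p agree with log(w_i) - log(w_j).
assert math.isclose(math.log(p / (1 - p)), log_odds(3.0, 1.0))
```

An item with three times the weight of its opponent is preferred three times out of four, and the two preference probabilities for any pair sum to 1.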
How are the parameters of a Bradley-Terry model estimated?
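Parameters are estimated by maximum likelihood from the observed pairwise outcomes. Because the model is a logistic regression on differences of log-weights, it can be fit with standard logistic-regression software or with an iterative scheme such as the minorization-maximization (MM) update.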
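One such iterative method is the classical MM (minorization-maximization) update, sketched below in plain Python. This is an illustration under stated assumptions, not a reference implementation: the `wins` matrix layout, the function name, and the stopping rule are choices made for this example, and the sketch assumes every item appears in at least one comparison and that the comparisons connect all items.

```python
def fit_bradley_terry(wins, n_iter=200, tol=1e-10):
    """Estimate Bradley-Terry weights with the MM update.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns a list of weights normalized to sum to 1.
    """
    n = len(wins)
    w = [1.0 / n] * n  # start from equal strengths
    for _ in range(n_iter):
        new_w = []
        for i in range(n):
            # Total number of comparisons item i won.
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Comparisons involving i, each scaled by 1 / (w_i + w_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (w[i] + w[j])
                for j in range(n)
                if j != i
            )
            new_w.append(total_wins / denom)
        # Renormalize so the weights stay on a fixed scale.
        total = sum(new_w)
        new_w = [x / total for x in new_w]
        converged = max(abs(a - b) for a, b in zip(new_w, w)) < tol
        w = new_w
        if converged:
            break
    return w


# Hypothetical win counts for three items, ten comparisons per pair.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
weights = fit_bradley_terry(wins)
```

With only two items the estimate reduces to the observed win fractions, which makes a convenient sanity check. In practice a maintained statistics package is usually a better choice than a hand-written loop.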
What are common assumptions and identifiability issues?
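Assumes independence of comparisons given the item strengths and a single latent strength per item. The preference probabilities depend only on ratios of weights, so multiplying every weight by the same constant leaves the model unchanged. To identify parameters, constrain the scale (e.g., sum of weights equals 1 or fix a reference item).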
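The scale ambiguity behind the identifiability constraint can be seen in a small sketch. The weights here are made up for illustration, and `normalize` is a hypothetical helper implementing the sum-to-1 constraint.

```python
import math


def preference_probability(w_i, w_j):
    """P(i preferred over j) under Bradley-Terry."""
    return w_i / (w_i + w_j)


def normalize(weights):
    """Rescale positive weights to sum to 1 (one common constraint)."""
    total = sum(weights)
    return [w / total for w in weights]


weights = [6.0, 3.0, 1.0]
scaled = [10.0 * w for w in weights]  # same model on a different scale

# Every pairwise probability is unchanged by the rescaling.
for i in range(len(weights)):
    for j in range(len(weights)):
        if i != j:
            assert math.isclose(
                preference_probability(weights[i], weights[j]),
                preference_probability(scaled[i], scaled[j]),
            )

# Normalizing picks one representative among the equivalent weight vectors.
canonical = normalize(scaled)
```

Both `weights` and `scaled` normalize to the same vector, so the constraint removes the ambiguity without changing any predicted preference.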
Where are Bradley-Terry models used?
In ranking or selecting among items from pairwise data, such as product preferences, sports matchups, or learning user preferences.