Aggregation methods are techniques used to combine multiple evaluators’ responses into a single, reliable outcome in tasks like LLM evaluations. Majority Vote selects the answer chosen by most annotators. Dawid-Skene is a probabilistic model that estimates both true answers and annotator reliability. MACE (Multi-Annotator Competence Estimation) infers the most likely true label and annotator expertise. These methods help improve the accuracy and trustworthiness of evaluation results.
What is majority vote in crowdsourcing?
The simplest way to combine labels: for each item, assign the label that the most workers chose. It’s fast but ignores how reliable each worker is.
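As a minimal sketch of the idea, assuming votes are stored as a dictionary from item id to the list of labels it received (the function name, the data layout, and the alphabetical tie-break are illustrative choices, not part of any standard):

```python
from collections import Counter

def majority_vote(labels_by_item):
    """Return the most frequent label for each item.

    labels_by_item: dict mapping item id -> list of labels from different workers.
    Ties are broken by taking the alphabetically first label (an arbitrary assumption).
    """
    result = {}
    for item, labels in labels_by_item.items():
        counts = Counter(labels)
        top = max(counts.values())
        winners = sorted(label for label, c in counts.items() if c == top)
        result[item] = winners[0]
    return result

# Hypothetical votes from three workers on three LLM outputs
votes = {
    "q1": ["good", "good", "bad"],
    "q2": ["bad", "bad", "good"],
    "q3": ["good", "bad"],  # a tie
}
aggregated = majority_vote(votes)
print(aggregated)  # {'q1': 'good', 'q2': 'bad', 'q3': 'bad'}
```

Every vote counts equally here, which is exactly why a single careless worker can flip the result on items with few labels.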
What is the Dawid-Skene model?
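A probabilistic model that treats the true label as latent and models each worker’s labeling behavior with a confusion matrix: the probability that the worker gives each label, given each possible true label. It uses the EM algorithm to jointly estimate the true labels, the class priors, and the workers’ confusion matrices, and it often improves on majority vote when worker quality varies.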
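The EM loop can be sketched in plain Python as follows. This is a simplified illustration rather than a reference implementation: the function name, the (item, worker, label) input format, the vote-fraction initialization, and the small smoothing constant are all assumptions made for the example.

```python
from collections import defaultdict

def dawid_skene(annotations, classes, n_iter=50, smoothing=0.01):
    """annotations: list of (item, worker, label) triples."""
    items = sorted({i for i, _, _ in annotations})
    workers = sorted({w for _, w, _ in annotations})
    by_item = defaultdict(list)
    for i, w, l in annotations:
        by_item[i].append((w, l))

    # Initialize posteriors over the true label with per-item vote fractions
    post = {}
    for i in items:
        counts = {c: 0.0 for c in classes}
        for _, l in by_item[i]:
            counts[l] += 1.0
        total = sum(counts.values())
        post[i] = {c: counts[c] / total for c in classes}

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices conf[w][true][given]
        prior = {c: sum(post[i][c] for i in items) / len(items) for c in classes}
        conf = {w: {t: {l: smoothing for l in classes} for t in classes} for w in workers}
        for i in items:
            for w, l in by_item[i]:
                for t in classes:
                    conf[w][t][l] += post[i][t]
        for w in workers:
            for t in classes:
                row_sum = sum(conf[w][t].values())
                for l in classes:
                    conf[w][t][l] /= row_sum
        # E-step: posterior over each item's true label given current parameters
        for i in items:
            scores = {}
            for t in classes:
                s = prior[t]
                for w, l in by_item[i]:
                    s *= conf[w][t][l]
                scores[t] = s
            z = sum(scores.values())
            post[i] = {t: scores[t] / z for t in classes}

    labels = {i: max(classes, key=lambda c: post[i][c]) for i in items}
    return labels, conf

# Toy data: workers "a" and "b" always agree, worker "c" always gives the opposite label
truth = {1: "pos", 2: "pos", 3: "pos", 4: "neg", 5: "neg", 6: "neg"}
flip = {"pos": "neg", "neg": "pos"}
data = []
for item, t in truth.items():
    data += [(item, "a", t), (item, "b", t), (item, "c", flip[t])]

ds_labels, ds_conf = dawid_skene(data, ["pos", "neg"])
print(ds_labels)
print(ds_conf["c"]["pos"])  # worker c's row for true label "pos"
```

On this toy data the estimated confusion matrix for worker "c" puts most of its mass on the wrong label, so the model has learned that "c" is systematically inverted rather than merely noisy. For items with many annotations, a practical version would multiply in log space to avoid underflow.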
What is MACE in aggregation methods?
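MACE (Multi-Annotator Competence Estimation) is a probabilistic model that treats each individual annotation as either competent, in which case the annotator reports the true label, or spam, in which case the annotator draws a label from their own habitual distribution regardless of the truth. It is fitted with EM or variational Bayes to jointly infer the most likely true labels and a competence score for each annotator.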
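A stripped-down sketch of that generative story, fitted with plain EM, might look like the following. It assumes a uniform prior over true labels and omits the random restarts and variational Bayes training that full implementations typically use. The function name, input format, and the clamping of competence to (eps, 1 - eps) are assumptions made for this example.

```python
from collections import defaultdict

def mace_em(annotations, classes, n_iter=50, eps=1e-6):
    """annotations: list of (item, worker, label) triples."""
    items = sorted({i for i, _, _ in annotations})
    workers = sorted({w for _, w, _ in annotations})
    by_item = defaultdict(list)
    n_by_worker = defaultdict(int)
    for i, w, l in annotations:
        by_item[i].append((w, l))
        n_by_worker[w] += 1

    theta = {w: 0.5 for w in workers}  # competence: P(worker is not spamming)
    xi = {w: {c: 1.0 / len(classes) for c in classes} for w in workers}  # spam label distribution
    post = {}

    for _ in range(n_iter):
        honest = {w: 0.0 for w in workers}                       # expected non-spam annotations
        spam = {w: {c: eps for c in classes} for w in workers}   # expected spam counts per label
        for i in items:
            # E-step: posterior over the true label of item i
            scores = {}
            for t in classes:
                s = 1.0 / len(classes)
                for w, l in by_item[i]:
                    s *= (theta[w] if l == t else 0.0) + (1 - theta[w]) * xi[w][l]
                scores[t] = s
            z = sum(scores.values())
            post[i] = {t: scores[t] / z for t in classes}
            # Expected probability that each annotation was competent rather than spam
            for w, l in by_item[i]:
                p_honest = post[i][l] * theta[w] / (theta[w] + (1 - theta[w]) * xi[w][l])
                honest[w] += p_honest
                spam[w][l] += 1.0 - p_honest
        # M-step: re-estimate competence and spam distributions
        for w in workers:
            theta[w] = min(max(honest[w] / n_by_worker[w], eps), 1 - eps)
            total = sum(spam[w].values())
            xi[w] = {c: spam[w][c] / total for c in classes}

    labels = {i: max(classes, key=lambda c: post[i][c]) for i in items}
    return labels, theta

# Toy data: workers "a" and "b" always agree, worker "c" always disagrees with them
truth = {1: "pos", 2: "pos", 3: "pos", 4: "neg", 5: "neg", 6: "neg"}
flip = {"pos": "neg", "neg": "pos"}
data = []
for item, t in truth.items():
    data += [(item, "a", t), (item, "b", t), (item, "c", flip[t])]

mace_labels, competence = mace_em(data, ["pos", "neg"])
print(mace_labels)
print(competence)
```

In this sketch the competence of "a" and "b" rises toward 1 while that of "c" falls toward 0. Unlike a confusion matrix, the competence score cannot express that "c" is consistently inverted; it can only down-weight that annotator as unreliable.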
When should I use each method?
Use majority vote for simplicity and speed when you have many labels per item and reasonably reliable workers. Use Dawid-Skene when worker quality varies, or when workers make systematic errors between particular classes, and you have enough labels per worker to estimate a confusion matrix. Use MACE when you suspect that some annotators are spamming or answering carelessly and you want a per-annotator competence score to identify and down-weight them.