Task-specific evaluation in question answering assesses how accurately a language model answers questions. The exact match (EM) metric checks whether the model’s answer matches a reference answer word for word, while the F1 score measures the token-level overlap between the predicted and reference answers, balancing precision and recall. Together, these metrics indicate how precise and relevant a model’s answers are, and how reliable it is for question answering tasks in LLM evaluations.
What is task-specific evaluation in Question Answering?
It measures how well a QA system answers questions by comparing predictions to ground-truth answers, using metrics like Exact Match (EM) and F1.
What is Exact Match (EM) in QA evaluation?
EM is the percentage of predictions that exactly match the reference answer after normalization (e.g., lowercasing and removing punctuation).
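A minimal sketch of this in Python, loosely following SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); the helper names normalize_answer and exact_match are illustrative, not a specific library API:

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))
```

Corpus-level EM is then the mean of these per-question scores, usually reported as a percentage.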
What is the F1 score in QA evaluation?
F1 is the token-level harmonic mean of precision and recall between the predicted answer and the reference answer(s), rewarding partial overlaps.
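A matching sketch of token-level F1, assuming the normalize_answer helper from the EM sketch above: overlapping token counts give precision relative to the prediction length and recall relative to the reference length.

```python
from collections import Counter

def f1_score(prediction, reference):
    """Token-level F1 between a prediction and a single reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # If either side is empty after normalization, F1 is 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```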
How do EM and F1 differ in practice?
EM is binary (correct or incorrect for a question), while F1 reflects partial correctness and can be non-zero even when EM is zero.
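For example, if the reference answer is "New York City" and the prediction is "New York", EM is 0, but precision is 2/2 = 1.0, recall is 2/3, and F1 = 2 · (1.0 × 2/3) / (1.0 + 2/3) = 0.8.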
How can I improve QA performance under EM and F1 metrics?
Normalize text, use multiple gold answers per question, improve tokenization, handle synonyms and paraphrases, and apply post-processing to align predictions with the expected surface forms.
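As a rough sketch of the "multiple gold answers" point, corpus-level scoring can take the best score over all references for each question and then average; this assumes the exact_match and f1_score helpers sketched above:

```python
def best_over_references(metric_fn, prediction, references):
    # Score the prediction against every gold answer and keep the best match.
    return max(metric_fn(prediction, ref) for ref in references)

def evaluate(predictions, references_per_question):
    """Average EM and F1 over a dataset, taking the max over references per question."""
    em_total, f1_total = 0.0, 0.0
    for pred, refs in zip(predictions, references_per_question):
        em_total += best_over_references(exact_match, pred, refs)
        f1_total += best_over_references(f1_score, pred, refs)
    n = len(predictions)
    return {"exact_match": 100.0 * em_total / n, "f1": 100.0 * f1_total / n}
```

Taking the maximum over references keeps the metrics from penalizing answers that match one acceptable surface form but not another.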