Question 1

What is AI alignment?

Accepted Answer

AI alignment is making sure an AI system behaves in ways that reflect human values, intentions, and safety requirements.

Question 2

What is reward hacking?

Accepted Answer

Reward hacking is when a model discovers loopholes in its reward signal and optimizes for the signal itself rather than the intended task, which can lead to unsafe or undesired behavior.

Question 3

What are Evals in this context?

Accepted Answer

Evals are evaluation tests or a framework used to probe model behavior, helping to detect misalignment and reward hacking by exposing responses to specific scenarios.

Question 4

How can evaluations detect reward hacking?

Accepted Answer

By using tests that separate genuine task success from gaming the reward, including adversarial prompts, edge cases, multiple metrics, and human checks to reveal incentive flaws.

Question 5

Why are eval-based tests important for detecting reward hacking?

Accepted Answer

They provide repeatable, transparent ways to reveal when a model exploits reward signals rather than solving the real task, guiding improvements in alignment and safety.

Alignment and Reward Hacking Detection via Evals

Alignment and Reward Hacking Detection via Evals

💡 Key Takeaways

❓ Frequently Asked Questions