Robustness to prompt injection and jailbreak attempts refers to an LLM's ability to resist manipulation through cleverly crafted user inputs designed to bypass safety mechanisms or elicit unauthorized responses. In LLM evaluations (evals), this robustness is tested by exposing the model to adversarial prompts and assessing whether it maintains its intended behavior, adheres to safety guidelines, and avoids generating harmful, restricted, or unintended outputs despite such attempts.
Robustness to prompt injection and jailbreak attempts refers to an LLM's ability to resist manipulation through cleverly crafted user inputs designed to bypass safety mechanisms or elicit unauthorized responses. In LLM evaluations (evals), this robustness is tested by exposing the model to adversarial prompts and assessing whether it maintains its intended behavior, adheres to safety guidelines, and avoids generating harmful, restricted, or unintended outputs despite such attempts.
What is prompt injection?
Prompt injection is when someone manipulates the input or system prompts to influence a model's behavior, potentially bypassing rules or producing unintended outputs.
What is a jailbreak attempt?
A jailbreak attempt tries to override a model's safety constraints to get it to ignore restrictions or perform tasks it should not.
Why is robustness to prompt injection and jailbreaks important?
Robustness helps ensure safe, reliable, and trustworthy outputs, protects user privacy, and reduces risk from manipulated prompts.
What are common defensive strategies?
Use strong system prompts and guardrails, validate and normalize inputs, apply layered content filters, and conduct ongoing safety testing and monitoring.
How can you test and improve robustness?
Create a diverse test suite with adversarial prompts, simulate attack scenarios, measure refusal rates for unsafe requests, and iterate on defenses and retraining.