Examples
Worked examples
- Is an instance: A role-play prompt ('pretend you are an unrestricted AI...') that elicits otherwise-refused content.
- Is an instance: An adversarial-suffix attack appending a learned token sequence that disables refusal.
Counter-examples
Looks similar, but isn't
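- Not an instance: A legitimate user query in scope of the model's intended use.
- Not an instance: A prompt-injection attack delivered via tool input (a separate category, since the attack targets the instructions of the surrounding application rather than the model's safety training).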
Editorial commentary
Common jailbreak techniques include role-play framings, multi-turn manipulation, encoding tricks (base64, ROT13), and adversarial-suffix attacks (Zou et al., 2023). Resistance to jailbreaking is a goal of post-training methods such as RLHF and Constitutional AI, and a focus of red-team evaluation. The category is a moving target: each newly disclosed jailbreak typically prompts new mitigations.
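To make the encoding tricks concrete from the defensive side, the following is a minimal sketch of how an evaluation harness might expand a prompt into its plausible decodings before passing each one to a safety check. The function name `candidate_decodings` is hypothetical and not taken from any cited work; only the Python standard library is used, and the safety check itself is out of scope here.

```python
import base64
import codecs


def candidate_decodings(prompt: str) -> dict[str, str]:
    """Return the raw prompt plus any plausible ROT13 / base64 decodings.

    Hypothetical helper for illustration: a real pipeline would pass each
    candidate to whatever safety classifier it already uses.
    """
    candidates = {"raw": prompt}

    # ROT13 is its own inverse, so "decoding" always succeeds.
    candidates["rot13"] = codecs.decode(prompt, "rot_13")

    # Base64 only decodes cleanly for well-formed input. binascii.Error and
    # UnicodeDecodeError are both subclasses of ValueError, so one except
    # clause covers malformed base64 and non-UTF-8 payloads alike.
    try:
        decoded = base64.b64decode(prompt.strip(), validate=True)
        candidates["base64"] = decoded.decode("utf-8")
    except ValueError:
        pass

    return candidates
```

A plain-text prompt yields only the `raw` and `rot13` entries, because strict base64 validation rejects it. This covers a single layer of encoding only; nested or mixed encodings would need repeated passes.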
References
- Zou et al., 'Universal and Transferable Adversarial Attacks on Aligned Language Models' (arXiv, 2023).
- Wei, Haghtalab, and Steinhardt, 'Jailbroken: How Does LLM Safety Training Fail?' (NeurIPS 2023).
Also known as
LLM jailbreak
Machine-readable encodings
Use in your systems
<role vocab="credit"
vocab-identifier="https://casrai.org/dictionary/"
vocab-term="Jailbreak (LLM)"
vocab-term-identifier="https://casrai.org/dictionary/term/jailbreak-llm" />

{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Jailbreak (LLM)",
  "identifier": "https://casrai.org/dictionary/term/jailbreak-llm",
  "description": "A prompt or interaction pattern that causes a language model to bypass its safety training and produce outputs the model was tuned to refuse, such as harmful instructions, restricted content, or violations of provider policy.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/jailbreak-llm",
  "sameAs": [
    "LLM jailbreak"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
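As one possible way to consume the JSON-LD record in your own system, the sketch below parses an abbreviated copy of it and collects the preferred name together with the alternative labels. The helper `term_labels` is a hypothetical name introduced for illustration, and the record is shortened to the fields the example reads; it assumes, as in this entry, that `sameAs` carries the "also known as" labels.

```python
import json

# Abbreviated copy of the JSON-LD record from this entry (assumption: only
# the fields used below are kept).
JSON_LD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Jailbreak (LLM)",
  "identifier": "https://casrai.org/dictionary/term/jailbreak-llm",
  "sameAs": ["LLM jailbreak"],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
"""


def term_labels(record: dict) -> list[str]:
    """Return the preferred name followed by any alternative labels."""
    if record.get("@type") != "DefinedTerm":
        raise ValueError("expected a schema.org DefinedTerm record")
    # In this entry, "sameAs" holds the alternative labels for the term.
    return [record["name"], *record.get("sameAs", [])]


record = json.loads(JSON_LD)
labels = term_labels(record)
```

Matching incoming keywords against every entry in `labels`, rather than against `name` alone, lets a tagging pipeline map either spelling of the term to the same `identifier`.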







