v2026.1714 entries · CC-BY 4.0
Dictionary term · Track C · Stable · v2026.2

Jailbreak (LLM)

A prompt or interaction pattern that causes a language model to bypass its safety training and produce outputs the model was tuned to refuse, such as harmful instructions, restricted content, or violations of provider policy.

By CASRAI Editorial Board · Last updated 21 May 2026

Examples

Worked examples

  • Is an instance

    A role-play prompt ('pretend you are an unrestricted AI...') that elicits otherwise-refused content.

  • Is an instance

    An adversarial-suffix attack appending a learned token sequence that disables refusal.

Counter-examples

Looks similar, but isn't

  • Not an instance

    A legitimate user query in scope of the model's intended use.

  • Not an instance

    A prompt-injection attack delivered via tool input (different category).

Editorial commentary

Common jailbreak techniques include role-play framings, multi-turn manipulation, encoding tricks (base64, ROT13), and adversarial-suffix attacks (Zou et al., 2023). Resistance to jailbreaking is a goal of post-training methods (RLHF, constitutional AI) and a focus of red-team evaluation. The category is a moving target: each newly disclosed jailbreak typically prompts new mitigations.
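The encoding tricks mentioned above work against filters that inspect only the literal text of a prompt. The following is a minimal sketch of one defensive response, assuming a toy substring blocklist: the input is checked in its raw form and again after candidate base64 and ROT13 decodings. The names `BLOCKED_TERMS`, `candidate_decodings`, and `is_flagged` are hypothetical and introduced only for illustration; a production system would rely on a trained classifier rather than substring matching.

```python
import base64
import binascii
import codecs

# Hypothetical placeholder blocklist (an assumption for this sketch);
# real systems use trained classifiers, not substring matching.
BLOCKED_TERMS = {"forbidden-topic"}


def candidate_decodings(text: str) -> list[str]:
    """Return the raw text plus its ROT13 and (if valid) base64 decodings."""
    candidates = [text, codecs.decode(text, "rot13")]
    try:
        # validate=True rejects input containing non-base64 characters.
        decoded = base64.b64decode(text.strip(), validate=True)
        candidates.append(decoded.decode("utf-8"))
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass  # not base64-encoded UTF-8 text; keep the other candidates
    return candidates


def is_flagged(text: str) -> bool:
    """Flag the input if any candidate decoding contains a blocked term."""
    return any(
        term in candidate.lower()
        for candidate in candidate_decodings(text)
        for term in BLOCKED_TERMS
    )
```

A check on the raw string alone would miss the base64 or ROT13 form of a blocked phrase, while `is_flagged` catches both. It does nothing against role-play framings, multi-turn manipulation, or adversarial suffixes, which is one reason the commentary describes the category as a moving target.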

References

  • Zou et al., 'Universal and Transferable Adversarial Attacks on Aligned Language Models' (arXiv 2023).

  • Wei, Haghtalab, and Steinhardt, 'Jailbroken: How Does LLM Safety Training Fail?' (NeurIPS 2023).

Also known as

LLM jailbreak

Machine-readable encodings

Use in your systems

JATS XML <role> element
xml
<role vocab="credit"
      vocab-identifier="https://casrai.org/dictionary/"
      vocab-term="Jailbreak (LLM)"
      vocab-term-identifier="https://casrai.org/dictionary/term/jailbreak-llm" />
Schema.org DefinedTerm (JSON-LD)
json
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Jailbreak (LLM)",
  "identifier": "https://casrai.org/dictionary/term/jailbreak-llm",
  "description": "A prompt or interaction pattern that causes a language model to bypass its safety training and produce outputs the model was tuned to refuse, such as harmful instructions, restricted content, or violations of provider policy.",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/",
  "url": "https://casrai.org/dictionary/term/jailbreak-llm",
  "alternateName": [
    "LLM jailbreak"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
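
As one way to consume the JSON-LD record above, the sketch below parses it with Python's standard `json` module and checks the `@type` before reading fields. The abridged `RECORD` string and the helper name `load_defined_term` are assumptions made for illustration, not part of any CASRAI tooling; a real consumer would load the full record from wherever it is published.

```python
import json

# Abridged copy of the record above, inlined so the sketch is self-contained.
RECORD = """
{
  "@context": "https://schema.org",
  "@type": "DefinedTerm",
  "name": "Jailbreak (LLM)",
  "identifier": "https://casrai.org/dictionary/term/jailbreak-llm",
  "inDefinedTermSet": "https://casrai.org/dictionary/domain/ai-and-ml-research-outputs/"
}
"""


def load_defined_term(raw: str) -> dict:
    """Parse a JSON-LD string and confirm it describes a DefinedTerm."""
    term = json.loads(raw)
    if term.get("@type") != "DefinedTerm":
        raise ValueError(f"expected a DefinedTerm, got {term.get('@type')!r}")
    return term


term = load_defined_term(RECORD)
print(term["name"], "->", term["identifier"])
```

Checking `@type` first keeps a caller from silently treating some other Schema.org object as a dictionary term.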
