Skip to main content
v2026.1714 entries · CC-BY 4.0
CASRAI

Definition · Plain-language

AI red teaming

AI red teaming is structured adversarial testing that deliberately probes an AI system to find flaws, harms and vulnerabilities.

CASRAI research-methods explainer — AI red teaming

The step most authors miss

Doing CRediT right? Don’t stop at the statement.

A CRediT statement credits you inside one paper. The recognition CRediT was built for happens when those roles are tied to you, persistently. Sign in with your ORCID — free — and claim your CRediT contributions on casrai.org, the home of the standard. They become a verified, portable part of your identity, not a line that disappears into one PDF.

Free: claim your contributions, then export a journal-ready CRediT statement, schema.org structured data, JATS XML, CSV or BibTeX — and preview your public profile. A membership publishes that profile publicly and verifies the journals you serve.

What red teaming does

Red teaming takes an adversarial stance: rather than confirming a system works as intended, testers actively try to make it fail in harmful ways. For generative AI this means attempting to elicit unsafe content, biased or discriminatory outputs, privacy leaks or instructions for misuse, and trying to bypass guardrails through techniques such as prompt injection or jailbreaking. Testers may be internal specialists, external experts or a mix, and may use manual probing, automated attack generation, or both. The output is a catalogue of discovered weaknesses that development and governance teams can then address.

Why it has become central

Generative AI behaves in open-ended ways that ordinary testing struggles to characterise, because the space of possible inputs and harmful outputs is vast. Red teaming addresses this by probing the edges where harm is most likely. It has consequently moved from a niche security practice to a mainstream governance expectation: NIST’s Generative AI Profile points to adversarial testing as part of managing generative-AI risk, and emerging regulation and policy increasingly look for evidence that high-impact systems have been red-teamed. It provides assurance that safeguards have been stress-tested, not merely declared.

Red teaming within governance

Red teaming is one technique within AI risk management and assurance, complementing audit and ongoing monitoring. Where an audit assesses a system against defined criteria and monitoring watches live behaviour, red teaming actively hunts for unknown failure modes. Findings feed back into the Manage and Govern activities: weaknesses are prioritised, safeguards strengthened, and documentation updated. Because new attack techniques and model behaviours emerge continually, red teaming is most effective as a recurring exercise rather than a single pre-launch event, especially for systems whose capabilities or exposure grow over time.

Key facts

At a glance

  • Definition: structured adversarial testing to find flaws, harms and vulnerabilities in AI
  • Origin: adapted from cybersecurity red teaming
  • Focus: especially generative AI (unsafe outputs, bias, jailbreaks)
  • Methods: manual probing, automated attacks, or both
  • Standards link: referenced in NIST’s Generative AI Profile
  • Cadence: most effective as a recurring, not one-off, exercise

Common misconceptions

What people often get wrong

Often heard: AI red teaming is just standard software testing.

Actually: Standard testing confirms intended behaviour; red teaming is adversarial, deliberately seeking harmful and out-of-scope failures such as unsafe content or bypassed safeguards. The mindset and techniques differ markedly.

Often heard: Red teaming is only about cybersecurity vulnerabilities.

Actually: For AI, red teaming also targets harmful, biased or policy-violating outputs and safety failures, not only technical security holes. It spans content, fairness and safety risks alongside security.

Often heard: A single red-teaming exercise makes a system safe.

Actually: New attack techniques and model behaviours emerge continually, so red teaming is most effective when repeated. One exercise reduces known risks but cannot guarantee future safety as the system and threats evolve.

Referenced across the research world

University of Cambridge logoColumbia University logoUniversity of Edinburgh logoHarvard University logoUniversity of Oxford logoPrinceton University logoStanford School of Medicine logoUniversity College London logoORCID logoCrossref logoUniversity of Cambridge logoColumbia University logoUniversity of Edinburgh logoHarvard University logoUniversity of Oxford logoPrinceton University logoStanford School of Medicine logoUniversity College London logoORCID logoCrossref logo
  • University of Cambridge logo
  • Columbia University logo
  • University of Edinburgh logo
  • Harvard University logo
  • University of Oxford logo
  • Princeton University logo
  • Stanford School of Medicine logo
  • University College London logo
  • ORCID logo
  • Crossref logo

View CASRAI adoption →