The Adversarial Mindset
Why the most effective red teamers think in ways no automated system can replicate
In October 2025, researchers from Anthropic, the UK AI Safety Institute, Oxford's OATML group, and the Alan Turing Institute published a joint study demonstrating that injecting just 250 carefully crafted malicious documents into a language model's pretraining corpus was sufficient to reliably backdoor models of any size. Not thousands. Not millions. Two hundred and fifty documents, representing 0.00016% of the training data for a 13-billion parameter model.
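That percentage is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes a Chinchilla-style budget of roughly 20 training tokens per parameter and an average poisoned-document length of about 1,700 tokens; both figures are illustrative assumptions, not numbers taken from the study itself.

```python
# Back-of-the-envelope check of the reported poisoning fraction.
# Assumptions (illustrative, not from the study):
#   - ~20 training tokens per parameter (Chinchilla-style budget)
#   - ~1,700 tokens per poisoned document
params = 13e9            # 13-billion-parameter model
tokens_per_param = 20    # assumed token-to-parameter ratio
tokens_per_doc = 1_700   # assumed average poisoned-document length

total_tokens = params * tokens_per_param   # ~260B training tokens
poisoned_tokens = 250 * tokens_per_doc     # the 250 injected documents
fraction_pct = poisoned_tokens / total_tokens * 100

print(f"{fraction_pct:.5f}%")  # → 0.00016%
```

Under these assumptions, 250 documents land at roughly 0.00016% of the corpus, consistent with the figure the researchers reported.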
The finding illustrates something the security community has understood for decades but the AI industry is only beginning to internalize: the most dangerous vulnerabilities are not found by running automated scans. They are found by people who think differently than the systems they are testing.
The Automation Paradox
There is a tempting narrative in AI safety: that red teaming itself can be fully automated. Use one model to attack another. Scale adversarial testing the same way the industry scales everything else in machine learning, with more compute.
The evidence tells a more nuanced story. According to data compiled by Mindgard in their 2026 benchmarking report, automated red teaming achieves a 69.5% success rate at finding vulnerabilities, compared to 47.6% for manual testing alone. At first glance, that looks like a clear win for automation. But the same analysis found that automated methods identified 37% more unique vulnerability types when combined with human testers than when run in isolation. The operative word is "combined." Automation without human direction converges on the same attack patterns. It finds what it has been trained to find. Human red teamers find what nobody expected.
The Cognitive Science of Breaking Things
What makes a great adversarial evaluator is not technical skill alone, though that matters. It is a particular mode of cognition: the ability to simultaneously understand how a system is supposed to work and imagine how it might fail. This requires holding two contradictory mental models at once, the designer's intent and the attacker's opportunity.
Research in cognitive flexibility shows that this kind of dual-perspective reasoning is genuinely rare. It draws on divergent thinking (generating unexpected possibilities), analogical transfer (applying knowledge from one domain to another), and what security researchers sometimes call "attacker empathy," the ability to model the goals and constraints of an adversary who does not yet exist.
This helps explain why certain attack categories are so effective against language models. In a study of healthcare AI systems, adversarial prompts succeeded against the tested models 94.4% of the time. Humane Intelligence found that simple cognitive manipulation attacks, such as "bad math" misdirection, succeeded 76% of the time. The human attacker does not need to understand the model's architecture. They need to understand its behavioral tendencies, and then subvert them. That is a fundamentally creative act, not a computational one.
Multi-Turn Attacks as Strategic Games
Perhaps the most revealing finding in recent adversarial research concerns multi-turn attacks. The SIEGE framework, published at a leading AI venue, demonstrated a 97% jailbreak success rate against GPT-4 in a single multi-turn run. Separate work on Giskard's GOAT framework reported similar results: success rates reaching 97% on smaller models and 88% on GPT-4-Turbo within five conversational turns. Single-turn attacks are far less effective.
The implication is significant. The most dangerous attacks against AI systems are not clever one-liners. They are strategic conversations. Each turn builds context, establishes trust, shifts the frame of reference. The attacker is not trying to overpower the model's safety training. They are trying to make the model forget it is supposed to be safe. This requires patience, theory of mind, and an intuitive understanding of how conversational dynamics shape behavior, skills that are distinctly human and extraordinarily difficult to automate.
The broader red teaming community has observed that the most effective adversarial evaluators often come from unexpected backgrounds: improvisational theater, investigative journalism, competitive debate, clinical psychology. What these fields share is deep practice in navigating ambiguity and exploiting the gap between what someone says and what they mean.
The Difficulty Gradient
Not all adversarial tasks are created equal. Benchmarking data shows a steep difficulty gradient: easy adversarial tasks succeed 31.7% of the time, medium tasks 10.7%, and hard tasks just 1.8%. The gap between "medium" and "hard" is where the real value lives, and it is precisely where automated approaches struggle most.
Hard adversarial tasks require what might be called compositional creativity: combining multiple attack vectors, domain knowledge, and contextual awareness in ways that have never been tried before. A biologist who understands protein folding might craft a prompt that a computer scientist never would. A linguist might exploit code-switching between languages in ways that monolingual testers cannot conceive. The most effective red teaming is inherently interdisciplinary.
What We Look For
At PANTERAI, we do not recruit red teamers by scanning resumes for keywords. We recruit by presenting challenges that only a certain kind of mind can solve. Our evaluation process is itself an adversarial test, not of credentials, but of the capacity to think in ways that surface what others miss.
We are looking for people who read a safety policy and immediately see the gaps. People who can hold a five-turn conversation with a model and make it contradict its own guidelines without ever saying anything explicitly prohibited. People who understand that the most effective attack is not the loudest one; it is the one nobody thought to test for.
The future of AI safety will not be decided by the models themselves. It will be decided by the people who know how to break them, and who have the discipline to do it systematically. The adversarial mindset is not a skill you learn from a textbook. It is a way of seeing the world that the best red teamers have always had.