March 2026·7 min read

The Evaluation Gap

Why the most consequential role in AI is the one nobody talks about

On August 2, 2025, the European Union's AI Act reached a critical enforcement milestone, bringing general-purpose AI obligations and high-risk system requirements into effect. Organizations deploying high-risk AI systems without adequate adversarial testing now face administrative fines of up to 35 million euros, or 7% of total worldwide annual turnover, whichever is higher. Across the Atlantic, the National Institute of Standards and Technology continues to develop AI evaluation frameworks, and multiple U.S. states have enacted or proposed their own AI governance legislation.

These are not voluntary guidelines. They are enforceable requirements. And they have created a problem that the AI industry has not solved: who, exactly, is qualified to evaluate the most powerful systems ever built?

A Market Without a Workforce

The market for AI red teaming services is growing rapidly. Research and Markets projects the sector will reach $4.8 billion by 2029, expanding at a 28.6% compound annual growth rate. Broader estimates from Dataintelo place the total AI red teaming market, including automated tooling, at over $15 billion by 2033. Regardless of which projection proves most accurate, the trajectory is clear: demand for adversarial evaluation is scaling faster than nearly any other category in AI infrastructure.

But the supply side is nearly empty. There is no university degree in "AI red teaming." There is no standardized certification. The people who are genuinely good at this work come from scattered backgrounds: cybersecurity, cognitive science, linguistics, philosophy, competitive hacking. Most of them do not yet know this field exists as a career path.

This is the evaluation gap: a rapidly growing demand for a skill set that the traditional talent pipeline does not produce.

What the Labs Have Learned

Every major AI laboratory has arrived at the same conclusion through different paths. OpenAI built its Red Teaming Network by recruiting domain experts from dozens of fields, not just computer scientists, but chemists, political scientists, child safety advocates, and linguists. Their published selection criteria state explicitly: "Demonstrated expertise or experience in a particular domain relevant to red teaming" is valued above prior AI experience. The network prioritizes diverse perspectives over technical homogeneity.

Anthropic's Frontier Red Team focuses on the highest-stakes scenarios: biosecurity, autonomous systems, and strategic deception. Their published work on challenges in red teaming AI systems describes how the most dangerous model behaviors emerge not from simple prompts but from sustained, context-rich interactions that require deep domain knowledge to even recognize as harmful. They have also documented the iterative process of moving from qualitative human red teaming to quantitative automated evaluations, a pipeline that depends on human expertise at every stage.

DeepMind's Frontier Safety Framework treats the assessment of AI systems as a rigorous discipline in its own right, with its own methodology, standards, and failure modes. Meta's Purple Llama initiative has open-sourced safety benchmarks while acknowledging that static benchmarks alone are insufficient. The common thread across all of these efforts: every lab that has invested seriously in safety has concluded that human evaluators with diverse expertise are not optional.

The Scale Problem

In large-scale AI red teaming competitions, researchers have identified more than 60,000 unique exploit scenarios across frontier models. That is not a number achievable with a small internal team. It requires a distributed network of evaluators, each bringing different knowledge, different intuitions, and different attack strategies.

The economics of adversarial evaluation are unusual. Unlike most AI work, where scale means more compute, adversarial evaluation scales with cognitive diversity. Ten evaluators with identical backgrounds will find fewer vulnerabilities than five evaluators from five different fields. The value is not in volume; it is in variance.

This is why traditional hiring does not work for this problem. You cannot write a job description for "person who thinks about AI failure in ways we have not imagined yet." You have to build a system that identifies that capacity directly.

The Independent Evaluator Advantage

There is a structural reason why the best adversarial evaluators tend to work independently rather than as full-time employees at any single lab. Red teaming requires fresh eyes. An evaluator who has spent months inside one organization's safety framework begins to internalize its assumptions, the very assumptions they are supposed to challenge.

Independent contractors who work across multiple systems, multiple organizations, and multiple domains maintain what organizational psychologists describe as an outsider perspective. They see patterns that insiders miss. They bring attack strategies from one context into another. They are less likely to self-censor because they have not been socialized into the team's blind spots.

This is not a flaw in the talent model. It is a feature. The best evaluators should be independent precisely because independence is what makes their perspective valuable.

Closing the Gap

PANTERAI exists because we believe the evaluation gap is one of the most important unsolved problems in AI deployment. Not alignment in the abstract. Not capabilities research. The concrete, practical question of whether the systems being deployed into the world actually work the way they are supposed to, and what happens when they do not.

We are building a network of capable adversarial evaluators. Not by asking for resumes. Not by filtering for pedigree. By presenting challenges that only the right kind of mind can solve, and letting demonstrated ability speak for itself.

If you have ever found a vulnerability that nobody else saw. If you have ever broken a system by thinking about it from an angle its designers never considered. If you have ever looked at a safety benchmark and thought "this does not test what actually matters," you may be exactly who we are looking for.

The AI industry is spending billions on building models and comparatively little on understanding whether they work. That ratio is shifting, not because the industry chose it, but because regulators, users, and reality demand it. The question is not whether adversarial evaluation will become a major discipline. It is whether there will be enough qualified people to do it when the moment arrives.

Apply Now ← Back to Blog