Every time a new capability emerges in security tooling, the same question surfaces: does this replace the people? For agentic AI applied to penetration testing, the honest answer is no - and understanding why reveals something important about where human expertise actually produces value.
What automation is genuinely good at
Agentic testing systems excel at work that is systematic, repeatable, and exhausting for humans to sustain at scale. Specifically:
- Surface coverage: An agent will test every endpoint, every parameter, every HTTP method, every role combination, without getting bored or running low on time. A human tester under a five-day engagement constraint makes prioritisation decisions that leave surface untested. An agent does not.
- Consistency: The agent applies the same test methodology to the authentication flow on day one and the file upload handler on day four. Human testers are sharper earlier in an engagement and, under time pressure, increasingly likely to shortcut late-stage testing.
- Known vulnerability patterns: IDOR testing, injection variants, common misconfigurations, session management weaknesses - these are well-understood patterns that an agent can execute thoroughly and document correctly every time.
- Evidence collection: Every request, response, and reproduction step is captured automatically with millisecond timestamps. The report builds its own evidence chain as the test runs.
These are real, meaningful capabilities. A programme that deploys agentic testing gets coverage that would otherwise require significantly more consultant time - and gets it on a cadence that matches deployment frequency rather than annual scheduling.
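To make the surface-coverage point concrete, here is a minimal sketch of how the test matrix grows. The endpoint paths, methods, and role names are hypothetical placeholders, not taken from any particular tool; a real agent would discover them while crawling the application.

```python
from itertools import product

# Hypothetical inventory, assumed for illustration only.
ENDPOINTS = ["/api/users/{id}", "/api/invoices/{id}", "/api/upload"]
METHODS = ["GET", "POST", "PUT", "DELETE"]
ROLES = ["anonymous", "user", "admin"]


def build_test_matrix(endpoints, methods, roles):
    """Return every (endpoint, method, role) combination as a list of dicts."""
    return [
        {"endpoint": e, "method": m, "role": r}
        for e, m, r in product(endpoints, methods, roles)
    ]


matrix = build_test_matrix(ENDPOINTS, METHODS, ROLES)
print(len(matrix))  # 3 endpoints * 4 methods * 3 roles = 36
```

Even this toy inventory yields 36 cases before any parameter variation is considered. The count multiplies with each added dimension, which is why a time-boxed human tester has to prioritise and an agent does not.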
What humans are genuinely good at
Human security researchers bring capabilities that current AI systems demonstrably do not have, and which matter enormously for serious security work.
Novel attack research. The most impactful vulnerabilities found in real engagements are often ones that have never been documented before. A researcher who has spent years in a specific domain - OAuth implementations, WebAssembly runtimes, GraphQL query engines, mobile API patterns - has an intuition for where the bodies are buried that no training corpus can fully replicate. The creative leap from "this looks slightly odd" to "this is a new class of vulnerability" is a human skill.
Understanding business context. A healthcare application has different risk priorities than a fintech platform or a developer tooling company. The same technical finding - say, verbose error messages that include stack traces - has different severity depending on what information those traces reveal in context. Researchers who understand the business model understand what data matters, which flows carry real financial or regulatory risk, and how an attacker would actually monetise a finding.
Social and process vulnerabilities. Phishing simulations, pretexting, physical security assessments, and the human factors side of access control sit entirely outside what an automated web testing system can evaluate. The insider threat vector, supply chain trust relationships, and development process weaknesses require human judgement.
Interpreting ambiguous signals. When an application behaves in a way that is slightly unexpected but doesn't match a known vulnerability pattern, a researcher exercises judgement about whether this is a false positive, a quirk of the tech stack, or a novel attack surface. That judgement call draws on experience that is genuinely difficult to encode.
The force multiplier model in practice
The most effective security programmes treat agentic testing and human researchers as complementary - each covering the other's weaknesses.
In practical terms, a typical engagement looks like this: the agentic system runs first, producing a comprehensive map of the application surface and an initial findings report. This takes hours rather than days. The human researcher then receives that report as their starting brief.
Instead of spending the first day and a half of a five-day engagement on reconnaissance and mapping, the researcher starts on day one with a complete picture of the application surface and all the low-hanging fruit already documented. They can immediately direct their attention to the areas where human judgement adds the most value: complex business logic, novel attack hypotheses, and areas the agent flagged as interesting but ambiguous.
The researcher's output is correspondingly deeper. Rather than splitting the engagement between basic findings the agent would have caught anyway and the more interesting work, the researcher spends all of it on the work that actually requires expertise.
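The handoff described above can be sketched as a simple triage step over the agent's findings. This is an illustrative sketch under stated assumptions: the `Finding` fields and the `build_researcher_brief` function are hypothetical names, and a real system would likely carry far richer metadata.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    confirmed: bool  # assumed flag: the agent reproduced the issue end to end
    ambiguous: bool  # assumed flag: unexpected behaviour matching no known pattern


def build_researcher_brief(findings):
    """Split agent output into documented findings and items needing human review."""
    documented = [f for f in findings if f.confirmed]
    needs_human = [f for f in findings if f.ambiguous and not f.confirmed]
    return {"already_documented": documented, "needs_human_review": needs_human}


# Hypothetical example findings.
findings = [
    Finding("IDOR on invoice endpoint", confirmed=True, ambiguous=False),
    Finding("Inconsistent 403/404 on admin routes", confirmed=False, ambiguous=True),
    Finding("Verbose stack trace in error response", confirmed=True, ambiguous=False),
]

brief = build_researcher_brief(findings)
for f in brief["needs_human_review"]:
    print(f.title)
```

The design choice worth noting is that ambiguous items are routed to the researcher rather than discarded or auto-classified, since interpreting those signals is where human judgement is needed.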
What this means for security team headcount
Agentic testing does not reduce the number of security researchers an organisation needs - it changes what those researchers spend their time on.
Teams that adopt agentic testing as part of their programme find that their researchers can cover more applications, at higher depth, in the same time. An internal red team that previously ran two thorough assessments a quarter can run six - because the systematic coverage work is handled, freeing every engagement hour for the work that requires human skill.
For organisations that use external consultants, the dynamic is similar. Engagement costs can be directed at the high-value human component rather than at consultant hours spent on basic surface mapping. The budget goes further.
The skills that matter more, not less
If agentic testing handles the systematic layer, the security researchers who thrive are those who invest in the skills that automation cannot replicate. Deep protocol knowledge. Familiarity with emerging technology stacks before they become mainstream. The ability to model attacker incentives and reason about what a sophisticated adversary would actually target. Code review capability. The communication skills to turn technical findings into clear risk statements for executive audiences.
These skills were always valuable. They become more differentiating as the baseline coverage work gets automated. The floor rises; the ceiling for what skilled researchers can accomplish rises with it.
The researchers who should be concerned are those whose value proposition was primarily breadth coverage - testing a lot of endpoints thoroughly but without particular depth. That specific function is where agentic systems produce the most direct substitution. Everything above it remains firmly in human territory.
A realistic view
Agentic security testing is a genuinely powerful addition to the field. It makes thorough coverage economically feasible at cadences that manual-only programmes cannot sustain. It reduces the consistency problems that come with time-pressured manual work. It produces better evidence and more reliable documentation.
It does not replace the researchers who bring deep expertise, creative attack thinking, and contextual judgement to hard security problems. It gives them better raw material to work with and more time to do the work that actually requires them.
That is what a force multiplier looks like.