// agentic red team
· from $20/mo
Agentic auto-pentest.
With a human in the loop.
The autonomous-pentest category is mostly demo-ware that breaks on real engagements.
XBOW, PentestGPT, and the wave of YC startups pitching "AI replaces the consultant" all demo well on benchmark targets and fall over in real environments: scope is ambiguous, the network is hardened, the bug class requires understanding the application, or something outside the demo path goes sideways. The autonomous pitch optimises for the wrong axis: senior consultants are not the bottleneck it assumes. The bottleneck is the boilerplate around the consultant.
TartarusAI takes a different premise: agentic, but with the human in the loop on every escalation. Run your standard engagement playbooks (recon, enumeration, exploitation, lateral movement, post-ex) end-to-end, with the verification gate keeping the agent honest at each step and a full audit trail of every action. You stay in the loop on every destructive action, every scope-boundary edge case, every escalation that requires senior judgement. Same enterprise-grade runtime safety guards. Content policy, off.
- ✓ Zero refusals on offsec work
- ✓ Custom-tuned model, not a wrapper
- ✓ Runtime guards on by default
- ✓ 256K context, sub-2s TTFT
refund if it ever refuses · no card on file · crypto-only · cancel any time
❯ run the standard internal-network engagement playbook on 10.0.0.0/24
target: signed engagement, scope acme-corp internal
▎▣ recon: discovered 47 hosts, 312 services
▎▣ enum: found 18 candidate vulns, 4 high-severity
▎▣ exploit: 2 confirmed RCE, dropped beacons
▎+ wrote engagement_log.md (full audit trail)
done.
// what it does
What you ship
when nothing refuses.
Playbook-driven execution
Bring your existing engagement playbooks (or use the bundled standard ones); the agent executes them step by step, asks for confirmation before destructive actions, and pauses on ambiguity. Not a black box. Standard playbooks cover internal-network engagements, external perimeter assessments, web app audits, AD-focused engagements, cloud (AWS / Azure / GCP) audits, and mobile app reviews.
Full audit trail
Every command, every decision, every escalation logged with reasoning. Required for client deliverables and post-engagement review. Cuts the worst part of pentest reporting — recreating what happened during the engagement when it is time to write the deliverable two weeks later.
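One way to picture an audit trail of this kind is an append-only log in which each entry carries the digest of the entry before it, so later edits are detectable. The sketch below is illustrative only: the `AuditEntry` fields, the helper names, and the hash-chaining itself are assumptions, not a description of how TartarusAI stores its log.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditEntry:
    """One logged step. All field names are hypothetical."""
    timestamp: str   # ISO 8601 string supplied by the caller
    actor: str       # "agent" or "human"
    action: str      # what was run or decided
    reasoning: str   # why the step was taken
    prev_hash: str   # digest of the previous entry, "" for the first one


def entry_digest(entry: AuditEntry) -> str:
    """SHA-256 over a canonical JSON encoding of the entry."""
    payload = json.dumps(asdict(entry), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(log, timestamp, actor, action, reasoning):
    """Append a new entry linked to the current tail of the log."""
    prev = entry_digest(log[-1]) if log else ""
    entry = AuditEntry(timestamp, actor, action, reasoning, prev)
    log.append(entry)
    return entry


def chain_is_intact(log) -> bool:
    """Return True if every entry points at the digest of its predecessor."""
    expected = ""
    for entry in log:
        if entry.prev_hash != expected:
            return False
        expected = entry_digest(entry)
    return True
```

Because each entry records both the action and the reasoning, the same records can feed the deliverable; the chain check only guards against silent edits to earlier entries, not against removal of the most recent one.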
Multi-host orchestration
The agent holds engagement state across hosts, services, and credentials. Pivoting works as you would expect; the loop guard prevents the same dead end from being retried, and the failed-path blacklist records what did not work so the agent does not spend your engagement budget on the same approach twice.
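A failed-path blacklist can be thought of as a lookup keyed by what was tried and where. The following is a minimal sketch under that assumption; `Attempt`, `FailedPathBlacklist`, and the method names are hypothetical and do not reflect the product's internals.

```python
from typing import NamedTuple, Optional


class Attempt(NamedTuple):
    """Identifies one approach against one target (hypothetical shape)."""
    host: str
    service: str
    technique: str


class FailedPathBlacklist:
    """Remembers approaches that already failed so they are not retried."""

    def __init__(self) -> None:
        self._failed = {}  # maps Attempt -> reason string

    def record_failure(self, attempt: Attempt, reason: str) -> None:
        self._failed[attempt] = reason

    def should_skip(self, attempt: Attempt) -> bool:
        return attempt in self._failed

    def reason_for(self, attempt: Attempt) -> Optional[str]:
        return self._failed.get(attempt)
```

Keying on the full (host, service, technique) tuple means a technique that failed on one host is still eligible on another, which is one reasonable reading of "the same approach twice".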
Human-gated escalations
Anything destructive, anything outside the explicit scope, anything that crosses a boundary — pauses for confirmation. You are the senior consultant; the agent is the operator who never sleeps. Configurable confirmation thresholds (auto-approve read-only enumeration, gate any state mutation, gate any cross-host pivot, gate any credential-dump action).
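The confirmation thresholds described above amount to a policy table from action category to decision. Here is a hedged sketch of such a gate; the enum names, `DEFAULT_THRESHOLDS`, and `gate` are illustrative assumptions, with the four categories taken from the text.

```python
from enum import Enum


class ActionKind(Enum):
    READ_ONLY_ENUM = "read_only_enumeration"
    STATE_MUTATION = "state_mutation"
    CROSS_HOST_PIVOT = "cross_host_pivot"
    CREDENTIAL_DUMP = "credential_dump"


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    REQUIRE_HUMAN = "require_human"


# Hypothetical defaults mirroring the examples in the text above.
DEFAULT_THRESHOLDS = {
    ActionKind.READ_ONLY_ENUM: Decision.AUTO_APPROVE,
    ActionKind.STATE_MUTATION: Decision.REQUIRE_HUMAN,
    ActionKind.CROSS_HOST_PIVOT: Decision.REQUIRE_HUMAN,
    ActionKind.CREDENTIAL_DUMP: Decision.REQUIRE_HUMAN,
}


def gate(kind, in_scope, thresholds=DEFAULT_THRESHOLDS):
    """Decide whether an action may proceed without a human.

    Out-of-scope targets always pause, and any category missing from
    the table falls back to requiring confirmation.
    """
    if not in_scope:
        return Decision.REQUIRE_HUMAN
    return thresholds.get(kind, Decision.REQUIRE_HUMAN)
```

Defaulting unknown categories to `REQUIRE_HUMAN` keeps the gate fail-closed: a misconfigured or incomplete table pauses the agent rather than letting it proceed.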
OPSEC-aware execution
Configurable rate limits, jitter on requests, sleep mask between operations, traffic shaping to avoid volumetric detection. The agent respects your OPSEC parameters; it does not blast 10K requests per second at the target network the way a naive scanner would.
Live engagement dashboard
Real-time view of what the agent is doing, what it has found, what is pending confirmation, what has been escalated. For consultancies running multiple concurrent engagements, the dashboard is the operations-team interface to the agent. For solo consultants, it is the safety net that lets you walk away from the keyboard for a meeting and come back to a meaningful checkpoint.
// philosophy
Why fully autonomous is the wrong default
The autonomous-pentest pitch sounds compelling: drop in the scope, get out the report, no human in the loop. The problem is that mistakes on a pentest engagement have legal and contractual consequences. An autonomous tool that hits an out-of-scope asset has just exposed your client (and you) to liability. An autonomous tool that misclassifies a finding has just put a false positive in your deliverable that the client's engineering team has to triage.
In practice, the autonomous tools that have shipped to date fall into two failure modes. The conservative ones under-execute: they stop on every ambiguity and require so much human input that the value-add over a senior consultant and their existing tooling is negative. The aggressive ones over-execute: they hit out-of-scope assets, exfiltrate data they should not have touched, or silently fail in ways that show up in the post-engagement client review as "this finding is wrong."
TartarusAI takes the position that human-in-the-loop is the right default for billed engagements. The agent does the boilerplate at machine speed. The senior consultant makes the judgement calls that have legal and contractual consequences. The audit trail captures both layers so the deliverable is defensible in the post-engagement review.
// workflow
A typical auto-pentest run
Pre-engagement: you load the playbook (standard or custom), define the scope (in-scope hosts, out-of-scope hosts and paths, time windows for activity, OPSEC parameters), and configure the confirmation thresholds (which actions auto-approve, which require human confirmation, which are categorically gated). The engagement parameters live as structured data you commit to your engagement repo.
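As a rough illustration of scope kept as structured data, the sketch below models in-scope and out-of-scope networks plus a daily activity window, and checks a host or time against them. The `Scope` class and its field names are assumptions made for this example; the actual engagement file format is not specified here.

```python
import ipaddress
from dataclasses import dataclass, field
from datetime import time


@dataclass
class Scope:
    """Hypothetical engagement scope: CIDR lists and a daily time window."""
    in_scope: list                                   # CIDR strings
    out_of_scope: list = field(default_factory=list)  # CIDR strings
    window_start: time = time(0, 0)
    window_end: time = time(23, 59, 59)

    def allows_host(self, host: str) -> bool:
        """Exclusions win over inclusions; unknown hosts are denied."""
        addr = ipaddress.ip_address(host)
        if any(addr in ipaddress.ip_network(n) for n in self.out_of_scope):
            return False
        return any(addr in ipaddress.ip_network(n) for n in self.in_scope)

    def allows_time(self, now: time) -> bool:
        """Same-day windows only; overnight windows are not handled."""
        return self.window_start <= now <= self.window_end
```

A usage example under the same assumptions:

```python
scope = Scope(["10.0.0.0/24"], ["10.0.0.5/32"], time(9, 0), time(17, 0))
scope.allows_host("10.0.0.7")   # in the /24 and not excluded
scope.allows_host("10.0.0.5")   # explicitly excluded
```

Checking exclusions first and denying anything not listed keeps the check fail-closed, which matches the liability concern that motivates gating out-of-scope actions.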
During the engagement: the agent executes the playbook step-by-step. Recon and enumeration tend to run unattended (low risk, high boilerplate). Exploitation and post-exploitation tend to pause for human confirmation at every escalation (high risk, requires judgement). The live dashboard shows you what is running, what is pending, what has been found.
Post-engagement: the audit trail becomes the technical-deliverable source data. The agent ghostwrites the client report from the audit trail plus your annotated screenshots and notes. CVSS scoring per finding, severity prioritisation, executive summary, technical deep-dive, remediation suggestions calibrated to the client environment. Deliverable turnaround drops from days to hours without sacrificing report quality.
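For the severity step, CVSS v3.x defines qualitative bands for base scores (None 0.0, Low 0.1 to 3.9, Medium 4.0 to 6.9, High 7.0 to 8.9, Critical 9.0 to 10.0). The sketch below maps a score to its band and orders findings by score; the function names and the `(title, score)` tuple shape are assumptions for illustration, not the report generator's actual interface.

```python
def cvss_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS base scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


def prioritise(findings):
    """Order (title, score) pairs from highest to lowest score."""
    return sorted(findings, key=lambda f: f[1], reverse=True)
```

Ordering by base score alone is a simplification; calibrating remediation to the client environment, as described above, would need context that a single number does not carry.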
// comparison
Versus the autonomous-pentest category
XBOW: heavily marketed, well-funded, demos well on standardised benchmarks, struggles on real engagements that deviate from the demo path. Best for organisations whose pentest needs map cleanly to standardised benchmarks (typically cloud / web app audits with predictable structure).
PentestGPT and the open-source LLM-pentest projects: useful as research, generally not production-ready for billed engagements. Quality varies wildly with the underlying LLM and the prompt engineering.
The wave of YC startups pitching AI-replaces-the-consultant: most are at the seed stage with small engineering teams, limited operating history, and a thesis that has not survived contact with real engagements. Worth watching but not yet worth replacing your senior consultant with.
TartarusAI Auto-Pentest: agentic, but with the human in the loop on every escalation. Targets the consultancy and senior-individual segment that wants the agent to absorb boilerplate, not replace judgement. Different bet, different defaults.
// questions
What people actually ask.
Is this fully autonomous like XBOW or PentestGPT?
Will it stop before doing something destructive?
Can I use it on client engagements?
Does it produce a usable report at the end?
How does the playbook system work?
Can it run multiple concurrent engagements?
What about cloud engagements specifically?
How does it handle authorisation boundaries dynamically?
// ready
Stop fighting refusals.
Start shipping the engagement.
One tier covers most engagements at $20/month. If the agent ever refuses, hedges, or returns neutered output on legitimate engagement work, we refund — see the refund policy.
refund if it ever refuses · no card on file · crypto-only