
// agentic red team

· from $20/mo

Agentic auto-pentest.
With a human in the loop.

The autonomous-pentest category is mostly demo-ware that breaks on real engagements.

XBOW, PentestGPT, and the wave of YC startups pitching "AI replaces the consultant" all demo well on benchmark targets and fall over in real environments: where scope is ambiguous, the network is hardened, the bug class requires understanding the application, or anything outside the demo path goes sideways. The autonomous pitch optimises for the wrong axis. Senior consultants are not the bottleneck the pitch assumes; the bottleneck is the boilerplate around the consultant.

TartarusAI takes a different premise: agentic, but with the human in the loop on every escalation. Run your standard engagement playbooks (recon, enumeration, exploitation, lateral movement, post-ex) end-to-end, with the verification gate keeping the agent honest at each step and a full audit trail of every action. You stay in the loop on every destructive action, every scope-boundary edge case, and every escalation that requires senior judgement. Same enterprise-grade runtime safety guards. Content policy, off.

  • Zero refusals on offsec work
  • Custom-tuned model — not a wrapper
  • Runtime guards on by default
  • 256K context, sub-2s TTFT

refund if it ever refuses · no card on file · crypto-only · cancel any time

auto-pentest run · live
❯ run the standard internal-network engagement playbook on 10.0.0.0/24
   target: signed engagement, scope acme-corp internal
  ▎▣ recon: discovered 47 hosts, 312 services
  ▎▣ enum:  found 18 candidate vulns, 4 high-severity
  ▎▣ exploit: 2 confirmed RCE, dropped beacons
  ▎+ wrote engagement_log.md (full audit trail)
done.
256K context · sub-2s TTFT · MoE 30B / 3B-active

// what it does

What you ship
when nothing refuses.

Playbook-driven execution

Bring your existing engagement playbooks or use the bundled standard ones. The agent executes step by step, asks for confirmation before destructive actions, and pauses on ambiguity. Not a black box. Standard playbooks cover internal-network engagements, external perimeter assessments, web app audits, AD-focused engagements, cloud (AWS / Azure / GCP) audits, and mobile app reviews.

Full audit trail

Every command, every decision, every escalation logged with reasoning. Required for client deliverables and post-engagement review. Cuts the worst part of pentest reporting — recreating what happened during the engagement when it is time to write the deliverable two weeks later.
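As a rough illustration of what "logged with reasoning" could look like, here is a minimal sketch of an append-only audit record serialised as one JSON object per line. The `AuditEntry` fields and the `append_entry` helper are assumptions made for this example, not the product's actual log format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class AuditEntry:
    """One logged agent action. Field names are illustrative assumptions."""
    step: str        # playbook step, e.g. "recon"
    command: str     # what the agent did
    reasoning: str   # why the agent chose this action
    escalated: bool  # True if the action paused for human confirmation
    timestamp: str = ""  # ISO 8601; filled in at write time when empty

    def to_json_line(self) -> str:
        record = asdict(self)
        if not record["timestamp"]:
            record["timestamp"] = datetime.now(timezone.utc).isoformat()
        return json.dumps(record, sort_keys=True)


def append_entry(log_lines: list, entry: AuditEntry) -> None:
    """Append-only: entries are added to the end and never rewritten."""
    log_lines.append(entry.to_json_line())
```

Keeping each entry self-describing means the report can later be assembled by replaying the log in order rather than from memory.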

Multi-host orchestration

The agent holds engagement state across hosts, services, and credentials. Pivoting works as you would expect; the loop guard prevents the same dead-end from being retried; the failed-path blacklist records what did not work so the agent does not waste your engagement budget on the same approach twice.
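The loop guard and failed-path blacklist can be pictured as a small lookup keyed on what was tried and where. The sketch below is one possible shape, assuming a (host, technique) key and a configurable attempt limit; the class and method names are hypothetical.

```python
class FailedPathBlacklist:
    """Remembers (host, technique) pairs that already failed.

    Hypothetical structure for illustration only.
    """

    def __init__(self, max_attempts: int = 1):
        self.max_attempts = max_attempts
        self._failures = {}  # (host, technique) -> failure count

    def record_failure(self, host: str, technique: str) -> None:
        key = (host, technique)
        self._failures[key] = self._failures.get(key, 0) + 1

    def should_skip(self, host: str, technique: str) -> bool:
        """True once a path has failed max_attempts times."""
        return self._failures.get((host, technique), 0) >= self.max_attempts
```

A planner would call `should_skip` before scheduling a step and `record_failure` when a step ends without a result, so the same dead end is not retried.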

Human-gated escalations

Anything destructive, anything outside the explicit scope, anything that crosses a boundary — pauses for confirmation. You are the senior consultant; the agent is the operator who never sleeps. Configurable confirmation thresholds (auto-approve read-only enumeration, gate any state mutation, gate any cross-host pivot, gate any credential-dump action).
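The four thresholds named above map naturally onto a small policy table. The following is a sketch under assumed names (`ActionKind`, `DEFAULT_THRESHOLDS`, `is_approved`); it fails closed, so any action kind missing from the table requires confirmation.

```python
from enum import Enum


class ActionKind(Enum):
    READ_ONLY_ENUM = "read_only_enumeration"
    STATE_MUTATION = "state_mutation"
    CROSS_HOST_PIVOT = "cross_host_pivot"
    CREDENTIAL_DUMP = "credential_dump"


AUTO_APPROVE = "auto_approve"
REQUIRE_CONFIRMATION = "require_confirmation"

# Mirrors the defaults described in the text; the names are assumptions.
DEFAULT_THRESHOLDS = {
    ActionKind.READ_ONLY_ENUM: AUTO_APPROVE,
    ActionKind.STATE_MUTATION: REQUIRE_CONFIRMATION,
    ActionKind.CROSS_HOST_PIVOT: REQUIRE_CONFIRMATION,
    ActionKind.CREDENTIAL_DUMP: REQUIRE_CONFIRMATION,
}


def is_approved(kind, confirm, thresholds=None):
    """Return True if the action may proceed.

    `confirm` is a callable that asks the human and returns True or False.
    Kinds missing from the table fall back to requiring confirmation.
    """
    table = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    policy = table.get(kind, REQUIRE_CONFIRMATION)
    if policy == AUTO_APPROVE:
        return True
    return bool(confirm(kind))
```

Passing the human prompt in as a callable keeps the policy testable without a live operator.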

OPSEC-aware execution

Configurable rate limits, jitter on requests, sleep mask between operations, traffic shaping to avoid volumetric detection. The agent respects your OPSEC parameters; it does not blast 10K requests per second at the target network the way a naive scanner would.
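To make "rate limits with jitter" concrete, here is a minimal sketch of how a per-request delay could be derived from a configured ceiling. The function name and parameters are assumptions for illustration, not documented configuration keys.

```python
import random


def next_delay(requests_per_second, jitter_fraction, rng=None):
    """Seconds to wait before the next request.

    The base interval is 1 / requests_per_second, varied uniformly by
    +/- jitter_fraction so requests are not sent at a fixed cadence.
    """
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    if not 0 <= jitter_fraction < 1:
        raise ValueError("jitter_fraction must be in [0, 1)")
    rng = rng or random.Random()
    base = 1.0 / requests_per_second
    return base * (1.0 + rng.uniform(-jitter_fraction, jitter_fraction))
```

With a ceiling of 2 requests per second and 20% jitter, every delay falls between 0.4 and 0.6 seconds.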

Live engagement dashboard

Real-time view of what the agent is doing, what it has found, what is pending confirmation, what has been escalated. For consultancies running multiple concurrent engagements, the dashboard is the operations-team interface to the agent. For solo consultants, it is the safety net that lets you walk away from the keyboard for a meeting and come back to a meaningful checkpoint.

// philosophy

Why fully autonomous is the wrong default

The autonomous-pentest pitch sounds compelling: drop in the scope, get out the report, no human in the loop. The problem is that pentest engagements have legal and contractual consequences for mistakes. An autonomous tool that hits an out-of-scope asset has just exposed your client (and you) to liability. An autonomous tool that misclassifies a finding has just produced a false-positive in your deliverable that the client engineering team has to triage.

In practice, the autonomous tools that have shipped to date fall into two failure modes. The conservative ones under-execute: they stop on every ambiguity, requiring so much human input that the value-add over a senior consultant with their existing tooling is negative. The aggressive ones over-execute: they hit out-of-scope assets, exfiltrate data they should not have touched, or silently fail in ways that show up in the post-engagement client review as "this finding is wrong."

TartarusAI takes the position that human-in-the-loop is the right default for billed engagements. The agent does the boilerplate at machine speed. The senior consultant makes the judgement calls that have legal and contractual consequences. The audit trail captures both layers so the deliverable is defensible in the post-engagement review.

// workflow

A typical auto-pentest run

Pre-engagement: you load the playbook (standard or custom), define the scope (in-scope hosts, out-of-scope hosts and paths, time windows for activity, OPSEC parameters), and configure the confirmation thresholds (which actions auto-approve, which require human confirmation, which are categorically gated). The engagement parameters live as structured data you commit to your engagement repo.
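As an illustration of engagement parameters kept as structured data, the sketch below loads a JSON document and checks whether a given moment falls inside the agreed activity window. Every key name here (`in_scope`, `activity_window`, `thresholds`, and so on) is an assumption for the example; the actual schema is not given in this page.

```python
import json
from datetime import datetime, time

# Hypothetical engagement file; key names are illustrative assumptions.
ENGAGEMENT_JSON = """
{
  "playbook": "internal-network",
  "in_scope": ["10.0.0.0/24"],
  "out_of_scope": ["10.0.0.5/32"],
  "activity_window": {"start": "22:00", "end": "06:00"},
  "thresholds": {
    "read_only_enumeration": "auto_approve",
    "state_mutation": "require_confirmation"
  }
}
"""


def in_activity_window(params, now):
    """True if `now` falls inside the window, including windows that wrap midnight."""
    start = time.fromisoformat(params["activity_window"]["start"])
    end = time.fromisoformat(params["activity_window"]["end"])
    current = now.time()
    if start <= end:
        return start <= current < end
    # Window wraps past midnight, e.g. 22:00 to 06:00.
    return current >= start or current < end


params = json.loads(ENGAGEMENT_JSON)
```

Because the file is plain data, it can be reviewed and versioned in the engagement repo like any other artefact.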

During the engagement: the agent executes the playbook step-by-step. Recon and enumeration tend to run unattended (low risk, high boilerplate). Exploitation and post-exploitation tend to pause for human confirmation at every escalation (high risk, requires judgement). The live dashboard shows you what is running, what is pending, what has been found.

Post-engagement: the audit trail becomes the technical-deliverable source data. The agent ghostwrites the client report from the audit trail plus your annotated screenshots and notes. CVSS scoring per finding, severity prioritisation, executive summary, technical deep-dive, remediation suggestions calibrated to the client environment. Deliverable turnaround drops from days to hours without sacrificing report quality.
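For the scoring and prioritisation step, a minimal sketch: it maps a CVSS v3.x base score to its qualitative severity rating and orders findings from highest to lowest score. The finding dictionaries and the `prioritise` helper are assumptions for illustration; how the product computes scores is not described here.

```python
def cvss_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS base scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


def prioritise(findings):
    """Return findings sorted by CVSS score, highest first."""
    return sorted(findings, key=lambda f: f["cvss"], reverse=True)
```

For example, a finding scored 9.8 is rated Critical and sorts ahead of one scored 5.3, which is rated Medium.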

// comparison

Versus the autonomous-pentest category

XBOW: heavily marketed, well-funded, demos well on standardised benchmarks, struggles on real engagements that deviate from the demo path. Best for organisations whose pentest needs map cleanly to standardised benchmarks (typically cloud / web app audits with predictable structure).

PentestGPT and the open-source LLM-pentest projects: useful as research, generally not production-ready for billed engagements. Quality varies wildly with the underlying LLM and the prompt engineering.

The wave of YC startups pitching AI-replaces-the-consultant: most are at the seed stage with small engineering teams, limited operating history, and a thesis that has not survived contact with real engagements. Worth watching but not yet worth replacing your senior consultant with.

TartarusAI Auto-Pentest: agentic, but with the human in the loop on every escalation. Targets the consultancy and senior-individual segment that wants the agent to absorb boilerplate, not replace judgement. Different bet, different defaults.

// guards

verification gate · read-before-overwrite · loop guard · failed-path blacklist · moderation off

// questions

What people actually ask.

Is this fully autonomous like XBOW or PentestGPT?
No, deliberately. Fully autonomous pentest tools either over-execute (hit out-of-scope assets) or under-execute (paralyzed by ambiguity). Human-in-the-loop is the right pattern for billed engagements where mistakes have legal consequences.
Will it stop before doing something destructive?
Yes. Anything that mutates state beyond read-only enumeration requires explicit confirmation. The verification gate checks each step's artifacts before declaring success. Loop guards prevent the same dead end from being retried.
Can I use it on client engagements?
Yes. Same trust model as Cobalt Strike, Mythic, or any other commercial offensive tool: authorization sits on you and the engagement scope. Enterprise tier ships with an NDA and per-engagement workspace isolation.
Does it produce a usable report at the end?
Yes. Audit trail → client-ready report. CVSS scoring, evidence collection, remediation suggestions, executive summary. Cuts pentest report turnaround from days to hours.
How does the playbook system work?
Playbooks are structured YAML / JSON files describing engagement steps, confirmation thresholds per step, and OPSEC parameters. Standard playbooks ship with the product for the common engagement types. Custom playbooks are easy to write — most consultancies extend the standard playbooks with their internal conventions.
Can it run multiple concurrent engagements?
Yes. Pro+ and Enterprise tiers support concurrent engagements with per-engagement workspace isolation. The live dashboard lets ops teams monitor multiple engagements without cross-contaminating their state.
What about cloud engagements specifically?
Standard cloud playbooks for AWS, Azure, GCP. IAM enumeration, role-assumption chains, S3 / blob exposure analysis, metadata service abuse research, Lambda / Functions abuse paths, container-escape research. Pacu / ScoutSuite / CloudSploit integration for the boilerplate scanning.
How does it handle authorisation boundaries dynamically?
The scope file is authoritative. Anything outside scope is hard-blocked. Anything that touches a defined boundary (e.g., a tenant boundary in a multi-tenant audit) requires confirmation. The audit trail records every boundary check, so the deliverable shows you respected the engagement parameters.
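The "hard-blocked outside scope" rule can be sketched as a default-deny check against the scope lists. This is a minimal example using Python's standard `ipaddress` module; the `check_scope` function and its return values are assumptions for illustration, and explicit exclusions take precedence over inclusions.

```python
import ipaddress


def check_scope(target, in_scope, out_of_scope):
    """Return "allowed" or "blocked" for a target IP address.

    Default-deny: a target is allowed only if it sits inside an in-scope
    network and is not covered by any out-of-scope network.
    """
    addr = ipaddress.ip_address(target)
    if any(addr in ipaddress.ip_network(net) for net in out_of_scope):
        return "blocked"
    if any(addr in ipaddress.ip_network(net) for net in in_scope):
        return "allowed"
    return "blocked"
```

Checking exclusions first means a host carved out of a larger in-scope range stays blocked, and anything not listed at all is blocked by default.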

// ready

Stop fighting refusals.
Start shipping the engagement.

One tier covers most engagements at $20/month. If the agent ever refuses, hedges, or returns neutered output on legitimate engagement work, we refund — see the refund policy.
