Parallel Code Audits with Agent Teams — Five Opuses Arguing at Once
A hands-on account of using Claude Code's Agent Teams to run a five-agent parallel audit across the whole project. Role separation, the false-positive retraction workflow, and a few surprising findings.
Opening — Hitting a Wall as the Project Grew
This blog is developed entirely solo. In the early days — a handful of pages and two or three games — I could keep the whole codebase in my head. Over time that broke down: containers, hooks, and shared utilities piled up, and subtle side effects of the same patterns scattered across files. Eventually I hit the point of "I should really audit the whole thing".
So on April 10, 2026, I started a paid Claude subscription. The reasoning was straightforward:
- The site had grown noticeably (file count, feature count, external API integrations)
- Keeping solo tabs on codebase-wide consistency, security, and performance had hit its limit
- Large-scale refactors and optimizations needed a reliable "reviewer AI"
For a few days after subscribing I used Claude Code's everyday features — file editing, testing, refactoring — to get comfortable. Then I noticed Agent Teams: a setup where multiple AI agents run in independent parallel sessions and can message each other. The thought arrived naturally: "what if I ran five Opus agents in parallel and had them audit the whole project?" I ran it, and the findings exceeded expectations.
This is the write-up.
What Is Agent Teams?
An experimental Claude Code feature where multiple agents run concurrently in independent sessions and can message each other. It's more than parallel invocation:
- Each agent has its own context
- A shared task list distributes work
- SendMessage enables agent-to-agent debate / collaboration
- The leader (main session) makes final decisions and applies fixes
"One AI with several hats" and "several AIs with distinct roles arguing with each other" produce very different outputs. The latter yields surprisingly thorough audits.
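The four bullets above can be pictured with a small sketch. This is my own model of the pattern — hypothetical types, not the actual Agent Teams API:

```typescript
// Hypothetical model of the Agent Teams pattern — NOT the real Claude Code API.
type TaskStatus = "open" | "claimed" | "done";

interface Task {
  id: string;     // e.g. "B2-01"
  title: string;
  status: TaskStatus;
  owner?: string;
}

interface Message {
  from: string;
  to: string;
  body: string;
}

// Shared task list plus a message channel: each agent keeps its own context
// and only coordinates through these two structures.
class TaskBoard {
  private tasks: Task[] = [];
  private inbox: Message[] = [];

  add(id: string, title: string): void {
    this.tasks.push({ id, title, status: "open" });
  }

  // Any agent can claim the next open task — this is how work gets distributed.
  claim(agent: string): Task | undefined {
    const task = this.tasks.find((t) => t.status === "open");
    if (task) {
      task.status = "claimed";
      task.owner = agent;
    }
    return task;
  }

  // Stand-in for SendMessage: agent-to-agent debate flows through here.
  send(from: string, to: string, body: string): void {
    this.inbox.push({ from, to, body });
  }

  messagesFor(agent: string): Message[] {
    return this.inbox.filter((m) => m.to === agent);
  }
}
```

The leader sits outside this loop: it reads the board, breaks ties, and applies fixes.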
Team Composition — 5 Roles
The team I assembled:
| Role | Agent Type | Area |
|---|---|---|
| Frontend auditor | yujh-auditor-frontend | components / views / routing / styles / a11y |
| Core logic auditor | yujh-auditor-core | business logic / state machines / pure functions |
| Data auditor | yujh-auditor-data | API / models / storage / caching / security |
| Infra auditor | yujh-auditor-infra | build tools / deps / CI / manifest |
| Verifier | yujh-auditor-verifier | disproves other auditors' findings |
The key piece is the Verifier. When the four auditors raise 🔴 issues, the verifier independently checks "is this really a problem?". Its default assumption is "every 🔴 might be a false positive". A devil's advocate that doesn't simply trust auditor claims.
The Actual Audit Flow
1. Leader Preflight
The main session (leader) surveys the project and builds a project profile — framework, build tooling, rules files, etc. This profile is passed to each teammate at spawn so they know "you're auditing this project".
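The profile itself can be a small record. The field names below are my own illustration, not a documented format; the example values mirror this project (`CLAUDE.md` as the rules file is an assumption):

```typescript
// Hypothetical shape of the leader's preflight output — not a prescribed format.
interface ProjectProfile {
  framework: string;     // detected from package.json / config files
  buildTool: string;
  rulesFiles: string[];  // convention docs the auditors must judge against
}

// What a preflight over a Next.js project might produce:
const profile: ProjectProfile = {
  framework: "Next.js",
  buildTool: "next build",
  rulesFiles: ["CLAUDE.md", ".eslintrc.json"],
};
```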
2. Parallel Audit Starts
Each auditor walks its domain in its own session, raising 🔴 issues as tasks. Example titles:
```text
B2-01 [data]     NASA API key exposed in client bundle
B2-02 [data]     visit_log lacks spam prevention
D-03  [infra]    GitLab Variables Protected flag not verified
A-08  [frontend] Cosmic Barrage resize debounce 400ms lacks comment
B1-07 [core]     useCosmicBarrageAudio missing visibilitychange handler
...
```
3. Verifier Disproves
The verifier claims each 🔴 task, re-reads the actual code, and renders one of three verdicts:
- ✅ Valid — confirmed; needs a fix
- ❌ False positive — rebut via SendMessage to the auditor
- ⚠️ Partial — partly right; propose downgrade to 🟡
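The three verdicts map cleanly onto severity changes. A sketch of that mapping (my own modeling, not Agent Teams machinery):

```typescript
// Each verifier verdict determines what happens to the issue's severity.
type Verdict = "valid" | "false-positive" | "partial";
type Severity = "🔴" | "🟡" | "retracted";

function applyVerdict(verdict: Verdict): Severity {
  switch (verdict) {
    case "valid":          return "🔴";        // confirmed; keep for fixing
    case "false-positive": return "retracted"; // rebutted via SendMessage
    case "partial":        return "🟡";        // downgraded to backlog
  }
}
```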
4. Cyclical Debate → Leader Escalation
If the auditor and verifier go three round trips without agreement, the verifier escalates to the leader. The leader reads the files directly and makes the final call.
Memorable Findings
Real ones from the actual audits:
🔴 → ✅ Confirmed
- NASA API key exposed in the client bundle (B2-01) — the 40-char key was readable in the deployed bundle. Switched to edge injection via CloudFront Function
- Guestbook lacked rate limiting — only 30-second session cooldown; swapping sessions bypassed it. Added daily and per-post caps via triggers
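The post doesn't show the B2-01 fix, but edge injection with a CloudFront Function typically looks like the sketch below. The key value is a placeholder; `api_key` is the query parameter NASA's API expects. The real function runs as plain JavaScript at the edge — the point is that the key lives in the edge function, never in the client bundle:

```typescript
// Sketch of API-key injection in a CloudFront Function (viewer-request trigger).
// The secret stays server-side; the browser requests the path without a key.
function handler(event: {
  request: { querystring: Record<string, { value: string }> };
}) {
  const request = event.request;
  // Append the key at the edge before forwarding to the origin.
  request.querystring["api_key"] = { value: "NASA_KEY_PLACEHOLDER" };
  return request;
}
```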
🔴 → ❌ Retracted
- "Exoplanet API rewrite doesn't work in prod" — the data agent raised this, citing that `rewrites()` in `next.config` is dev-only. The leader checked: the production CloudFront has a behavior-based proxy for that path. False positive — it couldn't be verified from files alone, since the proxy is infrastructure outside the repo
- "`useMemo(() => getUserId(), [])` re-runs every render — perf issue" — the verifier's initial read. In reality the SSR-computed `null` gets cached permanently — a real bug, so the auditor's rebuttal kept it as 🔴
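That `useMemo` finding is easy to reproduce in miniature. Below is a framework-free stand-in for `useMemo(fn, [])` — not the blog's actual hook — showing why an empty dependency array permanently caches the `null` computed before the ID exists:

```typescript
// Stand-in for useMemo(fn, []): compute on first call, cache forever.
function memoOnce<T>(fn: () => T): () => T {
  let computed = false;
  let value: T | undefined;
  return () => {
    if (!computed) {
      value = fn();
      computed = true;
    }
    return value as T;
  };
}

// During SSR the user ID isn't available yet, so the first computation sees null.
let storedId: string | null = null;
const cachedUserId = memoOnce(() => storedId);

cachedUserId();        // null — computed before the ID exists
storedId = "user-42";  // ID becomes available later
cachedUserId();        // still null — the empty-deps cache never recomputes
```

The fix direction is to recompute once the ID actually becomes available (e.g. state set in an effect) rather than memoizing with empty deps.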
🔴 → 🟡 Downgraded
- Many "optimization opportunity" issues were judged "acceptable as-is for now" and downgraded to 🟡 (backlog)
Real Numbers
Results across two runs:
| Metric | Value |
|---|---|
| Total 🔴 raised | 30+ |
| Actually fixed | 8 |
| False positives retracted | 4 |
| 🟡 downgrades | 6 |
| POST-AUDIT pending | 2 |
The "4 false positives" number matters. If only the auditors had run (no verifier), those 4 likely would have been "fixed" — breaking actually-fine code and introducing new bugs. The devil's-advocate role paid off concretely.
Limitations — Know Before You Use
Agent Teams is powerful but has clear constraints.
No Session Resume
Create a team, pause it, and you can't pick it up later. In-process teammates don't restore via `/resume` or `/rewind`. Design for one-shot runs.
One Team Per Session
Can't have multiple teams simultaneously. To re-invoke, the previous team must be explicitly cleaned up (`TeamDelete`).
Shutdown Delay
Even after sending `shutdown_request`, teammates only terminate after their current turn. Can take minutes.
Cost
Five Opuses in parallel isn't cheap. Weekly regular audits would be excessive — it fits "event-based" usage better: pre-release deep audits, post-large-refactor inspection.
Usage Tips
Pass Project Profile Explicitly
Include the leader-detected info (framework / build / rules files) in the spawn prompt. Otherwise the auditor can't judge against the project's conventions.
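Concretely, folding the profile into each spawn prompt can be as simple as a template. The wording and field names here are illustrative (and `CLAUDE.md` is an assumed rules file), not a prescribed format:

```typescript
// Sketch: embed the leader-detected profile into a teammate's spawn prompt.
function spawnPrompt(role: string, area: string): string {
  const profile = {
    framework: "Next.js",
    buildTool: "next build",
    rulesFiles: ["CLAUDE.md"],
  };
  return [
    `You are the ${role}. Your audit area: ${area}.`,
    `Project framework: ${profile.framework} (build: ${profile.buildTool}).`,
    `Judge every finding against the conventions in: ${profile.rulesFiles.join(", ")}.`,
  ].join("\n");
}
```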
Aggressive Verifier Prompt
Strongly state "Verifier must suspect every 🔴 as a false positive". Otherwise verifiers tend to accept auditor claims verbatim.
Account for Out-of-Repo Infrastructure
3 of the 4 false positives this time involved CloudFront settings outside the repo. Auditors only see files, so assumptions about CloudFront / nginx / external API config must be re-verified by the leader.
Retrospective
Running audits via Agent Teams made "multiple AIs collaborating" feel concrete. The false-positive retraction workflow was the standout — findings that would have been silently fixed by a solo auditor got rebutted and retracted after the verifier checked actual infra.
The biggest value: even a solo project can get peer-review-like pressure. Especially powerful for global infrastructure and security issues.
If you're about to ship a big release, or need to sanity-check after a large refactor, it's worth trying. There's cost, but it's smaller than finding that one missed issue in production.