As we ship faster with AI, the test net underneath has to keep up or speed becomes risk.
AI has changed how much code we produce and how fast it lands. But every line still has to answer the same question the moment someone touches it: did we just break something that already worked? Test coverage is the safety net that lets a team move at AI speed without quietly regressing the corners of the codebase no one remembers. The faster we move, the more that net matters.
So at Nirvana, we made it a priority to solve: not by asking engineers to write more backfill tests, but by building an autonomous agent that does it for us, inside guardrails strict enough that we trust it to run unattended.
This post is the full walkthrough: the architecture, how we prioritize what the bot works on, how it remembers what it has done (and why that state now lives in S3), what the cadence looks like, and the safety model that makes the whole thing sleep-at-night safe.
The problem we were actually solving
The hard part of coverage was never the standard. We can set a bar and enforce it in CI. The hard part is the grind of backfilling tests for code that already shipped - the unglamorous, never prioritized work that sits at the bottom of every backlog. No one wants that ticket. It competes with features, and it loses.
To be precise about where the gap actually is: new code is not the problem. Anything written today already lands with 90–98% statement (line) coverage as a matter of normal process. Review and CI hold that line, and engineers write those tests as part of shipping the feature. The debt lives in legacy code: the large body of older code that predates today's bar and never reached 100%. And with years of accumulated code, that legacy surface is enormous. Closing it by hand would mean pulling engineers off product work for a very long time to write tests for code they may not have touched in years. That math never works out, which is exactly why the gap persists everywhere.
The result is the familiar slow rot: the untested regions cluster in exactly the older, load-bearing code that's scariest to change, and every refactor through it is a little more dangerous than it should be.
This framing is what shaped the whole design. The bot isn't a substitute for engineers writing tests for their own new code. That already happens. It's a tireless worker pointed squarely at the legacy backlog, chipping away at the older, under-covered code continuously so the long tail actually gets closed instead of being perpetually deferred.
We didn't want a dashboard nagging us about it. We wanted the work to get done — continuously, quietly, and at a quality bar we'd accept from a human engineer. That last point matters as much as the coverage number itself: a percentage bumped up with hollow, assertion-free tests would be worse than useless. So test quality (meaningful assertions, real edge cases, tests we'd actually trust during a refactor) is a first-class constraint on the bot, not an afterthought. We return to exactly how we enforce that in its own section below.
What we built, in short
A scheduled agent (powered by Claude) wakes up several times a day during working hours, computes where coverage is weakest among the packages that matter most, picks one under-covered file, studies the existing tests around it to learn the team's conventions, writes focused tests for the uncovered paths, and opens a small, reviewable PR. A mechanical safety gate stands between the agent and the repository. From there, the PR enters a fully automated review-and-merge loop: a multi-model reviewer critiques it and reaches consensus, a second autonomous agent resolves whatever the reviewer flags, and once the panel agrees, the PR merges itself. End to end, with no one babysitting the queue.
It first proved itself in a smaller service repo, then we adapted the design for our main backend service. Everything below describes that production system.
The architecture

Let's walk the pipeline stage by stage.
1. Trigger and the guard job
The bot runs on a cron schedule, but it doesn't blindly do work every time it fires. The first thing each run does is consult a lightweight guard job that decides whether this run should proceed at all. The guard skips the run (cheaply, costing only a few cents) when:
- The open-PR cap is already reached (we don't want the bot flooding the review queue).
- A freeze window is active (set automatically when a scope hits full coverage, so the bot stops generating noise instead of repeatedly concluding there's nothing to do).
A skipped run posts a short Slack note explaining why it skipped, so silence is never ambiguous.
2. Baseline coverage, but only where it counts
If the guard clears, the run computes a fresh coverage baseline. In a large backend this is the expensive part, so two things matter:
- It's scoped, not exhaustive. We don't measure the whole repo every run. We measure a curated set of priority packages (more on prioritization below).
- It's parallelized. Coverage profiles are generated across packages concurrently and then summarized per profile rather than as one merged blob, which avoids a class of cross-package resolution errors that used to kill the whole step when a single transitively-referenced package couldn't be resolved.
One hard-won detail: a package is considered eligible based on whether it produced a usable coverage profile, not on whether all of its tests passed. A single pre-existing flaky integration test used to silently exclude an entire package's worth of uncovered code from every run. Decoupling eligibility from the test exit code fixed that.
3. Prioritization: picking the components that matter
This is the part most "auto-coverage" tooling gets wrong. The naive approach is to chase the lowest-coverage file in the repo. In a large codebase that's actively harmful: it spends the budget on whatever happens to be numerically lowest (often peripheral or generated code) and floods reviewers with PRs against things nobody cares about.
Instead, the bot works off a curated priority allowlist of packages, ordered deliberately:
- Revenue-path and business-critical components first. The packages where a missed edge case has real consequences sit at the top of the list. Coverage there is worth far more than coverage on a leaf utility.
- Recent incident learnings feed the ordering. When something breaks in production, the affected package's priority goes up. The list is a living reflection of where we've actually been burned.
- Tiering, not a flat list. Higher-tier packages get attention before lower-tier ones, so the budget always flows to the highest-leverage gap available on any given run.
The allowlist is also a containment boundary: the safety gate later refuses any PR that targets a package not on the list. Prioritization and safety are the same mechanism viewed from two angles: the bot can only ever work on code we've explicitly decided is worth its time.
On top of the static priority order, a per-file cooldown (14 days) prevents the bot from churning the same file repeatedly, and a source-file-level dedup check ensures two open bot PRs never target the same file, even when they live in the same package.
4. Memory: from a committed file to S3
An autonomous agent that forgets everything between runs will duplicate work, re-add helpers that already exist, and leave no audit trail. So the bot keeps cross-session memory: a record of what prior runs added, which files are in cooldown, and what scenarios and helpers already exist to be reused.
Originally this memory lived as a committed JSON file inside the repo. It worked, but it had structural downsides:
- Every bot PR carried a mutation to the memory file, adding commit noise to otherwise clean test-only diffs.
- Memory state was coupled to merge timing: it only became "true" once a PR merged, so concurrent runs could read stale state.
- Two in-flight bot PRs editing the same memory file produced merge conflicts that had nothing to do with the actual tests.
We recently moved the memory state out of git and into S3. The bot reads the canonical memory object from S3 at the start of a run and writes the updated state back at the end, independent of the git history and of whether any PR has merged. This decouples memory from merge timing, eliminates the conflict class entirely, and keeps the PR diffs purely about tests. Access uses the AWS OIDC credentials the repo's CI already had configured, so we didn't introduce a new long-lived secret to get there.
The net effect: the audit trail and dedup logic got more reliable while the PRs got cleaner.
5. The agent
Now the interesting part. With a baseline computed, an exclude list built, and memory loaded, the run hands off to the Claude agent with a tightly scoped brief: pick one eligible file and raise its coverage.
What makes the output look like ours rather than generic AI tests is that the agent studies the existing tests around the target first: the table-driven patterns, the hand-written mock style, the shared test-environment helpers, and the assertion conventions the team already trusts. It writes new tests in that idiom rather than inventing its own. Because it learns from the tests you've already written, the net it builds looks like one your team would have built itself.
The agent writes exactly one *_test.go file to the runner's local disk. It does not touch source. And, critically, it has no credential that could push anything anywhere. It runs with the harness's internal permission prompts bypassed (so it can run go test, the linter, and the build without stopping to ask), but the write-scoped token is deliberately withheld from its environment. The worst it can do is write files on a throwaway runner that gets discarded.
6. The mechanical safety gate
After the agent finishes and before anything is pushed, a mechanical gate inspects the diff and rejects it unless every hard rule holds:

This gate is the load-bearing safety boundary. Not a prompt asking the model to behave, but a deterministic check that runs regardless of what the model did. If the agent ever goes off the rails, the gate is what catches it, and the gate holds the credentials, not the agent.
7. Push, PR, and the handoff
Only once the gate passes does a separate, gated step push the branch and open the PR, using a dedicated GitHub App token scoped to that step alone (using the App, rather than the default Actions token, is also what lets downstream CI fire on the bot's PRs). The PR opens with labels that identify it and route it into automated review and the SDLC pipeline.
8. The automated SDLC loop: review, resolve, approve, merge
The bot doesn't stop at the open PR. Opening it is the start of a fully automated review-and-merge loop, powered by two systems we built in-house: Argus, an automated reviewer, and Forge, an autonomous developer.
The automated reviewer — Argus. When a PR opens, Argus reviews it across about ten dimensions (correctness, security, performance, testing, and so on) by fanning the diff out to multiple independent AI models and building consensus across them: findings several models agree on score high-confidence, lone-wolf findings lower. It posts inline comments and a verdict. Agreement across a diverse panel is a far more trustworthy signal than any single reviewer, which is what lets us trust it without a human in the loop.
The autonomous developer — Forge. Review comments only matter if someone acts on them. That's Forge: it spins up an agent in an isolated sandbox loaded with the repo and its context files, takes the review comments as its task, makes the changes, and pushes a fix commit back to the same PR, behaving like an engineer who picked up the feedback and revised.
Put together, the loop runs itself:
.avif)
So the end-to-end story for a single unit of coverage debt is: the coverage bot notices the gap and writes the tests, the automated reviewer assesses the result from ten angles across several models, the autonomous developer resolves whatever gets flagged, the reviewer re-checks the fix, and once the panel reaches consensus the PR merges, with the whole lifecycle mirrored to Slack so the team has visibility without watching the queue. A human only gets pulled in when the system itself is uncertain: a high-severity finding, or a case where the models disagree. The routine work closes itself out; people spend their attention only where judgment is actually required.
Coverage is the proxy; quality is the point
A coverage percentage is only a proxy, and the obvious failure mode of any coverage automation is a bot that games it: calling a function and asserting nothing just to turn a line green. That moves the metric and protects nothing, so we designed against it on three levels.
The agent tests behavior, not lines. Its brief is to write meaningful assertions around real inputs, error conditions, and boundaries, modeled on the existing tests it studies in the package. Because it mirrors the team's established table-driven patterns and fixtures, what it produces reads like tests a careful engineer would write, not coverage filler.
The reviewer judges the tests, not just the diff. Test quality is one of the explicit dimensions the automated reviewer scores: it flags assertion-free tests, tautologies, over-mocking, and tests that don't actually exercise the path they claim to. Weak tests get sent back to the autonomous developer to strengthen, or held for a human. A test that games the metric doesn't survive review.
Humans still own the bar. Every PR is small, single-file, and reviewable in minutes, and low-confidence cases route to a person. The goal was never a higher number on a dashboard; it's tests we'd be comfortable depending on during a refactor.
A note on what the numbers mean: the coverage figures in this post are statement (line) coverage as reported by Go's built-in tooling, not branch coverage. We treat the percentage as a signal of where to look, not the definition of done; the quality gate above is what decides whether a test is worth keeping.
What the cadence looks like
The bot is intentionally a slow, steady background process, not a burst job. It runs only during working hours on weekdays, so any PR it opens lands while reviewers are around and feedback loops stay tight:

The philosophy behind the numbers: conservative on purpose. A small number of small PRs a day, each easy to review, compounds into meaningful coverage over weeks without ever overwhelming the team or the CI fleet.
Cost
Each PR runs on a mid-tier model for a modest number of turns, so the per-PR cost is low enough that the economics are trivially justified. Skipped runs (cap reached, frozen, or nothing eligible) cost almost nothing, since they stop at the guard job. For continuously closing coverage debt on the code that matters most, it's an extraordinarily cheap engineer.
What generalizes
A few lessons held up well enough that we'd apply them to any autonomous agent, not just this one:
- Bound the problem before you automate it. The priority allowlist isn't just prioritization; it's the safety boundary. An agent let loose on "the whole repo" is far harder to trust than one that can only act inside an explicitly chosen scope.
- Make safety mechanical, not conversational. Instructions are requests; gates are guarantees. The credential lives with the gated step, never with the reasoning agent. That separation is what makes unattended operation defensible.
- Externalize state. Moving memory out of git and into S3 decoupled the bot's bookkeeping from merge timing and killed an entire class of conflicts. Durable, single-source-of-truth state is worth the small upfront cost.
- Let it learn from you. The agent reads existing tests before writing new ones, so the output matches the team's conventions. The result reads like your code because it was shaped by your code.
- Aim AI at the unglamorous middle. Agentic AI earns its keep not on flashy greenfield work, but on the steady hygiene teams perpetually defer. Point it at a well-bounded problem, wrap it in strict gates, and let it compound.
That last point is the one we keep coming back to. The exciting demos are all greenfield. The real leverage, at least for us at Nirvana, has been pointing these agents at the work that was always important and never urgent, and finally getting it done.
Building toward more of this kind of leverage is a big part of how we think about engineering at Nirvana.
%201.jpg)









