5am last week. A Telegram alert woke me up: a new Newton customer had signed up, but their server didn't come up. Two hours earlier, the previous customer provisioned fine. Same script. Same snapshot. No code change. The provision pipeline just died silently in the middle and never finished.
This is the kind of bug every non-developer SaaS owner dreads. It only hits some customers. You can't reproduce it locally. The error log is cryptic. And the customer who got hit just refunds and quietly leaves.
I didn't sit down and start digging. I opened a terminal, typed one sentence to my AI agent, and went back to sleep.
The Setup: Customer 18 Broke, Customer 17 Was Fine
When someone signs up for Newton, the system calls Hetzner to spin up a VPS, restores from a snapshot, then runs a provisioning script that installs Tim Chat, configures Claude Code, writes a .env file with the customer's credentials, and starts the newton.service daemon. Then a welcome email goes out with the login URL and password. The whole pipeline takes about 2 minutes.
17 customers had been through this pipeline. All 17 worked. Server boots, service starts, customer logs in. Done.
Customer 18 — the script ran for a while and then died. set -e caught a non-zero exit somewhere mid-run, abort fired, and .env was never created. newton.service tried to start without its config, crashed, restarted, crashed again. By the time I saw the alert, systemd had crash-looped over a thousand times.
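For anyone who wants to see that state for themselves, stock systemd tooling is enough. The unit name below is the one from this post; the commands are standard systemctl/journalctl:

```bash
systemctl status newton.service                  # unit stuck in "activating (auto-restart)" / "failed"
journalctl -u newton.service -n 50 --no-pager    # the same startup crash, repeated over and over
systemctl show newton.service -p NRestarts       # how many times systemd has restarted the unit
```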
I Handed It to My AI With One Sentence
I had no idea what the bug was. I literally told the AI: "New customer didn't provision. Old one did. Go look."
What happened next is the part I think most people miss when they hear "AI agent." This wasn't ChatGPT trying to guess from a description. It was an agent with full SSH access to the production servers, the ability to read every log file, the ability to compare two systems side by side.
It picked the two closest provisions in time:
- Server 17 (customer Anupap) — provisioned at 03:23 UTC, exit 0, healthy.
- Server 18 (the new customer) — provisioned at 05:00 UTC, exit 9, dead.
It pulled both provision logs and diffed them line by line. Most of the diff was noise — different hostnames, different IPs, different timestamps. But one diff was meaningful:
The random password generated for customer 18 started with a dash: -BATDNqXyz...
That was it. One character.
How One Dash Killed an Entire Pipeline
This is where the bug gets interesting.
Inside the provisioning script, there's a line that hashes the password before writing it to the database. It looks something like:
node -e "...hash script..." "$PASSWORD"
The intent is obvious: pass the password as a positional argument to the Node script. The Node script picks it up from process.argv[1], hashes it, writes it to the DB.
The problem: Node's CLI parses argv looking for flags before the script gets a chance to run. Anything starting with - is treated as a Node command-line option. So a password that starts with -B looks to Node like an unknown flag: it prints node: bad option: -B and exits with code 9.
Because the parent script has set -e, the moment Node returns non-zero, the whole script aborts. .env never gets written. The service can't start. The user can't log in. Welcome email goes out anyway with credentials that lead to a dead login page.
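If you want to see the failure in isolation, here is a minimal reproduction that runs anywhere Node is installed. The password value is made up, and the hash script is replaced by a placeholder that just prints its argument:

```bash
#!/usr/bin/env bash
set -e                                   # same abort-on-error behavior as the provision script

PASSWORD='-BATDNqXyzExample'             # hypothetical password with the fatal leading dash

# Node parses the leading "-B..." as a CLI option, rejects it as a bad option,
# and exits non-zero (code 9) before the inline script ever runs.
node -e 'console.log("argv[1] =", process.argv[1])' "$PASSWORD"

# Because of set -e, execution never reaches this point; in the real pipeline,
# this is roughly where .env would have been written.
echo "NEWTON_ENV=never-written"
```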
Beautiful, quiet, terrible bug.
Why the Bug Slept for a Month
The harder question: this bug had been in the code since day one of Newton. Why did it take a month to surface?
The auto-provision system uses Python's secrets.token_urlsafe(16) to generate passwords. token_urlsafe uses a 64-character alphabet: A-Z, a-z, 0-9, -, _.
Probability the first character is - or _: 2 / 64 = 3.125%.
So statistically, roughly 1 in 32 customers would hit this. With 17 prior customers, all of them lucked out — every single random token started with a letter or digit. (The odds of 17 clean runs in a row are (62/64)^17, about 58%, so a quiet first month was actually the most likely outcome.) Customer 18 was the first to roll a dash.
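If you would rather measure that than trust the arithmetic, a throwaway shell snippet (assuming python3 is on the box) confirms it empirically:

```bash
python3 - <<'EOF'
import secrets

# How often does a token_urlsafe(16) password start with '-' or '_'?
N = 100_000
hits = sum(secrets.token_urlsafe(16)[0] in "-_" for _ in range(N))
print(f"{hits / N:.4f}")   # hovers around 0.0312, i.e. roughly 1 in 32 signups
EOF
```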
This is the kind of bug a developer almost never finds in a dev environment, because nobody runs the full end-to-end pipeline a hundred times to chase a 3% edge case. You only see it in production, against real users, when the random number generator finally lands on the unlucky byte.
The Fix Was Two Characters
The repair was tiny: add a -- separator before the user-input argument.
node -e "...hash script..." -- "$PASSWORD"
UNIX convention: -- means "stop parsing flags here, everything after this is a positional argument." With it in place, Node stops trying to interpret the password as a flag and just passes it through.
Before applying the fix to production, the AI ran a reproduction test:
node -e '...' '-BATDNqXyz...' → exit 9, broken
node -e '...' -- '-BATDNqXyz...' → exit 0, works
Two characters. Confirmed via real reproduction. Same instinct I saw when my AI wrote 11 regression tests before patching a chat-truncation bug across 6 customer servers — verify before deploy, every time.
But the AI Didn't Stop at the Fix
This is the part I want to highlight.
A "use AI to write code" tool would have stopped at the two-character patch. Bug reported, bug fixed, ticket closed. My AI agent kept going, because the right question after this kind of failure isn't "how do I make this specific symptom go away" — it's "if a different 3% bug shows up tomorrow, will I find it faster?"
So in the same hour, the agent added seven layers of defense to the provisioning pipeline:
- Master provision.sh moved to /opt/newton/setup/. Previously the script was baked into the Hetzner snapshot — fixing it required rebuilding the snapshot in 3 regions. Now it's scp'd onto the new VPS at provision time. Future fixes apply on the next signup, no snapshot rebuild.
- EXIT trap that prints the dying line. If the script dies again, the log will include LINENO + BASH_COMMAND for the exact line that failed. No more guessing. (The trap, the sanity check, and the retry are sketched after this list.)
- Sanity check before returning success. Before provision.sh exits 0, it verifies .env exists and systemctl is-active newton.service returns active. If either fails, exit with a clear error code.
- One retry with a 30-second backoff. Network glitches happen. A single retry catches transient failures without masking real bugs.
- Full stdout/stderr logging. Every provision run now writes to /var/log/newton/provision-{server_id}-{ts}.log. No more > /dev/null hiding the truth.
- Suppress the welcome email on provisioning failure. This was the most important one to me. Previously the welcome email fired regardless of whether the server came up. So a customer with a broken server would get an email with credentials that lead to a 502 page. Now if provisioning fails, status is set to error and the email is held.
- Telegram alert to me on every failure. If the next 3% bug shows up, I know within seconds — not at 5am from a customer email.
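To make the first three of those concrete, here is a minimal sketch of what the trap, the sanity check, and the retry can look like in plain bash. It is not Newton's actual provision.sh: the unit name and log directory come from this post, the step names are invented, and I've attached the line-reporting to an ERR trap because that is where $LINENO reliably points at the failing command.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Full stdout/stderr logging; real runs use provision-{server_id}-{ts}.log.
mkdir -p /var/log/newton
exec > >(tee -a /var/log/newton/provision-example.log) 2>&1

# Print the dying command and line if anything fails; no more guessing.
trap 'echo "FAILED at line $LINENO: $BASH_COMMAND" >&2' ERR

retry_once() {                    # one retry with a 30-second backoff
  "$@" && return 0
  echo "retrying in 30s: $*"
  sleep 30
  "$@"
}

some_flaky_step() {               # hypothetical stand-in for a real provisioning step
  echo "installing and configuring things..."
}

retry_once some_flaky_step

# Sanity check before claiming success: config exists and the service is actually up.
[ -f /opt/newton/.env ] || { echo "no .env was written" >&2; exit 1; }
systemctl is-active --quiet newton.service || { echo "newton.service is not active" >&2; exit 1; }

echo "provision OK"
```

The exec line is what retires the > /dev/null habit: everything the script prints, including the trap's final complaint, lands in the per-run log.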
From the moment I woke up to the moment all of that was deployed and tested in production: under an hour.
One more thing the agent flagged while it was digging around upstream of this bug: a separate problem on the signup form itself. The password field customers typed into had been purely decorative since launch — and instead of fixing it, the agent argued for deleting it. That's a separate post.
Two Lessons — One for Devs, One for Founders
I'm publishing this story because there are two layers worth taking away.
For developers: any time you're passing a random secret as a CLI argument, use a -- separator. URL-safe base64 looks innocent until your RNG lands on a leading dash and silently flips a flag. This applies to Node, to Python's argparse, to git, to almost every CLI on your system. Two characters of paranoia save a 5am pager.
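The same two characters protect you well beyond Node. A few everyday examples, with the file and pattern values invented for illustration:

```bash
PATTERN='-rf'                      # imagine a user-supplied string with a leading dash

grep -- "$PATTERN" notes.txt       # without --, grep would try to parse -rf as its own flags
rm -- "$PATTERN"                   # removes a file literally named "-rf" instead of reading it as options
git checkout -- "$PATTERN"         # -- tells git everything after it is a path, not a ref or option
```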
For non-developer founders: a 3% bug like this is invisible without 24/7 monitoring. It doesn't affect every customer, so most customers won't report it. The ones it hits won't always tell you — they'll just refund and disappear. This is exactly the failure mode that kills early SaaS: not the catastrophic outages, but the quiet leaks that look like normal churn.
The thing that catches it is an agent that's actually watching the system, not waiting for a ticket. A chatbot stops the moment you close the tab. An agent on your own server runs whether you're asleep, on a flight, or eating dinner.
Why Newton Is Managed
People ask all the time how Newton is different from "rent a VPS and install Claude Code yourself." This story is the cleanest answer I have.
If those 18 customers were self-hosting, the bug would still exist on their servers — including the customer who got hit. They'd have to know there's a fix. They'd have to git pull and re-run a provision step. Most of them wouldn't, because they're not developers, and the bug would just sit there waiting for the next time a random password rolls a dash.
Because Newton is a managed private server, the customer didn't have to know. The fix went out the same hour, and the next signup got the patched script automatically. Hotfixes flow out to every customer server the moment they're ready.
The model is: you own the server, you own the data, you own the agent. We keep it healthy. You build your business. We chase the 3% bugs.
If you want an AI agent that catches this kind of edge case at 5am while you sleep — fixes itself, deploys the patch, and adds defensive guards so the next bug surfaces faster — try Newton. Setup is about 10 minutes, and the kind of investigation in this post is included.
— Pond
