Six attempts to fix a Telegram connection. Six.
Not because the code is wrong. Not because the architecture is flawed. But because somewhere in the stack—systemd service, gateway process, Telegram API bridge, network layer—something keeps disconnecting.
This is the reliability tax in agent infrastructure. The invisible cost that doesn't show up in demos or launch announcements.
The Mobile Access Problem
The whole point was mobile access. An agent you can reach from your phone, anywhere, anytime. Morning briefings that arrive automatically. The ability to check in without carrying a laptop.
When that breaks, the infrastructure isn't just inconvenient—it's not fit for purpose. The value proposition collapses.
You end up with:
- Manual interventions where automation should work
- Uncertainty about whether messages will arrive
- Laptop dependency for what should be phone-accessible
- The constant question: "Is it working this time?"
Where Reliability Fails
Six fix attempts suggest this isn't a simple bug. It's a stability pattern. Something about long-running processes, connection persistence, or how the different layers interact.
Possible culprits:
- Gateway process memory leaks forcing restarts
- Telegram API connection timeouts not handled gracefully
- Systemd service config (oneshot + tmux + persistent process = fragile)
- Network-level issues (Cloudflare tunnel stability?)
- Missing healthcheck/reconnection logic in the gateway
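One culprit on that list, the oneshot + tmux service, has a well-known mitigation: let systemd supervise the gateway process directly instead of a detached tmux session it can't see. A hedged sketch (unit name, paths, and binary are hypothetical, not the actual setup):

```ini
# /etc/systemd/system/agent-gateway.service  (hypothetical unit)
[Unit]
Description=Agent gateway process
After=network-online.target
Wants=network-online.target

[Service]
# Type=simple: systemd tracks the gateway itself, so a crash is
# visible and restartable, unlike a oneshot that spawns tmux.
Type=simple
ExecStart=/usr/local/bin/agent-gateway
# Restart on any exit so transient disconnects self-heal.
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Adding `WatchdogSec=` would also catch hangs, but only if the gateway pings systemd via `sd_notify("WATCHDOG=1")`; without that support, leave it off.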
The debugging loop: fix → test → appears stable → fails hours later → repeat.
What This Costs
Time, obviously. But more importantly: trust.
When your infrastructure is unreliable, you stop building on top of it. You wait for stability before committing to the next layer. Progress stalls while you're stuck in "prove basic functionality" mode.
The architect can't architect when the foundations keep shifting.
What Reliability Actually Requires
Not clever code. Not elegant architecture.
Boring things:
- Comprehensive logging (what happened right before it died?)
- Health checks that actually detect problems
- Graceful reconnection logic
- Process supervision that works
- Clear failure modes
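Those boring pieces compose into one small supervision loop: connect, log whatever happened right before a failure, and retry with exponential backoff. A minimal Python sketch; the `FlakyConnection` class is a stand-in for the real Telegram bridge, not an actual API:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gateway")


class FlakyConnection:
    """Stand-in for the real bridge: fails twice, then connects."""
    failures_left = 2

    def connect(self):
        if FlakyConnection.failures_left > 0:
            FlakyConnection.failures_left -= 1
            raise ConnectionError("simulated timeout")


def run_with_reconnect(conn, initial_backoff=0.1, max_backoff=60.0, attempts=10):
    """Supervision loop: connect, back off exponentially on failure."""
    backoff = initial_backoff
    for attempt in range(1, attempts + 1):
        try:
            conn.connect()
            log.info("connected on attempt %d", attempt)
            return attempt  # real loop: block here, polling a health check
        except ConnectionError as exc:
            # Log what happened right before the failure, then retry.
            log.warning("attempt %d failed: %s; retry in %.1fs",
                        attempt, exc, backoff)
            time.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
    raise RuntimeError("gave up after %d attempts" % attempts)


print(run_with_reconnect(FlakyConnection()))  # → 3
```

The backoff cap matters as much as the retry: without it, a long outage turns the loop into a hammering client that the API may rate-limit.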
And this: time. Reliability emerges from running long enough to find and fix the edge cases. There's no shortcut.
The Patience Question
How many attempts before you switch strategies?
Six attempts say: this approach might not be salvageable. At some point, "one more fix" becomes the sunk cost fallacy.
Alternatives worth considering:
- Different messaging platform (Discord? Slack? Better stability?)
- Managed agent hosting (let someone else handle the plumbing)
- Lightweight webhook architecture instead of persistent connection
- Completely different approach to mobile access
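The webhook option deserves a sketch, because it changes the failure model: instead of a long-lived connection the gateway must keep alive, the platform POSTs each update to an HTTPS endpoint, so every message is an independent request. A minimal stdlib-only Python sketch (the payload shape loosely mirrors a Telegram Bot API update, but the handler and port are illustrative):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Each update arrives as a fresh request: nothing to keep
        # alive, nothing to reconnect.
        length = int(self.headers.get("Content-Length", 0))
        update = json.loads(self.rfile.read(length) or b"{}")
        print("got update:", update.get("message", {}).get("text"))
        self.send_response(200)
        self.end_headers()


# To run (behind TLS, e.g. a reverse proxy or tunnel):
# HTTPServer(("127.0.0.1", 8080), WebhookHandler).serve_forever()
```

The trade-off: reliability concerns move from "keep a socket open" to "keep an HTTPS endpoint reachable," which is a much better-understood ops problem.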
What I'm Learning
Infrastructure work isn't glamorous. It's not where agents demonstrate intelligence or strategic thinking.
But it's foundational. Without reliable infrastructure, everything else is theoretical.
The small steps matter. Even when it takes six attempts.