
Stop reporting uptime. Start spending an error budget.

3 min read · reliability · SRE · SLO

Most status pages report uptime as a single proud number: 99.95% last month. It looks like a grade. It isn't one — it's a budget statement, and reading it as a grade is why so many teams either over-invest in reliability they don't need or get blindsided by the outage they didn't see coming.

The fix is small and changes how a whole team makes decisions: stop reporting the uptime you achieved, and start spending the error budget you're allowed.

An error budget is just the gap to 100%

Pick a target — a Service Level Objective. Say 99.9% of requests should succeed over 30 days. The error budget is everything left over:

error budget = 1 − SLO = 0.1%  ≈  43 minutes of downtime per 30 days
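The same arithmetic as a minimal Python sketch. The function name `error_budget_minutes` and its parameters are illustrative rather than taken from any library, and the sketch assumes a time-based SLO over a fixed window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an SLO given as a fraction (0.999 = 99.9%)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


print(round(error_budget_minutes(0.999), 1))  # 43.2
```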

That 0.1% is not a failure waiting to happen. It's a resource you're meant to spend: on risky deploys, migrations, a chaos experiment, a Friday release. A month where you used none of your error budget doesn't mean you're winning; it often means you're shipping too slowly, or that your target is looser than the reliability you actually deliver.

Translate your own target into real minutes — and a request budget — with the SLO & Error Budget Calculator. "Five nines" stops sounding aspirational the moment you see it's 26 seconds a month.
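If you'd rather do the conversion by hand, here is a rough sketch of both budgets. The helper names are hypothetical, the 10 million requests figure is made up for illustration, and SLOs are passed as percent strings so that `Decimal` keeps the arithmetic exact.

```python
from decimal import Decimal


def budget_fraction(slo_percent: str) -> Decimal:
    """Error budget as a fraction, from an SLO given as a percent string like "99.9"."""
    return (Decimal(100) - Decimal(slo_percent)) / Decimal(100)


def request_budget(slo_percent: str, total_requests: int) -> int:
    """Requests allowed to fail in the window (truncated: a partial request cannot fail)."""
    return int(budget_fraction(slo_percent) * total_requests)


def budget_seconds(slo_percent: str, window_days: int = 30) -> Decimal:
    """Allowed downtime in seconds over the window."""
    return budget_fraction(slo_percent) * (window_days * 24 * 60 * 60)


# Hypothetical traffic: 10 million requests in the 30-day window.
print(request_budget("99.9", 10_000_000))  # 10000
print(float(budget_seconds("99.999")))     # 25.92
```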

The burn rate tells you when to stop

The budget by itself is static. What makes it operational is the burn rate: how fast you're spending it, relative to the pace that would use up exactly the whole budget by the end of the window.

  • Burning at 1× means you'll spend exactly the month's budget over the month. Fine. That's the budget doing its job.
  • Burning at 14.4× means you'll exhaust a 30-day budget in about two days (2% of it every hour). That's the classic threshold for a fast-burn page: something is actively on fire.
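The burn-rate arithmetic in the bullets above can be sketched as follows. Both function names are illustrative, and the sketch assumes you can already measure an error rate over some recent interval.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no budget to burn")
    return observed_error_rate / budget


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is gone if the current burn rate holds."""
    return float("inf") if rate <= 0 else window_days / rate


# A 1.44% error rate against a 99.9% SLO burns at roughly 14.4x.
print(round(burn_rate(0.0144, 0.999), 1))  # 14.4
print(round(days_to_exhaustion(14.4), 2))  # 2.08
```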

This gives you a policy that writes itself, with no debate in the moment:

  1. Budget remaining? Ship. Take the risk. That's what it's for.
  2. Budget spent? Freeze feature work and spend the next cycle on reliability until you're back in the black.
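The two rules above are simple enough to encode directly. This is a minimal sketch, not a prescribed implementation: the `Decision` enum and `release_decision` function are hypothetical names, and the budget is assumed to be tracked in minutes.

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    FREEZE = "freeze"


def release_decision(budget_minutes: float, spent_minutes: float) -> Decision:
    """Ship while any budget remains; freeze feature work once it is spent."""
    remaining = budget_minutes - spent_minutes
    return Decision.SHIP if remaining > 0 else Decision.FREEZE


print(release_decision(43.2, 10.0).value)  # ship
print(release_decision(43.2, 50.0).value)  # freeze
```

A check like this could gate a deploy pipeline, which is one way to keep the decision out of the meeting room.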

The argument about "should we slow down and fix things" stops being a matter of opinion or seniority. The budget already answered it.

Why this beats an uptime report

A reliability conversation built on uptime percentages drifts toward vibes — "feels stable lately," "we had a rough week." A conversation built on an error budget is concrete and forward-looking:

  • It sets an explicit, agreed target instead of an implicit "as close to 100% as possible," which is both impossible and ruinously expensive.
  • It makes the cost of unreliability visible before the incident, not after.
  • It gives product and engineering a shared number to plan against, so "move fast" and "keep it up" stop being in permanent tension.

And it scales down cleanly. You don't need a platform team or a tracing stack to start — you need one SLO that maps to real user pain, a way to measure it, and the discipline to act on the budget when it runs low.

Where teams get it wrong

  • Targeting 100%. There's no budget at 100%, so there's no room to ship and no signal when you're in trouble. Pick a number you can actually defend.
  • An SLO nobody acts on. A budget you never enforce is just a dashboard. The value is entirely in the policy — freeze when it's gone.
  • Measuring the wrong thing. "The server was up" isn't the same as "users got a fast, correct response." Tie the SLO to the request outcome users feel.
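To make the last point concrete, here is a rough sketch of an SLI computed from request outcomes rather than server liveness. Everything in it is an assumption for illustration: the `RequestOutcome` type, the 300 ms latency threshold, and the use of a non-5xx status as a stand-in for "correct" (real correctness checks are usually service-specific).

```python
from dataclasses import dataclass


@dataclass
class RequestOutcome:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time the user waited for the response


def is_good(req: RequestOutcome, latency_slo_ms: float = 300.0) -> bool:
    """A request counts as good only if it succeeded and was fast enough."""
    return req.status < 500 and req.latency_ms <= latency_slo_ms


def sli(requests: list[RequestOutcome], latency_slo_ms: float = 300.0) -> float:
    """Fraction of good requests; an empty window is treated as fully healthy."""
    if not requests:
        return 1.0
    good = sum(is_good(r, latency_slo_ms) for r in requests)
    return good / len(requests)


sample = [
    RequestOutcome(200, 120.0),  # good
    RequestOutcome(200, 900.0),  # server was up, but too slow for the user
    RequestOutcome(503, 40.0),   # fast, but failed
    RequestOutcome(200, 80.0),   # good
]
print(sli(sample))  # 0.5
```

An uptime check would have counted all four of these as "up"; the request-level view counts only two as good.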

Reliability isn't a number you brag about at the end of the month. It's a budget you spend deliberately all month long — and the teams that treat it that way ship faster and break less, because they finally agree on what "enough" means.

Once you've set a target, put a price on missing it: the Downtime Cost Calculator turns the minutes you're allowed into the dollars at stake — which is usually what gets the reliability work funded.
