"The API should feel fast" is not a target. Neither is "p95 under 250ms" until you've said where those 250 milliseconds are allowed to go. Without that, every new feature quietly borrows from a budget nobody is tracking, and one day the endpoint is at 600ms and no single change looks guilty.
A latency budget fixes this the same way a financial budget fixes spending: you decide the total up front, allocate it, and then every request for more has to come from somewhere.
Start with the number you'll promise
Pick the end-to-end target you're willing to put in a contract or an SLO — say a 250ms p95. That's the whole budget. Now spend it across the hops a request actually passes through:
Network / TLS            20ms
CDN / edge               10ms
App / business logic     60ms
Database                 40ms
Cache                     5ms
External API             30ms
Serialization            10ms
-----------------------------
Total                   175ms (75ms headroom under 250ms)
Two things become obvious the moment it's written down. First, whether you're actually under budget. Second — and more useful — which single hop is eating the most of it. That dominant hop is almost always where the cheapest win lives.
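The same arithmetic is easy to keep next to the code it describes. Here is a minimal Python sketch using the example allocation above; the names `BUDGET_MS`, `TARGET_MS`, and `summarize` are made up for illustration and are not part of any tool mentioned in this post.

```python
# Per-hop allocation in milliseconds, copied from the example table.
BUDGET_MS = {
    "Network / TLS": 20,
    "CDN / edge": 10,
    "App / business logic": 60,
    "Database": 40,
    "Cache": 5,
    "External API": 30,
    "Serialization": 10,
}
TARGET_MS = 250  # the end-to-end p95 target from the example


def summarize(budget_ms: dict[str, int], target_ms: int) -> dict:
    """Return total spend, remaining headroom, and the dominant hop."""
    total = sum(budget_ms.values())
    biggest = max(budget_ms, key=budget_ms.get)
    return {
        "total_ms": total,
        "headroom_ms": target_ms - total,
        "over_budget": total > target_ms,
        "biggest_hop": biggest,
        "biggest_share": budget_ms[biggest] / total,
    }


if __name__ == "__main__":
    print(summarize(BUDGET_MS, TARGET_MS))
```

For the example numbers this reports a 175ms total, 75ms of headroom, and "App / business logic" as the biggest consumer, which is the hop to look at first.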
Build your own allocation with the Latency Budget Calculator: set the target, slide each hop, and it shows the total, the headroom, and the biggest consumer.
The budget is a contract
The point of writing it down is what happens next time. When a feature wants to add a 60ms call to a recommendations service, you don't argue about whether it's "too slow" in the abstract. You look at the budget:
- 75ms of headroom? It fits. Ship it, and now you have 15ms left.
- No headroom? The feature isn't blocked, but it has to pay for itself: something else gets faster, or moves off the critical path, or the call goes async. The budget turns a vague worry into a specific, solvable trade.
That's the whole trick. Performance stops being something you measure after the fact and then regret, and becomes a number the team allocates on purpose.
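The check itself is small enough to write down. The sketch below is one hypothetical way to express it in Python; `Decision` and `admit` are illustrative names, and a real review would also ask whether the new call's cost is a measured number or a guess.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    fits: bool
    # Headroom left after the feature; negative means the feature
    # has to find this many milliseconds somewhere else.
    headroom_after_ms: int


def admit(headroom_ms: int, feature_cost_ms: int) -> Decision:
    """Decide whether a new synchronous call fits in the remaining headroom."""
    remaining = headroom_ms - feature_cost_ms
    return Decision(fits=remaining >= 0, headroom_after_ms=remaining)


if __name__ == "__main__":
    print(admit(headroom_ms=75, feature_cost_ms=60))
    print(admit(headroom_ms=15, feature_cost_ms=60))
```

With 75ms of headroom the 60ms recommendations call fits and leaves 15ms. A second 60ms feature against that 15ms comes back 45ms short, and that shortfall is the trade the team has to make.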
Push work off the critical path
Most latency wins aren't about making a hop faster — they're about getting it out of the request entirely:
- Parallelize independent calls. Two 30ms calls in sequence cost 60ms; in parallel they cost 30ms. The budget only pays for the longest one.
- Cache the expensive, stable hops. A 40ms database read that's the same for every user for ten minutes shouldn't be on the hot path at all.
- Make it async. If the user doesn't need the result to render, it doesn't belong in the latency budget — fire it after the response.
Each of these buys back budget you can then spend on the features that actually need to be synchronous.
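To make the parallelization point concrete, here is a small Python sketch using `asyncio`. `fake_call` is a hypothetical stand-in that only sleeps; real downstream calls will have their own variance, so treat the timings as illustrative. `critical_path_ms` is an assumed helper showing how the budget would count a request whose stages each run their calls in parallel.

```python
import asyncio
import time


async def fake_call(latency_s: float) -> float:
    """Hypothetical stand-in for a downstream call; it only sleeps."""
    await asyncio.sleep(latency_s)
    return latency_s


async def sequential(latencies: list[float]) -> float:
    """Await each call in turn and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for latency in latencies:
        await fake_call(latency)
    return time.perf_counter() - start


async def parallel(latencies: list[float]) -> float:
    """Run all calls concurrently and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_call(latency) for latency in latencies))
    return time.perf_counter() - start


def critical_path_ms(stages: list[list[int]]) -> int:
    """Budget cost when stages run in order but the calls inside each
    stage run in parallel: the sum of each stage's slowest call."""
    return sum(max(stage) for stage in stages)


if __name__ == "__main__":
    calls = [0.03, 0.03]
    print(f"sequential: {asyncio.run(sequential(calls)) * 1000:.0f}ms")
    print(f"parallel:   {asyncio.run(parallel(calls)) * 1000:.0f}ms")
```

In budget terms, `critical_path_ms([[30], [30]])` is 60 while `critical_path_ms([[30, 30]])` is 30: the same two calls, charged only for the longer one once they run side by side.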
A budget for the mean, not the tail
One honest caveat: adding up per-hop numbers models the average path well, but tail latency compounds differently. Retries stack, a slow dependency drags the p99 far past the sum of means, and parallel calls are only as fast as their slowest branch. So budget the mean to make decisions, but measure the tail to catch the surprises — they rarely live where the average says they should.
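A quick simulation shows the gap between the sum of means and the tail. The Python sketch below treats each number in the example table as a hop's mean latency and draws every hop from an exponential distribution. That distribution is an assumption chosen only because it is simple and long-tailed; real hops will look different, so use your own measurements before drawing conclusions.

```python
import random
import statistics

MEAN_MS = [20, 10, 60, 40, 5, 30, 10]  # per-hop means, taken from the example table


def simulate(means_ms, requests=20_000, seed=1):
    """Sample end-to-end latencies. Each hop is drawn from an exponential
    distribution with the given mean (an assumption made for illustration)."""
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(1 / mean) for mean in means_ms)
        for _ in range(requests)
    ]


def percentile(samples, pct):
    """pct-th percentile (pct in 1..99) via statistics.quantiles."""
    return statistics.quantiles(samples, n=100)[pct - 1]


if __name__ == "__main__":
    totals = simulate(MEAN_MS)
    print(f"sum of means:  {sum(MEAN_MS)}ms")
    print(f"observed mean: {statistics.fmean(totals):.0f}ms")
    print(f"observed p95:  {percentile(totals, 95):.0f}ms")
    print(f"observed p99:  {percentile(totals, 99):.0f}ms")
```

Under this assumed distribution the observed mean tracks the 175ms sum closely, while the p95 and p99 land well above it. That is the reason to measure the tail directly rather than infer it from the table.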
Still, the discipline is what matters. A team that allocates its latency on purpose ships fast endpoints by default, because every millisecond has an owner and every regression has a place it has to come from. "Make it fast" becomes "here's the budget, where does your 60ms come from?" — and that's a question with an answer.
Latency also sets your fleet size: at a fixed request rate, every millisecond you cut means fewer requests in flight at once and, eventually, fewer instances. See Little's Law in one afternoon and the Throughput & Concurrency Calculator.
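As a rough sketch of that relationship in Python: Little's Law says average concurrency equals arrival rate times average latency. The request rate and per-instance concurrency below are hypothetical numbers, and the law applies to average latency, not to a p95.

```python
import math


def in_flight(requests_per_s: float, latency_ms: float) -> float:
    """Little's Law: average requests in flight = arrival rate x average latency."""
    return requests_per_s * latency_ms / 1000


def instances_needed(
    requests_per_s: float, latency_ms: float, per_instance_concurrency: int
) -> int:
    """Instances required if each one handles a fixed number of concurrent
    requests (per_instance_concurrency is an assumed capacity figure)."""
    return math.ceil(in_flight(requests_per_s, latency_ms) / per_instance_concurrency)


if __name__ == "__main__":
    for latency_ms in (250, 175):
        n = instances_needed(2000, latency_ms, per_instance_concurrency=50)
        print(f"{latency_ms}ms average latency -> {n} instances")
```

At an assumed 2,000 requests per second and 50 concurrent requests per instance, a 250ms average keeps 500 requests in flight and needs 10 instances; cutting it to 175ms drops that to 350 in flight and 7 instances.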