Ask three engineers how many instances a service needs and you'll get three answers, all of them round numbers, none of them derived. Then the service either falls over under a launch or quietly burns budget running at 8% CPU. Both are the same mistake: sizing by feel instead of by arithmetic.
The arithmetic is one line, and it's been sitting in queueing theory the whole time.
Little's Law: L = λ · W
The average number of requests in flight in any stable system equals the arrival rate times the average time each request spends in the system:
L = λ · W
L = concurrent requests in flight
λ = arrival rate (requests per second)
W = time in system (latency, in seconds)
It's almost suspiciously simple, and it doesn't care about your language, framework, or cloud. If you take 5,000 requests per second and each one takes 80ms, then on average you have:
L = 5000 × 0.08 = 400 requests in flight, on average
Four hundred. Not "a lot." Not "depends." Four hundred concurrent requests is the load your fleet has to hold at any instant, and every sizing decision flows from that one number.
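The same multiplication, as a minimal Python sketch. The function name `in_flight` and the choice to pass latency in milliseconds are illustrative assumptions, not something the law prescribes.

```python
def in_flight(arrival_rate_rps: float, latency_ms: float) -> float:
    """Little's Law: average concurrency L = lambda * W.

    latency_ms is converted to seconds so the units match
    arrival_rate_rps (requests per second).
    """
    return arrival_rate_rps * latency_ms / 1000


# The worked example from the text: 5,000 req/s at 80 ms each.
print(in_flight(5000, 80))  # prints 400.0
```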
Plug in your own throughput and latency in the Throughput & Concurrency Calculator and watch the fleet size fall out of the math instead of out of a meeting.
From concurrency to instances
Now divide. If one instance can safely hold 50 concurrent requests, then in the limit it clears:
per-instance throughput = 50 / 0.08s = 625 requests per second
To serve 5,000 req/s you need enough instances to cover that — and you never want to run flat out, so size for a target utilization (say 70%) to leave headroom for spikes:
instances = ceil( 5000 / (625 × 0.70) ) = ceil(11.4) = 12
Twelve instances, with a reason behind the number you can put in a design doc and defend in review. Not "let's start with 10 and see."
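The division above fits in a few lines of Python. This is a sketch under the article's assumptions (50 safe concurrent requests per instance, 70% target utilization). The name `instances_needed` and its parameters are hypothetical, chosen for illustration.

```python
import math


def instances_needed(
    arrival_rate_rps: float,
    latency_ms: float,
    per_instance_concurrency: int,
    target_utilization: float = 0.70,
) -> int:
    """Minimum fleet size derived from Little's Law plus headroom."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Throughput one instance clears when every concurrency slot is busy.
    per_instance_rps = per_instance_concurrency * 1000 / latency_ms
    # Only plan to use a fraction of that, leaving room for spikes.
    usable_rps = per_instance_rps * target_utilization
    return math.ceil(arrival_rate_rps / usable_rps)


# 5,000 req/s, 80 ms per request, 50 concurrent requests per instance.
print(instances_needed(5000, 80, 50))  # prints 12
```

Rounding up with `math.ceil` matters: 11.4 instances means 12, because the fractional instance still has to exist.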
The lever almost everyone misses
Look again at L = λ · W. Your traffic λ is mostly given: it's demand. The variable you actually control is W, the latency. And it's multiplicative:
Halve the latency and you halve the concurrency — and halve the fleet.
A 40ms optimization on an 80ms request isn't a nice-to-have; at 5,000 req/s it's six fewer instances, every hour, forever. That's why the highest-leverage capacity work is usually a profiler session, not a bigger autoscaling ceiling. The cheapest instance is the one you didn't need because the request got faster.
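To see the lever move, here is a small sweep over latency with traffic held fixed. It is a sketch that reuses the article's numbers (5,000 req/s, 50 concurrent requests per instance, 70% utilization) as defaults. The helper name `instances_needed` is hypothetical.

```python
import math


def instances_needed(rate_rps, latency_ms,
                     per_instance_concurrency=50, target_utilization=0.70):
    """Fleet size from Little's Law with utilization headroom."""
    per_instance_rps = per_instance_concurrency * 1000 / latency_ms
    return math.ceil(rate_rps / (per_instance_rps * target_utilization))


# Same demand, progressively faster requests.
for latency_ms in (80, 60, 40, 20):
    print(f"{latency_ms:>3} ms -> {instances_needed(5000, latency_ms)} instances")

# Output:
#  80 ms -> 12 instances
#  60 ms -> 9 instances
#  40 ms -> 6 instances
#  20 ms -> 3 instances
```

Because of the rounding up, halving latency will not always halve the fleet exactly, but the trend is linear in W.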
Latency is the lever, so it's worth budgeting deliberately — see A latency budget you can defend in review and the Latency Budget Calculator.
Where the simple model needs a footnote
Little's Law gives you the floor, not the whole story. Real systems add things the identity doesn't model:
- Queueing isn't linear near saturation. As utilization climbs past ~80%, latency rises sharply (that's why we sized for 70%, not 100%).
- Connection and pool limits can cap concurrency below what CPU allows — a database with 100 connections won't hold 400 in-flight queries.
- GC pauses, cold starts, and noisy neighbors all inflate W in ways an average hides; size against your p95, not your mean.
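Two of these footnotes are easy to put numbers on. The sketch below uses made-up latency samples, a hand-rolled nearest-rank percentile, and the assumption that a request holds its database connection for the full 80 ms. All names and data here are hypothetical. Treat it as an illustration of the gap between mean and p95, not as a measurement recipe.

```python
def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of the samples are at or below it (pct is an integer)."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling division
    return ordered[max(rank, 1) - 1]


def pool_throughput_ceiling(pool_size, hold_ms):
    """Little's Law rearranged for a pool: lambda_max = L_max / W."""
    return pool_size * 1000 / hold_ms


# Hypothetical latency samples (ms): mostly near 80, two slow outliers.
latencies_ms = [70, 72, 75, 78, 80, 80, 82, 85, 88, 90,
                74, 76, 79, 81, 83, 77, 84, 86, 150, 240]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile_nearest_rank(latencies_ms, 95)

rate_rps = 5000
print(f"mean {mean_ms} ms -> {rate_rps * mean_ms / 1000} in flight")  # 91.5 ms -> 457.5
print(f"p95  {p95_ms} ms -> {rate_rps * p95_ms / 1000} in flight")    # 150 ms -> 750.0

# A 100-connection pool, each connection held for 80 ms (assumed).
print(pool_throughput_ceiling(100, 80))  # prints 1250.0
```

With these samples, sizing against p95 plans for 750 in-flight requests rather than 457.5, and the 100-connection pool tops out at 1,250 req/s no matter how many instances sit in front of it.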
None of that invalidates the math — it just means the number you get is the minimum fleet, the starting point you then stress-test. But starting from a derived minimum beats starting from a guess every time. Capacity planning isn't a dark art. It's one multiplication and one division, done before the launch instead of during it.