Engineering notes
Production AI and agents, cloud architecture, and what it actually takes to ship products that hold up. No fluff, just what I've learned building and shipping.
- 3 min read
Why agent costs explode: the quadratic context tax
Every tool-call an LLM agent makes re-sends the whole conversation. That turns a linear-looking feature into a quadratic bill. Here's the math, and how to cut it.
AI, LLM, cost, agents
- 3 min read
Stop reporting uptime. Start spending an error budget.
Uptime percentage is a vanity metric. An error budget turns it into a decision: when to ship features, and when to stop and fix reliability.
reliability, SRE, SLO
- 3 min read
How many instances do you actually need? Little's Law in one afternoon
Most capacity plans are a guess. Little's Law turns requests per second and latency into the number of requests in flight, and from there into the fleet size you actually need, without over-provisioning.
capacity, scaling, performance
- 3 min read
A latency budget you can defend in review
A p95 target is meaningless until you divide it up. A latency budget carves it across the hops a request takes, so performance is a number you can defend.
performance, latency
- 3 min read
Claude vs GPT-4o vs Gemini: a real cost breakdown
List prices hide the real story. Here's how Claude, GPT-4o and Gemini actually compare once you account for context, tool-calls and the work each model gets done per dollar.
AI, LLM, cost
- 2 min read
Cutting LLM spend: caching, batching & context reuse
Five levers that routinely halve an LLM bill without touching what the product does: prompt caching, batching, context trimming, model routing, and killing redundant calls.
AI, LLM, cost, GCP