Why agent costs explode: the quadratic context tax

3 min read · AI, LLM, cost, agents

Most teams budget for LLM features like they budget for an API call: price per request, multiply by traffic, done. Then they ship an agent (something that calls tools, reads results, and decides what to do next) and the bill arrives three to ten times higher than the spreadsheet said.

The gap is almost always the same thing. It isn't a pricing surprise. It's the shape of the workload.

A single agent request is not a single model call

When an agent uses tools, one user request fans out into a loop:

  1. The model reads the prompt and decides to call a tool.
  2. Your code runs the tool and appends the result to the conversation.
  3. The model reads the whole conversation again (original prompt, its own previous output, and the tool result) and decides what to do next.
  4. Repeat until it answers.

Three tool-calls isn't one model call. It's four: one call per tool decision, plus one for the final answer. Crucially, each call re-sends everything that came before it. LLMs are stateless; the only way the model "remembers" step two at step four is that you pay to send step two's tokens again.
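To make the fan-out concrete, here is a minimal sketch that counts what each model call has to read. It does not call any real model; the function name and the 500-token tool result are assumptions for illustration, and the 2,000 and 800 figures are borrowed from the example later in this post.

```python
def simulate_agent_loop(base_prompt, output_per_step, tool_result, tool_calls):
    """Return the input-token count of each model call in one agent request."""
    context = base_prompt  # tokens in the conversation so far
    inputs_per_call = []
    for _ in range(tool_calls):
        # The model reads the whole conversation, then asks for a tool.
        inputs_per_call.append(context)
        # Its own output and the tool result are appended for the next call.
        context += output_per_step + tool_result
    # Final call: the model reads everything once more and answers.
    inputs_per_call.append(context)
    return inputs_per_call


# tool_result=500 is a hypothetical size, not a figure from this post.
calls = simulate_agent_loop(base_prompt=2000, output_per_step=800,
                            tool_result=500, tool_calls=3)
print(len(calls))  # 4 model calls for 3 tool-calls
print(calls)       # [2000, 3300, 4600, 5900]
print(sum(calls))  # 15800 input tokens billed for one user request
```

The last call alone reads almost three times the original prompt, and every earlier call has already been paid for.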

The math

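Let b be your base prompt size, t the number of tool-calls, and output and tool_result the tokens the model generates and the tool returns at each step. The model is called t+1 times, and each call re-sends the prompt plus everything generated so far. Total input tokens for one request land around: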

input ≈ b·(t+1)  +  (output + tool_result)·(t·(t+1)/2)

That second term is the tax. The t·(t+1)/2 is the sum 1 + 2 + … + t, and it grows with the square of the tool-calls, not linearly. Double the tool-calls and you roughly quadruple the re-sent history you pay for, while the base-prompt term only about doubles.
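A quick way to sanity-check the formula is to compare it against a call-by-call sum. This is a sketch under the post's own simplification that every step adds the same number of tokens; the function names and the 1,300-token step size (800 output plus a hypothetical 500-token tool result) are assumptions for illustration.

```python
def total_input_tokens(b, step_tokens, t):
    """Closed form: b*(t+1) + step_tokens * t*(t+1)/2."""
    # t*(t+1) is always even, so integer division is exact here.
    return b * (t + 1) + step_tokens * t * (t + 1) // 2


def total_input_tokens_by_sum(b, step_tokens, t):
    """Same quantity, adding up the t+1 model calls one at a time."""
    return sum(b + k * step_tokens for k in range(t + 1))


def resent_history_tokens(step_tokens, t):
    """Only the second term: history re-sent across all calls."""
    return step_tokens * t * (t + 1) // 2


# The closed form and the explicit sum agree.
for t in (3, 6, 12):
    assert total_input_tokens(2000, 1300, t) == total_input_tokens_by_sum(2000, 1300, t)

# Doubling tool-calls from 6 to 12:
history_ratio = resent_history_tokens(1300, 12) / resent_history_tokens(1300, 6)
total_ratio = total_input_tokens(2000, 1300, 12) / total_input_tokens(2000, 1300, 6)
print(round(history_ratio, 2))  # 3.71: the re-sent history nearly quadruples
print(round(total_ratio, 2))    # 3.08: the total grows slower because b is only linear
```

How close the total gets to 4× depends on how large the per-step tokens are relative to the base prompt.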

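A concrete example: a 2,000-token prompt, 800-token answers, 6 tool-calls on a premium model isn't "6× a simple call." The prompt alone is sent 7 times (14,000 tokens), and the accumulated answers add another 800 × 21 = 16,800, so you are past 30,000 input tokens per request before counting a single tool result. At frontier-model input prices that can cross six figures a month at modest traffic.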
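To turn tokens into a monthly figure you need a price and a traffic number, neither of which this post specifies. The sketch below uses placeholder values for both: $10 per million input tokens, 300,000 requests a month, and a 500-token tool result are all assumptions, not real provider pricing. Substitute your own.

```python
def monthly_input_cost(b, output_tokens, tool_result_tokens, t,
                       requests_per_month, usd_per_million_input):
    """Return (input tokens per request, monthly input cost in USD)."""
    step_tokens = output_tokens + tool_result_tokens
    tokens_per_request = b * (t + 1) + step_tokens * t * (t + 1) // 2
    cost = tokens_per_request * requests_per_month * usd_per_million_input / 1_000_000
    return tokens_per_request, cost


tokens, cost = monthly_input_cost(
    b=2000,
    output_tokens=800,
    tool_result_tokens=500,       # hypothetical tool result size
    t=6,
    requests_per_month=300_000,   # hypothetical traffic
    usd_per_million_input=10.0,   # hypothetical price, not a quoted rate
)
print(tokens)  # 41300 input tokens per request
print(cost)    # 123900.0 USD per month, input tokens only
```

Output tokens are billed on top of this, so the figure is a floor for the request, not the whole bill.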

You can plug your own numbers into the AI Agent Cost Calculator: change the tool-call slider and watch the monthly figure move non-linearly. That curve is the whole point.

Why it's easy to miss

  • Demos hide it. A demo runs one happy-path request with one tool-call. The quadratic term is invisible until tool-calls climb in production.
  • Output tokens look scary, input tokens are the bill. Teams optimize response length. But on agents, re-sent input context usually dominates, often 3–5× the output cost.
  • RAG stacks on top. Retrieval injects documents into the base prompt b, so every one of those re-sends gets heavier too.

Cutting the tax

You rarely need a cheaper model. You need a smaller, smarter loop:

  • Prompt caching. Providers will cache a stable prefix (system prompt, tools, retrieved context) so re-sends bill at a fraction of the input price. This is the single biggest lever for agents and it targets exactly the term that's exploding.
  • Fewer, fatter tools. Five narrow tools that each need a round-trip cost more turns than one tool that returns what the model actually needs. Every turn you remove is removed from the squared term.
  • Trim what you re-send. Summarize or drop stale tool results instead of carrying the full transcript to the final turn. Cap the loop.
  • Route by difficulty. Use a small model for the routing/tool-selection turns and the expensive model only for the final synthesis.
  • Batch and pre-compute. Embeddings and any deterministic steps don't belong in the hot agent loop.
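The first and third levers can be folded into the same token model. The sketch below makes two loud assumptions: that a provider can serve everything sent on the previous call from cache at a fixed fraction of the normal input price (real providers differ on what is cacheable, how long entries live, and whether cache writes carry a surcharge), and that trimming means carrying only the most recent steps. The 0.1 cache rate, the 500-token tool result, and both function names are hypothetical.

```python
def billed_with_caching(b, step_tokens, t, cached_rate):
    """Billed-equivalent input tokens when previously sent tokens hit the cache.

    cached_rate is the price of a cached token relative to a normal input
    token; 1.0 means no caching at all.
    """
    billed = float(b)  # first call: nothing is cached yet
    for k in range(1, t + 1):
        previously_sent = b + (k - 1) * step_tokens  # assumed cache hit
        billed += cached_rate * previously_sent + step_tokens  # new tokens at full price
    return billed


def billed_with_trimming(b, step_tokens, t, keep_last=None):
    """Input tokens when only the most recent keep_last steps are carried."""
    total = 0
    for k in range(t + 1):
        kept_steps = k if keep_last is None else min(k, keep_last)
        total += b + kept_steps * step_tokens
    return total


b, step, t = 2000, 1300, 6  # step = 800 output + hypothetical 500 tool result
baseline = billed_with_trimming(b, step, t)             # full transcript, no cache
cached = billed_with_caching(b, step, t, cached_rate=0.1)
trimmed = billed_with_trimming(b, step, t, keep_last=2)
print(baseline, round(cached), trimmed)
```

The two functions are kept separate on purpose: dropping old steps changes the prefix, which in practice can invalidate cached history, so the savings do not simply multiply.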

In practice, caching plus a tighter loop routinely takes a five-figure monthly estimate down by more than half, without changing what the product does.

The takeaway

Agent cost isn't "tokens × requests." It's a loop whose context grows with the square of its tool-calls. Model that curve before you ship, design the loop to keep t small, and cache the prefix that gets re-sent. Get those right and the scary number on the calculator becomes a line item you can defend.

Want a real workload pressure-tested? That's exactly what a production-readiness audit digs into.