AI Integration · Cost Management
LLM API Costs Are Out of Control: A Production Guide to Cutting Your Bill
AI features ship fast. Then the monthly API bill arrives. Here's a systematic approach to understanding and reducing LLM costs without breaking the product.
Anurag Verma
7 min read
You build the feature. Users like it. Then the invoice lands and you spend the next two hours explaining to finance why your app spent $8,000 on an API last month.
This story repeats constantly across teams that moved fast with LLM integrations without building cost controls first. The inputs are simple: token counts and prices. The outputs, for anything with non-trivial traffic, are surprisingly large numbers.
Here’s how to understand where the money goes and what actually reduces it.
Where the Money Goes
LLM pricing is straightforward in theory: you pay per token, split between input (the prompt) and output (the completion). In practice, most teams underestimate how their input tokens accumulate.
Consider a customer support chatbot:
- System prompt: 800 tokens (your instructions)
- Conversation history: 1,200 tokens (last 6 exchanges)
- User message: 50 tokens
- Total input: 2,050 tokens — per request
If you’re running on a model priced at $3/million input tokens, and you handle 100,000 support sessions per month with an average of 5 exchanges each, your input cost alone is:
2,050 tokens × 5 exchanges × 100,000 sessions = 1,025,000,000 tokens
1,025,000,000 / 1,000,000 × $3 = $3,075 just for input
That’s before you count the output tokens. Output is typically priced higher than input, often 3-5x.
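If you want to sanity-check numbers like these before shipping, the arithmetic fits in a small helper. A back-of-the-envelope sketch; the 300-token output figure and the $12/million output price below are assumptions for illustration, not quoted rates:
def estimate_monthly_cost(
    input_tokens: int,
    output_tokens: int,
    requests_per_session: int,
    sessions_per_month: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Monthly spend from per-request token counts and per-million-token prices."""
    requests = requests_per_session * sessions_per_month
    input_cost = requests * input_tokens / 1_000_000 * input_price_per_m
    output_cost = requests * output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost

# The support bot above: 2,050 input and ~300 output tokens per exchange (assumed),
# 5 exchanges per session, 100,000 sessions, $3/M input, $12/M output.
print(estimate_monthly_cost(2_050, 300, 5, 100_000, 3.0, 12.0))  # 4875.0 = $3,075 input + $1,800 output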
The three biggest cost drivers in most production AI apps:
- Long system prompts repeated on every request
- Large conversation histories included in every turn
- More expensive models used where cheaper ones would suffice
Model Selection Is Your Biggest Lever
Not every task needs the most capable model. The cost difference between model tiers is often 10-20x, and for a large fraction of production use cases, the cheaper model produces outputs that are good enough.
| Tier | Example | Use When |
|---|---|---|
| Small (fast, cheap) | GPT-4o mini, Gemini Flash, Haiku | Classification, extraction, short answers, routing |
| Mid-tier | GPT-4o, Gemini Pro, Sonnet | General Q&A, summarization, code completion |
| Large (expensive) | GPT-5, Gemini Ultra, Opus | Complex reasoning, multi-step tasks, high-stakes outputs |
The practical approach: run your existing prompts against a cheaper model and measure output quality on a sample of real inputs. For many classification and extraction tasks, you’ll find the cheaper model performs within 5% on your metrics while costing 90% less.
async def route_to_model(user_intent: str, complexity_score: float) -> str:
"""Route requests to appropriate model based on complexity."""
if complexity_score < 0.3:
return "gpt-4o-mini" # Simple Q&A, extraction
elif complexity_score < 0.7:
return "gpt-4o" # General tasks
else:
return "gpt-4-turbo" # Complex reasoning
Intent classification itself can run on a cheap model. You pay $0.002 to route a request to the right $0.02 model instead of the $0.10 one.
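Where the complexity score comes from varies by product, but one common pattern is to ask a small model for it directly. A minimal sketch, assuming an AsyncOpenAI-style client; the scoring prompt and fallback value are illustrative, not a prescribed recipe:
async def score_complexity(client, user_message: str) -> float:
    """Ask a cheap model for a 0.0-1.0 complexity score to feed route_to_model."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rate the complexity of the user's request from 0.0 (simple lookup) "
                    "to 1.0 (multi-step reasoning). Reply with only the number."
                ),
            },
            {"role": "user", "content": user_message},
        ],
        max_tokens=5,
        temperature=0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 1.0  # Unparseable score: route to the most capable model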
Prompt Compression
System prompts grow over time. Every edge case your team encountered became another paragraph. Every quirk you discovered got documented inline. After six months, many production system prompts are 3-4x longer than they need to be.
Compress them:
# Before: 900 tokens
system_prompt = """
You are a helpful customer support assistant for Acme Corp. Your job is
to help users with questions about their orders, shipping, and returns.
You should always be polite and professional. If a user asks about
something unrelated to orders, shipping, or returns, you should let
them know that you can only help with those topics. Never share
information about other customers. Always verify order numbers before
discussing specific order details. If the user seems upset, acknowledge
their frustration before providing information...
[continues for 800 more tokens]
"""
# After: 200 tokens
system_prompt = """
You are Acme support. Help with: orders, shipping, returns only.
Rules: verify order# before discussing details; never share other customer data;
acknowledge frustration before giving info; decline unrelated topics politely.
"""
The compressed version cuts the system-prompt cost by 78% on every request. Test it against the full prompt on 50 real support queries. If quality is equivalent, ship it.
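One way to run that comparison, assuming an AsyncOpenAI-style client and a list of real queries pulled from your logs (grading the pairs is left to a human reviewer or an eval script):
async def compare_prompts(client, model: str, full_prompt: str, short_prompt: str, queries: list) -> list:
    """Generate answers under both prompts so output quality can be reviewed side by side."""
    pairs = []
    for query in queries:
        answers = {}
        for label, prompt in (("full", full_prompt), ("compressed", short_prompt)):
            resp = await client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": prompt},
                    {"role": "user", "content": query},
                ],
            )
            answers[label] = resp.choices[0].message.content
        pairs.append({"query": query, **answers})
    return pairs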
For structured tasks (extraction, classification, translation), you can often reduce prompts to a few short sentences without hurting output quality.
Conversation History Management
Conversation history is expensive because every exchange adds tokens to every subsequent request. By the tenth turn of a conversation you’re paying for roughly ten turns’ worth of input on a single request, and total input cost grows roughly quadratically with conversation length, because you resend the entire history each time.
Three options:
Fixed window: Keep the last N exchanges. Simple to implement. Works for most support bots.
MAX_HISTORY_TURNS = 6
def build_messages(history: list, new_message: str) -> list:
recent_history = history[-MAX_HISTORY_TURNS * 2:] # Pairs of user/assistant
return [
{"role": "system", "content": system_prompt},
*recent_history,
{"role": "user", "content": new_message}
]
Summarization: When history exceeds a threshold, compress earlier exchanges into a summary. More expensive upfront, cheaper over long conversations.
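A minimal sketch of that pattern, reusing MAX_HISTORY_TURNS and system_prompt from above and assuming an AsyncOpenAI-style client; the turn threshold and summarization prompt are illustrative:
SUMMARIZE_AFTER_TURNS = 12  # Assumed threshold; tune for your product

async def build_messages_with_summary(client, history: list, new_message: str) -> list:
    """Fixed window for recent turns, plus a single summary message covering older ones."""
    recent = history[-MAX_HISTORY_TURNS * 2:]
    older = history[:-MAX_HISTORY_TURNS * 2]
    messages = [{"role": "system", "content": system_prompt}]
    if older and len(history) > SUMMARIZE_AFTER_TURNS * 2:
        summary = await client.chat.completions.create(
            model="gpt-4o-mini",  # Summarize with a cheap model
            messages=[
                {"role": "system", "content": "Summarize this support conversation in under 100 words. Keep order numbers and unresolved issues."},
                {"role": "user", "content": "\n".join(f"{m['role']}: {m['content']}" for m in older)},
            ],
            max_tokens=150,
        )
        messages.append({
            "role": "system",
            "content": "Summary of earlier conversation: " + summary.choices[0].message.content,
        })
    return messages + recent + [{"role": "user", "content": new_message}]
In practice you’d cache the summary and regenerate it only every few turns; re-summarizing on every request erodes the savings.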
Selective retention: Keep only exchanges relevant to the current topic. Requires extracting intent from each message — adds latency, but cuts cost for long sessions.
For most applications, a fixed window of 6-8 turns is the right starting point. Run it for a week, check whether truncated history causes support failures, and adjust from there.
Output Length Control
Output tokens cost more than input tokens. Long outputs are expensive. For tasks where output length matters less than content, constrain it:
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
max_tokens=500, # Set an explicit limit
temperature=0.3
)
# Also: end your prompts with length guidance
system_prompt += "\nRespond in 3 sentences or fewer."
The max_tokens limit prevents pathological cases where the model generates a 2,000-word response to a simple question. Instruction-based length limits (in the prompt) reduce typical output length but don’t cap it — use both.
Caching for Repeated Requests
If your application receives similar queries repeatedly, exact-match caching cuts cost to near zero for those requests:
import hashlib
import json  # Needed to serialize cached responses below
def cache_key(model: str, messages: list) -> str:
content = f"{model}:{str(messages)}"
return hashlib.sha256(content.encode()).hexdigest()
async def cached_llm_call(model: str, messages: list) -> dict:
key = cache_key(model, messages)
cached = await redis_client.get(key)
if cached:
return json.loads(cached)
response = await call_llm(model, messages)
# Cache for 24 hours for stable content (docs Q&A, etc.)
await redis_client.setex(key, 86400, json.dumps(response))
return response
Exact-match caching works well for FAQ bots, documentation assistants, and any context where the same question gets asked repeatedly. It doesn’t help with unique user queries.
Tracking Costs in Production
You can’t reduce what you don’t measure. At minimum, log token usage and estimated cost for every LLM call:
def log_llm_usage(model: str, usage: dict, metadata: dict = None):
# Rough cost estimates — update when pricing changes
pricing = {
"gpt-4o": {"input": 0.0000025, "output": 0.00001},
"gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
}
price = pricing.get(model, {"input": 0.000003, "output": 0.000012})
cost = (
usage["prompt_tokens"] * price["input"] +
usage["completion_tokens"] * price["output"]
)
print(json.dumps({
"event": "llm_call",
"model": model,
"prompt_tokens": usage["prompt_tokens"],
"completion_tokens": usage["completion_tokens"],
"cost_usd": round(cost, 6),
**(metadata or {})
}))
With this in place, query your logs to find which features or endpoints are driving the most cost. Almost always, 20% of request types drive 80% of spend, and the culprit is either a long system prompt, an unbounded conversation history, or the wrong model tier.
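Finding that 20% is straightforward once the logs exist. A minimal aggregation sketch, assuming the JSON lines above are collected into a file and each call carried a feature tag in its metadata (both assumptions about your setup):
import json
from collections import defaultdict

def cost_by_feature(log_path: str) -> list:
    """Sum logged cost_usd per feature tag, most expensive first."""
    totals = defaultdict(float)
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("event") == "llm_call":
                totals[event.get("feature", "unknown")] += event["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)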
Where to Start
If you’re getting unexpected API bills and don’t know where to start:
- Add cost logging to every LLM call this week. One day of data will show you which endpoint is most expensive.
- Check your system prompt length. If it’s over 500 tokens, compress it and test.
- Check your conversation history management. If you’re sending the full history on every turn, cap it at 6 turns.
- Compare your current model against the tier below on a sample of real inputs. The quality difference is often smaller than expected.
These four changes alone tend to cut most teams’ LLM bills by 40-60% without changes that users notice.