AI Integration · Cost Management
LLM API Costs Are Out of Control: A Production Guide to Cutting Your Bill
AI features ship fast. Then the monthly API bill arrives. Here's a systematic approach to understanding and reducing LLM costs without breaking the product.
Anurag Verma
7 min read
You build the feature. Users like it. Then the invoice lands and you spend the next two hours explaining to finance why your app spent $8,000 on an API last month.
This story repeats constantly across teams that moved fast with LLM integrations without building cost controls first. The inputs are simple: token counts and prices. The outputs, for anything with non-trivial traffic, are surprisingly large numbers.
Here’s how to understand where the money goes and what actually reduces it.
Where the Money Goes
LLM pricing is straightforward in theory: you pay per token, split between input (the prompt) and output (the completion). In practice, most teams underestimate how their input tokens accumulate.
Consider a customer support chatbot:
- System prompt: 800 tokens (your instructions)
- Conversation history: 1,200 tokens (last 6 exchanges)
- User message: 50 tokens
- Total input: 2,050 tokens — per request
If you’re running on a model priced at $3/million input tokens, and you handle 100,000 support sessions per month with an average of 5 exchanges each, your input cost alone is:
2,050 tokens × 5 exchanges × 100,000 sessions = 1,025,000,000 tokens
1,025,000,000 / 1,000,000 × $3 = $3,075 just for input
That’s before you count the output tokens. Output is typically priced higher than input, often 3-5x.
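If you want to sanity-check numbers like these before shipping, the arithmetic fits in a small helper. A back-of-the-envelope sketch; the 300-token output figure and the $12/million output price below are assumptions for illustration, not quoted rates:
def estimate_monthly_cost(
    input_tokens: int,
    output_tokens: int,
    requests_per_session: int,
    sessions_per_month: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Monthly spend from per-request token counts and per-million-token prices."""
    requests = requests_per_session * sessions_per_month
    input_cost = requests * input_tokens / 1_000_000 * input_price_per_m
    output_cost = requests * output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost

# The support bot above: 2,050 input and ~300 output tokens per exchange (assumed),
# 5 exchanges per session, 100,000 sessions, $3/M input, $12/M output.
print(estimate_monthly_cost(2_050, 300, 5, 100_000, 3.0, 12.0))  # 4875.0 = $3,075 input + $1,800 output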
The three biggest cost drivers in most production AI apps:
- Long system prompts repeated on every request
- Large conversation histories included in every turn
- More expensive models used where cheaper ones would suffice
Model Selection Is Your Biggest Lever
Not every task needs the most capable model. The cost difference between model tiers is often 10-20x, and for a large fraction of production use cases, the cheaper model produces outputs that are good enough.
| Tier | Example | Use When |
|---|---|---|
| Small (fast, cheap) | GPT-4o mini, Gemini Flash, Haiku | Classification, extraction, short answers, routing |
| Mid-tier | GPT-4o, Gemini Pro, Sonnet | General Q&A, summarization, code completion |
| Large (expensive) | GPT-5, Gemini Ultra, Opus | Complex reasoning, multi-step tasks, high-stakes outputs |
The practical approach: run your existing prompts against a cheaper model and measure output quality on a sample of real inputs. For many classification and extraction tasks, you’ll find the cheaper model performs within 5% on your metrics while costing 90% less.
async def route_to_model(user_intent: str, complexity_score: float) -> str:
"""Route requests to appropriate model based on complexity."""
if complexity_score < 0.3:
return "gpt-4o-mini" # Simple Q&A, extraction
elif complexity_score < 0.7:
return "gpt-4o" # General tasks
else:
return "gpt-4-turbo" # Complex reasoning
Intent classification itself can run on a cheap model. You pay $0.002 to route a request to the right $0.02 model instead of the $0.10 one.
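Where the complexity score comes from varies by product, but one common pattern is to ask a small model for it directly. A minimal sketch, assuming an AsyncOpenAI-style client; the scoring prompt and fallback value are illustrative, not a prescribed recipe:
async def score_complexity(client, user_message: str) -> float:
    """Ask a cheap model for a 0.0-1.0 complexity score to feed route_to_model."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rate the complexity of the user's request from 0.0 (simple lookup) "
                    "to 1.0 (multi-step reasoning). Reply with only the number."
                ),
            },
            {"role": "user", "content": user_message},
        ],
        max_tokens=5,
        temperature=0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 1.0  # Unparseable score: route to the most capable model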
Prompt Compression
System prompts grow over time. Every edge case your team encountered became another paragraph. Every quirk you discovered got documented inline. After six months, many production system prompts are 3-4x longer than they need to be.
Compress them:
# Before: 900 tokens
system_prompt = """
You are a helpful customer support assistant for Acme Corp. Your job is
to help users with questions about their orders, shipping, and returns.
You should always be polite and professional. If a user asks about
something unrelated to orders, shipping, or returns, you should let
them know that you can only help with those topics. Never share
information about other customers. Always verify order numbers before
discussing specific order details. If the user seems upset, acknowledge
their frustration before providing information...
[continues for 800 more tokens]
"""
# After: 200 tokens
system_prompt = """
You are Acme support. Help with: orders, shipping, returns only.
Rules: verify order# before discussing details; never share other customer data;
acknowledge frustration before giving info; decline unrelated topics politely.
"""
The compressed version cuts the system-prompt cost by 78% on every request. Test it against the full prompt on 50 real support queries. If quality is equivalent, ship it.
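One way to run that comparison, assuming an AsyncOpenAI-style client and a list of real queries pulled from your logs (grading the pairs is left to a human reviewer or an eval script):
async def compare_prompts(client, model: str, full_prompt: str, short_prompt: str, queries: list) -> list:
    """Generate answers under both prompts so output quality can be reviewed side by side."""
    pairs = []
    for query in queries:
        answers = {}
        for label, prompt in (("full", full_prompt), ("compressed", short_prompt)):
            resp = await client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": prompt},
                    {"role": "user", "content": query},
                ],
            )
            answers[label] = resp.choices[0].message.content
        pairs.append({"query": query, **answers})
    return pairs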
For structured tasks (extraction, classification, translation), you can often reduce prompts to a few short sentences without hurting output quality.
Conversation History Management
Conversation history is expensive because every exchange adds tokens to every subsequent request. By the tenth turn of a conversation you’re paying for roughly ten turns’ worth of input on a single request, and total input cost grows roughly quadratically with conversation length, because you resend the entire history each time.
Three options:
Fixed window: Keep the last N exchanges. Simple to implement. Works for most support bots.
MAX_HISTORY_TURNS = 6
def build_messages(history: list, new_message: str) -> list:
recent_history = history[-MAX_HISTORY_TURNS * 2:] # Pairs of user/assistant
return [
{"role": "system", "content": system_prompt},
*recent_history,
{"role": "user", "content": new_message}
]
Summarization: When history exceeds a threshold, compress earlier exchanges into a summary. More expensive upfront, cheaper over long conversations.
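A minimal sketch of that pattern, reusing MAX_HISTORY_TURNS and system_prompt from above and assuming an AsyncOpenAI-style client; the turn threshold and summarization prompt are illustrative:
SUMMARIZE_AFTER_TURNS = 12  # Assumed threshold; tune for your product

async def build_messages_with_summary(client, history: list, new_message: str) -> list:
    """Fixed window for recent turns, plus a single summary message covering older ones."""
    recent = history[-MAX_HISTORY_TURNS * 2:]
    older = history[:-MAX_HISTORY_TURNS * 2]
    messages = [{"role": "system", "content": system_prompt}]
    if older and len(history) > SUMMARIZE_AFTER_TURNS * 2:
        summary = await client.chat.completions.create(
            model="gpt-4o-mini",  # Summarize with a cheap model
            messages=[
                {"role": "system", "content": "Summarize this support conversation in under 100 words. Keep order numbers and unresolved issues."},
                {"role": "user", "content": "\n".join(f"{m['role']}: {m['content']}" for m in older)},
            ],
            max_tokens=150,
        )
        messages.append({
            "role": "system",
            "content": "Summary of earlier conversation: " + summary.choices[0].message.content,
        })
    return messages + recent + [{"role": "user", "content": new_message}]
In practice you’d cache the summary and regenerate it only every few turns; re-summarizing on every request erodes the savings.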
Selective retention: Keep only exchanges relevant to the current topic. Requires extracting intent from each message — adds latency, but cuts cost for long sessions.
For most applications, a fixed window of 6-8 turns is the right starting point. Run it for a week, check whether truncated history causes support failures, and adjust from there.
Output Length Control
Output tokens cost more than input tokens. Long outputs are expensive. For tasks where output length matters less than content, constrain it:
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
max_tokens=500, # Set an explicit limit
temperature=0.3
)
# Also: end your prompts with length guidance
system_prompt += "\nRespond in 3 sentences or fewer."
The max_tokens limit prevents pathological cases where the model generates a 2,000-word response to a simple question. Instruction-based length limits (in the prompt) reduce typical output length but don’t cap it — use both.
Caching for Repeated Requests
If your application receives similar queries repeatedly, exact-match caching cuts cost to near zero for those requests:
import hashlib
import json  # Needed to serialize cached responses below
def cache_key(model: str, messages: list) -> str:
content = f"{model}:{str(messages)}"
return hashlib.sha256(content.encode()).hexdigest()
async def cached_llm_call(model: str, messages: list) -> dict:
key = cache_key(model, messages)
cached = await redis_client.get(key)
if cached:
return json.loads(cached)
response = await call_llm(model, messages)
# Cache for 24 hours for stable content (docs Q&A, etc.)
await redis_client.setex(key, 86400, json.dumps(response))
return response
Exact-match caching works well for FAQ bots, documentation assistants, and any context where the same question gets asked repeatedly. It doesn’t help with unique user queries.
Tracking Costs in Production
You can’t reduce what you don’t measure. At minimum, log token usage and estimated cost for every LLM call:
def log_llm_usage(model: str, usage: dict, metadata: dict = None):
# Rough cost estimates — update when pricing changes
pricing = {
"gpt-4o": {"input": 0.0000025, "output": 0.00001},
"gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
}
price = pricing.get(model, {"input": 0.000003, "output": 0.000012})
cost = (
usage["prompt_tokens"] * price["input"] +
usage["completion_tokens"] * price["output"]
)
print(json.dumps({
"event": "llm_call",
"model": model,
"prompt_tokens": usage["prompt_tokens"],
"completion_tokens": usage["completion_tokens"],
"cost_usd": round(cost, 6),
**(metadata or {})
}))
With this in place, query your logs to find which features or endpoints are driving the most cost. Almost always, 20% of request types drive 80% of spend, and the culprit is either a long system prompt, an unbounded conversation history, or the wrong model tier.
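Finding that 20% is straightforward once the logs exist. A minimal aggregation sketch, assuming the JSON lines above are collected into a file and each call carried a feature tag in its metadata (both assumptions about your setup):
import json
from collections import defaultdict

def cost_by_feature(log_path: str) -> list:
    """Sum logged cost_usd per feature tag, most expensive first."""
    totals = defaultdict(float)
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("event") == "llm_call":
                totals[event.get("feature", "unknown")] += event["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)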
Where to Start
If you’re getting unexpected API bills and don’t know where to start:
- Add cost logging to every LLM call this week. One day of data will show you which endpoint is most expensive.
- Check your system prompt length. If it’s over 500 tokens, compress it and test.
- Check your conversation history management. If you’re sending the full history on every turn, cap it at 6 turns.
- Compare your current model against the tier below on a sample of real inputs. The quality difference is often smaller than expected.
These four changes alone tend to cut most teams’ LLM bills by 40-60% without changes that users notice.