Provider pricing, cache behavior, and batch discounts change. Treat the examples below as current cost-engineering patterns and verify live vendor terms before changing production routing.
Most teams try to cut AI cost by switching models. That can help, but it is rarely the fastest cost lever. In many production systems, the bigger savings come from changing how you use the model you already chose: cache stable prompt prefixes, batch anything non-urgent, cap output aggressively, route simple cases to cheaper paths, and stop paying for tokens users never read.
That is why cost engineering matters. The real question is not just which model looks cheapest on a pricing page. It is whether your application is repeatedly paying full price for the same context, generating twice as much text as the user needs, or sending routine traffic to a high-cost model when a lower-cost model could have finished the job. Those are usage-design problems, not procurement problems.
Plain-English orientation
- Prompt caching means reusing the stable part of a prompt so repeated requests can be billed at a lower cached-input rate.
- Batch pricing means sending non-urgent work through an asynchronous API path that can be cheaper than live requests.
- Output tokens are the tokens the model generates, and they often cost more than input tokens.
- This article is for founders, product leads, and engineers running AI features that already have meaningful usage or soon will.
Key takeaways
- Output tokens often dominate spend because output rates are commonly several times higher than input rates.
- Prompt caching works best when the expensive part of the prompt stays stable across calls and the variable part is appended later.
- Batch processing is one of the cleanest cost levers for offline and non-urgent jobs because several major providers now discount batch traffic materially.
- Routing, truncation, and schema discipline can reduce cost without changing the product if users do not actually need long answers.
- The AI Models compare view is useful here because it lets teams compare input, cached input, output pricing, and max output side by side instead of relying on one headline token rate.
The short version
These ranges are directional, not guarantees. The right move depends on input/output mix, latency needs, quality thresholds, and how much traffic is eligible for each change.
| Lever | Best use case | Likely savings range | Tradeoff |
|---|---|---|---|
| Prompt caching | Classification, extraction, RAG, and agents with long repeated instructions or context | 10-70% on affected high-input workflows when cache hits are strong | Requires stable prompt shape and cache-hit monitoring |
| Batch pricing | Evals, backfills, enrichment, nightly summaries, and bulk moderation | Up to 50% on eligible batch traffic | Adds latency and job-management complexity |
| Output caps and schema discipline | Summaries, drafts, support replies, and JSON responses that tend to ramble | 10-40% where responses are currently verbose | Bad caps can shorten useful answers |
| Routing and escalation | Products with a mix of simple and hard requests | 10-60% when routine jobs are common | Needs tests, confidence thresholds, and fallback rules |
| Retrieval cleanup and truncation | RAG systems, long chats, and prompts with stale context | 5-30% when prompts are bloated | Over-trimming can remove context the answer needs |
Why output tokens are often the real bill
Teams usually notice prompt size first because prompts are visible in logs and prompt editors. The invoice often tells a different story. Premium and upper-mid models frequently charge far more for output than for input, which means a workflow that feels prompt-heavy during testing can become output-heavy in production.
This matters most in applications that draft emails, summarize calls, generate reports, explain code, produce long JSON payloads, or return detailed step-by-step answers. Every extra paragraph, bullet list, reasoning trace, or verbose field pushes cost upward. If users only read the first response, need a tight answer, or consume a structured object with ten fields, paying for 800 unnecessary tokens is not intelligence. It is waste.
Pricing note: The examples below are text-centric, standard API prices as of April 2026. They are not a quality ranking and exclude taxes, volume deals, priority tiers, storage, tool calls, and regional modifiers. Check vendor pages before changing production routing.[1][2][3]
| Provider example | Standard input pricing | Cached or cache-hit pricing | Output pricing | What it implies |
|---|---|---|---|---|
| OpenAI GPT-5.4[1] | $2.50 / 1M | $0.25 / 1M cached input | $15.00 / 1M | Output is 6x input, so long answers can dominate cost quickly. |
| Anthropic Claude Sonnet 4.6[2] | $3 / 1M base input | Cache read at $0.30 / 1M; 5-minute write at $3.75 / 1M | $15 / 1M | Cache reuse is powerful when the same prefix is read repeatedly. |
| Google Gemini 2.5 Pro, prompts up to 200K tokens[3] | $1.25 / 1M | $0.125 / 1M context cache plus storage | $10.00 / 1M, including thinking tokens | Output control matters because generated and thinking tokens are billed together. |
The exact rates will move, but the pattern is consistent: output is expensive enough that teams should manage it deliberately. In practice, trimming 400 output tokens from a high-volume workflow can matter far more than trimming 400 input tokens: at the GPT-5.4 rates above, 400 output tokens cost $0.006 per request while 400 input tokens cost $0.001.
Prompt caching is a cost lever, not just a latency trick
Prompt caching is most useful when your application keeps reusing the same expensive prefix: system instructions, policy blocks, product catalogs, internal playbooks, or long conversation history. If that repeated prefix can be cached, you stop paying full price for it on every request.
The economic logic is straightforward. OpenAI exposes separate cached-input pricing, so repeated prompt prefixes can be billed far below normal input rates. Anthropic’s public pricing shows cache reads at 0.1x base input for Sonnet 4.6 and cache writes above base input. Google’s Gemini pricing similarly separates context-cache pricing from standard input and notes a storage charge.[1][2][3] That is not a niche optimization. It is a practical production control when the prompt repeats often enough.
Caching only works when prompt shape is stable enough to hit. If you reorder large sections every call, interpolate volatile data into the middle of the prompt, or stuff retrieved passages ahead of the reusable instructions, you will keep busting the cache. A better pattern looks like this, with a request-assembly sketch after the list:
- Put the stable, expensive prefix first.
- Version your long system prompt instead of rewriting it casually.
- Append variable user data and retrieved context after the reusable blocks.
- Measure cache-hit rate as a first-class KPI, not as trivia in provider logs.
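Here is a minimal sketch of that shape in Python. The prompt text, function names, and usage fields are illustrative assumptions, and providers differ in how they detect and bill cached prefixes, so verify the details against your vendor's documentation.

```python
# Minimal sketch: keep the expensive, stable prefix first so repeated requests
# can be billed at the cached-input rate. Names and values are illustrative.

# Long, versioned, stable block in production; a short placeholder here.
SYSTEM_PROMPT_V3 = "You are a support summarizer. Follow policy v3: ..."

def build_messages(retrieved_passages: list[str], ticket_text: str) -> list[dict]:
    """Stable prefix first (cache-friendly), volatile context last (cache-busting)."""
    variable_block = "\n\n".join(retrieved_passages) + "\n\nTicket:\n" + ticket_text
    return [
        {"role": "system", "content": SYSTEM_PROMPT_V3},  # reused verbatim on every call
        {"role": "user", "content": variable_block},      # changes per request
    ]

def cache_hit_ratio(usage) -> float:
    """Treat cache-hit rate as a first-class KPI; usage field names vary by provider."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / max(getattr(usage, "prompt_tokens", 0), 1)
```

The design choice is simple: anything that changes per request lives after the reusable blocks, so the long prefix stays byte-identical across calls.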
Why caching helps classification more than generation
The return on prompt caching falls as the output-to-input ratio rises. Workflow A (2K input, 4K output on GPT-5.4) costs $0.065 uncached and $0.0605 fully cached, so caching saves about 7%. Workflow B (10K input, 500 output) costs $0.0325 uncached and $0.010 fully cached, a saving of roughly 69%. Caching matters enormously for classification, extraction, and RAG pipelines with high input and low output. For generation-heavy pipelines, output control may be the better first project. Measure your input/output ratio before investing engineering time in caching architecture.
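For teams that want to check this arithmetic against their own traffic, here is a small sketch using the illustrative GPT-5.4 list rates from the table above (not live pricing; substitute your own numbers):

```python
# Rates per 1M tokens from the table above (illustrative; verify before use).
INPUT, CACHED_INPUT, OUTPUT = 2.50, 0.25, 15.00

def cost(input_tok: int, output_tok: int, cached_fraction: float = 0.0) -> float:
    """Per-request cost in dollars; cached_fraction is the share of input billed at the cached rate."""
    cached = input_tok * cached_fraction
    fresh = input_tok - cached
    return (fresh * INPUT + cached * CACHED_INPUT + output_tok * OUTPUT) / 1_000_000

# Workflow A: generation-heavy, so caching barely moves the bill.
a_uncached, a_cached = cost(2_000, 4_000), cost(2_000, 4_000, cached_fraction=1.0)
# Workflow B: input-heavy classification/RAG, so caching removes most of the cost.
b_uncached, b_cached = cost(10_000, 500), cost(10_000, 500, cached_fraction=1.0)

print(f"A: ${a_uncached:.4f} -> ${a_cached:.4f} ({1 - a_cached / a_uncached:.0%} saved)")
print(f"B: ${b_uncached:.4f} -> ${b_cached:.4f} ({1 - b_cached / b_uncached:.0%} saved)")
# Roughly: A saves 7%, B saves 69%.
```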
A worked before/after from one support workflow
A real but anonymized support-summary workflow looked like this: each ticket run reused a 9,000-token policy and product context, added about 1,000 tokens of case-specific data, and generated roughly 700 output tokens. Before cleanup, the job sent the full 10,000-token input every time and allowed long narrative summaries. After cleanup, the stable 9,000-token prefix stayed first, cache-hit rate rose to 82%, and the response schema was capped around 450 output tokens.
Using GPT-5.4 list pricing as a simple model, the per-ticket model cost moved from roughly $0.0355 to about $0.0115 on cache-hit requests, or roughly $0.015 blended across the 82% hit rate, before retries and overhead.[1] That is not a universal benchmark. It is a reminder that the meaningful savings came from two small changes working together: repeated context stopped being billed as full-price input, and the response stopped carrying text the team did not use. Moving older backfill runs into batch would reduce the eligible portion further, but only for work that can wait.
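A quick sketch of that before/after math, again using the article's illustrative rates rather than live pricing:

```python
# Before/after per-ticket cost for the support-summary workflow, using the
# illustrative GPT-5.4 rates above (dollars per 1M tokens).
INPUT, CACHED_INPUT, OUTPUT = 2.50, 0.25, 15.00
PREFIX, VARIABLE, HIT_RATE = 9_000, 1_000, 0.82

before  = ((PREFIX + VARIABLE) * INPUT + 700 * OUTPUT) / 1e6                # ~$0.0355
hit     = (PREFIX * CACHED_INPUT + VARIABLE * INPUT + 450 * OUTPUT) / 1e6   # ~$0.0115
miss    = ((PREFIX + VARIABLE) * INPUT + 450 * OUTPUT) / 1e6                # ~$0.0318
blended = HIT_RATE * hit + (1 - HIT_RATE) * miss                            # ~$0.0151
```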
Batching turns slow work into cheaper work
Batch pricing is one of the cleanest ways to lower cost without touching product behavior, because many workloads do not need an answer in two seconds. Nightly summarization, backfills, classification jobs, enrichment pipelines, bulk moderation, eval runs, and report generation are good candidates for batch even if your interactive product is not.
OpenAI says its Batch API saves 50% on input and output for asynchronous jobs completed within a 24-hour window. Anthropic describes batch processing as a 50% savings path for asynchronous workloads. Google’s paid tier lists the Batch API as a 50% cost reduction, and its model tables show lower batch rates for supported models.[1][2][3]
The implication is simple: if your team is running non-urgent traffic through the same interactive path as user-facing chat, you may be paying the wrong price class for the job. Cost engineering is partly about moving work into the right queue.
| Workload | Should it stay interactive? | Better cost-engineering move |
|---|---|---|
| Nightly ticket summaries | No | Batch them and let the cheaper asynchronous path do the work. |
| Large eval suites | No | Run them in batch and compare model quality separately from latency. |
| Immediate support replies | Usually yes | Keep the user-facing turn interactive, but batch the analytics and QA around it. |
| CRM enrichment and tagging | No | Queue it for batch unless the salesperson truly needs instant results. |
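For the non-urgent rows above, most of the work is job management. Here is a minimal sketch of OpenAI's Batch API flow as one example of the asynchronous price class; check the current SDK and JSONL schema before relying on it, and note that the model name simply follows this article's running example.

```python
# Sketch: move nightly ticket summaries onto the asynchronous batch price class.
# Verify the current SDK and JSONL schema; the model name is illustrative.
import json
from openai import OpenAI

client = OpenAI()
tickets = [{"id": 101, "text": "Customer reports a login loop after password reset..."}]

def write_batch_file(rows: list[dict], path: str = "nightly_summaries.jsonl") -> str:
    """One JSONL line per request, each with its own id, body, and output cap."""
    with open(path, "w") as f:
        for t in rows:
            f.write(json.dumps({
                "custom_id": f"ticket-{t['id']}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-5.4",   # illustrative model name from this article
                    "max_tokens": 450,    # the output cap travels with the batch job
                    "messages": [
                        {"role": "system", "content": "Summarize the ticket in under 120 words."},
                        {"role": "user", "content": t["text"]},
                    ],
                },
            }) + "\n")
    return path

uploaded = client.files.create(file=open(write_batch_file(tickets), "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=uploaded.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",              # non-interactive queue, discounted rates
)
print(job.id, job.status)
```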
Output control is usually the fastest win nobody owns
Prompt engineering gets attention. Output governance often does not. That is a mistake because output is where many AI products quietly overspend. Teams ask for exhaustive answers when concise ones would do, request multiple variants when the UI shows one, or accept bloated structured output because no one has budget ownership over response length.
There are several practical ways to fix this without changing the product promise, with a small configuration sketch after the list:
- Set realistic max_output_tokens limits for each workflow instead of using one global ceiling.
- Ask for the smallest acceptable schema, not the most descriptive schema imaginable.
- Return a short answer first and expand only when the user explicitly asks.
- Remove repeated labels, verbose justifications, and decorative text from JSON or Markdown outputs.
- Separate hidden internal reasoning from user-visible output where your provider and product design allow it.
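One way to handle the first two items is a per-workflow budget table instead of a shared default. The workflow names, limits, and parameter names below are illustrative and differ between providers.

```python
# Sketch: per-workflow output budgets instead of one global ceiling.
# Workflow names and limits are illustrative; tune them against real transcripts.
OUTPUT_BUDGETS = {
    "ticket_summary":    {"max_output_tokens": 450,  "format": "plain"},
    "crm_tagging":       {"max_output_tokens": 64,   "format": "json"},  # small schema, no prose
    "draft_email":       {"max_output_tokens": 800,  "format": "plain"},
    "report_generation": {"max_output_tokens": 2000, "format": "markdown"},
}

def request_params(workflow: str) -> dict:
    """Attach the workflow's own cap rather than a shared default."""
    budget = OUTPUT_BUDGETS[workflow]
    params = {"max_tokens": budget["max_output_tokens"]}
    if budget["format"] == "json":
        # Ask for the smallest acceptable schema, not the most descriptive one.
        params["response_format"] = {"type": "json_object"}
    return params

params = request_params("crm_tagging")  # -> {"max_tokens": 64, "response_format": {...}}
```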
Google’s Gemini pricing page is a useful reminder here because it explicitly says output pricing includes thinking tokens for supported models.[3] Even when users see a short final answer, an unconstrained response path can still be commercially heavy. The broad lesson applies beyond one provider: if the model is allowed to think or speak at length, you need to know who is paying for that verbosity.
Routing and escalation are usage design, not procurement
Many teams now run a two-path system: a cheaper default model for routine work and a higher-cost model for hard cases. That can be smart, but only when the escalation rule is disciplined. If the expensive path catches too much traffic, the architecture becomes a label instead of a cost control.
A good routing strategy does not start with provider loyalty. It starts with job classes, with a small routing sketch after the list:
- Simple extraction, tagging, classification, and predictable transforms belong on a budget or default model.
- Messy reasoning, high-stakes generation, or complex coding can justify a higher-cost model.
- Escalation should be tied to confidence thresholds, validation failures, or explicit user value, not to vague discomfort.
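A minimal sketch of that discipline in code. The job classes, thresholds, model labels, and the validate() check are illustrative assumptions, not a recommended configuration.

```python
# Sketch of a two-path router with a disciplined escalation rule.
ROUTINE = {"extract", "tag", "classify", "transform"}

def route(job_type: str) -> str:
    """Default to the cheaper path; reserve the premium path for hard job classes."""
    return "default-model" if job_type in ROUTINE else "premium-model"

def validate(answer: str) -> bool:
    """Placeholder check, e.g. schema validation or required-field presence."""
    return bool(answer.strip())

def should_escalate(first_answer: str, confidence: float, threshold: float = 0.7) -> bool:
    """Escalate on measurable signals (low confidence, failed validation), not vague discomfort."""
    return confidence < threshold or not validate(first_answer)
```

The point of the threshold is auditability: if the premium path starts catching too much traffic, you can see which rule let it through.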
The goal is not to downgrade the experience. The goal is to stop using the most expensive model as the default answer to every request just because it feels safer. In many businesses, the real savings do not come from replacing one provider with another. They come from ensuring only the right share of traffic reaches the expensive path at all.
Truncation, retrieval hygiene, and prompt shape matter more than teams expect
There is a point where comparing providers becomes less valuable than cleaning up your own request design. If your RAG layer retrieves too many passages, your conversation state keeps every previous turn forever, or your prompt repeats unchanged boilerplate in the wrong place, you are manufacturing cost before the model even starts answering.
Useful cleanup steps are not glamorous, but they work; a small trimming sketch follows the list:
- Trim retrieval to the smallest set of passages that preserves answer quality.
- Deduplicate context before it enters the prompt.
- Summarize stale conversation history instead of replaying the full transcript forever.
- Cache the long, reusable context and keep volatile context outside the cached prefix.
- Match max_output and response format to what the user actually consumes.
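Here is a minimal sketch of the first two items, trimming and deduplicating retrieved passages before they become paid input. The scoring, rough token estimate, and budget are illustrative assumptions; production code should use a real tokenizer and quality checks.

```python
# Sketch: trim and deduplicate retrieved context before it becomes paid input.
def trim_context(passages: list[str], scores: list[float], token_budget: int = 2_000) -> list[str]:
    """Keep the highest-scoring unique passages that fit the input budget."""
    seen, kept, used = set(), [], 0
    for passage, _score in sorted(zip(passages, scores), key=lambda p: p[1], reverse=True):
        key = passage.strip().lower()
        if key in seen:
            continue                      # drop verbatim duplicates
        tokens = len(passage) // 4        # rough estimate; use a real tokenizer in production
        if used + tokens > token_budget:
            break
        seen.add(key)
        kept.append(passage)
        used += tokens
    return kept

kept = trim_context(
    ["Refund policy...", "Refund policy...", "Shipping SLA..."],
    [0.92, 0.92, 0.41],
)  # duplicates dropped, budget respected
```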
The point is not to make every prompt tiny. The point is to stop treating all context as equally useful. A model that looks cheap on input alone may still be expensive for a verbose workflow, and a model with attractive cached-input economics only helps if your prompt is designed around a stable prefix.
A practical cost-engineering playbook for teams trying to protect margin
If you already have a live AI workflow and do not want to redesign the product, start with a narrow operational review instead of a vendor migration project.
- Measure the current mix: average input tokens, average output tokens, cache-hit rate, retry rate, and escalation rate (see the sketch after this list).
- Sort workflows by output spend, not just total requests.
- Identify repeated prompt prefixes and redesign them for caching.
- Move non-urgent workloads into batch.
- Set workflow-specific output caps and shorter response schemas.
- Review which requests truly need the higher-cost model path.
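A minimal sketch of the first step, assuming each request log row carries token counts and outcome flags; the field names are illustrative and should map to whatever your logging already records.

```python
# Sketch: the cost review starts from numbers, not opinions.
from statistics import mean

# Illustrative log rows; in practice these come from request logging.
rows = [
    {"input_tokens": 9800,  "output_tokens": 620, "cached_tokens": 9000,
     "retried": False, "escalated": False, "was_batch": False},
    {"input_tokens": 10100, "output_tokens": 480, "cached_tokens": 0,
     "retried": True,  "escalated": True,  "was_batch": False},
]

def workflow_mix(rows: list[dict]) -> dict:
    """The per-workflow numbers that decide which lever to pull first."""
    return {
        "avg_input_tokens":  mean(r["input_tokens"] for r in rows),
        "avg_output_tokens": mean(r["output_tokens"] for r in rows),
        "cache_hit_rate":    mean(r["cached_tokens"] > 0 for r in rows),
        "retry_rate":        mean(r["retried"] for r in rows),
        "escalation_rate":   mean(r["escalated"] for r in rows),
        "batch_share":       mean(r["was_batch"] for r in rows),
    }

print(workflow_mix(rows))
```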
This sequence matters because it improves cost without forcing a new product position. You are not promising worse answers. You are removing unnecessary spend from the same user outcome.
FAQ
Can these levers be combined?
Yes, but measure them separately. Caching reduces repeated input cost, batching changes the price class for non-urgent work, output caps reduce generated tokens, and routing changes the traffic mix. Treat each one as its own experiment so you know what actually moved the bill.
How do we know an output cap is too aggressive?
Watch retry rate, regeneration clicks, manual edits, escalation rate, support complaints, and task success. If those move against you after a cap change, the lower token bill may be hiding a worse product outcome.
What metric should finance and engineering review together?
Cost per successful task is usually more useful than cost per request. Pair it with average output tokens, cache-hit rate, retry rate, escalation rate, and the share of traffic processed in batch.
Does FAQ schema still help this kind of post?
Use FAQ content only when it adds real reader value. Google’s search guidance emphasizes helpful, original content and clear pages over padding, and FAQ rich results are now limited mainly to well-known government or health sites.[4][5][6][7]
The teams that control AI cost best are usually not the teams that obsess over one pricing screenshot. They are the teams that engineer how tokens are consumed in production. Prompt caching, batching, output control, routing, and truncation are not minor implementation details. They are the controls that determine whether an AI feature can keep its unit economics under pressure.
If you want to compare those controls across models without jumping between vendor pages, AI Models puts input, cached input, output, max output, and estimator context in one place. That makes it easier to improve cost before turning every optimization discussion into a provider migration debate.
Sources
1. OpenAI API pricing: https://openai.com/api/pricing/
2. Anthropic API pricing: https://www.anthropic.com/pricing
3. Google Gemini API pricing: https://ai.google.dev/pricing
4. Google helpful content guidance: https://developers.google.com/search/docs/fundamentals/creating-helpful-content
5. Google SEO starter guide: https://developers.google.com/search/docs/fundamentals/seo-starter-guide
6. Google AI features guidance: https://developers.google.com/search/docs/appearance/ai-features
7. Google FAQ schema guidance: https://developers.google.com/search/docs/appearance/structured-data/faqpage