AI API bills usually get painful for boring reasons: the same policy text gets sent again and again, offline jobs run through live endpoints, retry loops repeat bad requests, and easy classification rides the same route as hard reasoning. You can cut a lot of spend before touching the quality of the user-visible answer.
Last verified: 2026-04-23. Provider pricing, cache behavior, limits, and model availability change often; verify the source pages before quoting figures in a contract or cost plan.
The practical levers are simple: measure cost per successful task, cache stable context, retrieve less irrelevant text, batch work when users are not waiting, route by difficulty, retry only with a better request, and count validation failures as real cost.
Key takeaways
- Cut waste before downgrading the model behind a user-facing answer.
- Compare routes by successful outputs, not by raw token price.
- Move volatile provider limits and discounts into your cost model, not into hard-coded assumptions.
- Keep quality guardrails visible so savings do not become worse answers with a smaller invoice.
Define response quality before cutting cost
"Without making responses worse" has to mean something measurable. For each workflow, set guardrails before changing the route: eval pass rate should stay flat or improve, factual claims should match retrieved evidence, required JSON fields and enums should validate, synchronous p95 latency should not regress, and escalation rate or CSAT should not move the wrong way. A cheaper route that creates more support tickets is not cheaper.
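One way to make those guardrails checkable is to compare a candidate route against the current baseline before it ships. The sketch below is illustrative only: the `WorkflowMetrics` fields, the `guardrail_violations` helper, and the `tolerance` parameter are hypothetical names, and the metrics you track may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowMetrics:
    eval_pass_rate: float      # fraction of eval cases passed, 0..1
    schema_valid_rate: float   # fraction of outputs that pass schema validation
    p95_latency_ms: float      # synchronous p95 latency
    escalation_rate: float     # fraction of conversations escalated


def guardrail_violations(baseline: WorkflowMetrics,
                         candidate: WorkflowMetrics,
                         tolerance: float = 0.0) -> list[str]:
    """Return the names of the guardrails the candidate route breaks."""
    violations = []
    if candidate.eval_pass_rate < baseline.eval_pass_rate - tolerance:
        violations.append("eval_pass_rate")
    if candidate.schema_valid_rate < baseline.schema_valid_rate - tolerance:
        violations.append("schema_valid_rate")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * (1 + tolerance):
        violations.append("p95_latency_ms")
    if candidate.escalation_rate > baseline.escalation_rate + tolerance:
        violations.append("escalation_rate")
    return violations
```

An empty list means the cheaper route is allowed to proceed; anything else names the metric that regressed.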
Start with waste, not quality cuts
1. Keep a cost ledger by workflow
Begin with a usage ledger, not a model downgrade. Log model, endpoint, input tokens, cached input tokens, output tokens, tool calls, retry count, validation status, and request class for every production call. Then measure cost per successful task: total billed AI cost divided by outputs that pass your schema, factuality checks, and product eval.
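As a minimal sketch of that metric, assume each logged call has been reduced to a workflow label, a task id, a billed cost, and a validation result. The `CallRecord` shape and function name are hypothetical; a real ledger would also carry the model, endpoint, token, cache, and retry fields listed above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    workflow: str
    task_id: str
    billed_cost_usd: float
    passed_validation: bool


def cost_per_successful_task(records):
    """Map each workflow to total billed cost / number of tasks with a passing output."""
    cost = defaultdict(float)
    succeeded = defaultdict(set)
    for r in records:
        cost[r.workflow] += r.billed_cost_usd   # failed calls and retries still count
        if r.passed_validation:
            succeeded[r.workflow].add(r.task_id)
    return {
        wf: (total / len(succeeded[wf]) if succeeded[wf] else float("inf"))
        for wf, total in cost.items()
    }
```

Counting distinct task ids rather than calls keeps a retried task from being credited twice, and a workflow with spend but no passing outputs shows up as infinite cost instead of disappearing from the report.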
2. Remove static boilerplate from live prompts
Version long policy text, tool descriptions, and examples instead of pasting them into every request as fresh context. Cache stable prefixes where the provider supports it, especially repeated system instructions, policy preambles, and tool schemas.
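A small sketch of the idea, assuming the provider matches cached content on an exact leading prefix (confirm this in the provider's caching documentation). The `build_prompt` helper, the `TOOLS:` and `TICKET:` markers, and the 12-character version tag are all hypothetical.

```python
import hashlib
import json


def build_prompt(policy_preamble: str, tool_schemas: list[dict], ticket_text: str):
    """Return (prefix, suffix, prefix_version) with all stable content in the prefix."""
    # Serialize tool schemas deterministically so the prefix is byte-identical
    # across calls; key-order or whitespace drift would defeat prefix matching.
    tools_blob = json.dumps(tool_schemas, sort_keys=True, separators=(",", ":"))
    prefix = f"{policy_preamble.strip()}\n\nTOOLS:\n{tools_blob}\n"
    # Volatile, per-request content goes last so it never changes the prefix.
    suffix = f"\nTICKET:\n{ticket_text.strip()}\n"
    # A short hash works as a version label for logs and cache-hit analysis.
    prefix_version = hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
    return prefix, suffix, prefix_version
```

Logging `prefix_version` next to cached-token counts makes it easier to see whether a drop in cache hits followed a policy or tool-schema change.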
3. Retrieve the needed evidence instead of attaching the whole corpus
A policy assistant should send the passages it cites, not a whole handbook on every turn. This often improves answers because the model sees less irrelevant text and the product pays for fewer distracting tokens.
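The sketch below illustrates the shape of that step with a deliberately crude stand-in: it ranks passages by word overlap with the question and stops at a token budget. A production system would more likely use an embedding or search index; `select_passages` and the one-token-per-word estimate are assumptions made for illustration.

```python
import re


def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_passages(question: str, passages: list[str], token_budget: int) -> list[str]:
    """Pick the highest-overlap passages that fit within a rough token budget."""
    q = _terms(question)
    ranked = sorted(passages, key=lambda p: len(q & _terms(p)), reverse=True)
    chosen, used = [], 0
    for p in ranked:
        if not q & _terms(p):
            break                    # everything after this shares no terms with the question
        cost = len(p.split())        # crude proxy: one token per whitespace-separated word
        if used + cost <= token_budget:
            chosen.append(p)
            used += cost
    return chosen
```

The important property is the hard budget: whatever ranking method you use, the prompt carries only the passages that earn their tokens.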
4. Batch non-urgent work
Batch is a waste-removal tool when the user is not waiting. Eval runs, enrichment jobs, embeddings, moderation backfills, product-feed rewrites, and support-ticket tagging usually need throughput and valid rows more than live latency.
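As a rough sketch, the split can be as simple as a flag on each job plus a JSONL writer for the deferred rows. The `user_waiting`, `custom_id`, and `input` field names are hypothetical; each provider defines its own batch file schema, so map these to the documented format before submitting.

```python
import json


def split_by_urgency(jobs: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate jobs a user is waiting on from jobs that can go to a batch route."""
    live = [j for j in jobs if j["user_waiting"]]
    deferred = [j for j in jobs if not j["user_waiting"]]
    return live, deferred


def to_jsonl(deferred: list[dict]) -> str:
    """Serialize deferred jobs as one JSON object per line, keyed by a stable id
    so results and failed rows can be matched back to their inputs."""
    lines = [
        json.dumps({"custom_id": j["id"], "input": j["text"]}, sort_keys=True)
        for j in deferred
    ]
    return "\n".join(lines) + ("\n" if lines else "")
```

Keeping a stable id on every row is what makes an error queue workable later: failed rows can be retried or inspected without rerunning the whole file.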
| Lever | When it changes the decision | Current facts to verify |
|---|---|---|
| Cached input | Stable prefixes repeat across many calls | OpenAI lists separate cached-input pricing; Anthropic lists cache-write premiums and 0.1x cache reads; Vertex AI says that implicit caching can discount cached tokens and that cache and batch discounts do not stack.[1][2][3] |
| Batch discounts | The job can wait for provider turnaround | OpenAI, Vertex AI, and Azure document 50% batch discounts for eligible work; OpenAI and Azure target a 24-hour turnaround, while Vertex AI notes possible queue expiration and no service-level objective.[3][4][6] |
| Batch limits and plumbing | You need to size files, queues, and storage | OpenAI, Anthropic, and Vertex AI publish request and file-size limits; Bedrock runs asynchronous JSONL jobs through S3 inputs and outputs.[3][4][5][7] |
Route by task difficulty
A smaller model tier may be enough for classification, formatting, tagging, extraction, and short rewrites. Reserve stronger routes for ambiguous tool use, long synthesis, coding tasks, regulated customer-facing answers, and requests that fail cheaper routes.
Use public benchmarks as weak routing signals, not policy by themselves. Preference arenas can inform conversational feel, coding benchmarks can inform repair routes, and knowledge benchmarks can narrow candidates for research-heavy workflows; your own prompts, schemas, documents, latency needs, and failure costs decide production routing.[8][9][10][11][12]
For a practical shortlist, compare model pricing per million input and output tokens, context window sizes, modalities, and relevant benchmark scores beside your logs. Use your real prompt and output sizes, not a vendor example.
| Workflow | Default route | Promote when | Cost lever |
|---|---|---|---|
| Ticket classification, sentiment, tags | Cheapest tier that passes your labeled eval | Low confidence, missing label, or schema failure | Structured output, small prompt, batch for backfills |
| Customer answer with retrieved policy | Mid-tier model with retrieval and cached policy prefix | Conflicting sources, high-value account, or compliance language | Retrieve cited passages only; cache stable policy text |
| Bulk content rewrite | Batch endpoint when no user is waiting | Brand or legal review flags the item | Provider batch route with an error queue |
| Code repair or agent task | Model chosen by repository eval, not a general chat leaderboard | Patch fails tests, tool plan is unclear, or task spans many files | Internal coding evals and a retry cap |
| Extraction into JSON | Model with schema support and low output cap | Required field is absent or enum is invalid | Schema tools, function calls, and short outputs |
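The routing table above can be expressed as a default tier per workflow plus a promotion rule. This is a sketch under stated assumptions: the tier names (`small`, `mid`, `strong`), the `DEFAULT_TIER` mapping, and the 0.8 confidence threshold are placeholders that your own evals should set, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    workflow: str                # e.g. "classification", "customer_answer"
    confidence: float = 1.0      # score from a previous cheaper attempt, if any
    schema_valid: bool = True    # did the previous cheaper attempt validate?
    high_risk: bool = False      # compliance language, high-value account, etc.


# Placeholder defaults; replace with the cheapest tier that passes each eval.
DEFAULT_TIER = {
    "classification": "small",
    "extraction": "small",
    "customer_answer": "mid",
    "code_repair": "strong",
}

PROMOTE = {"small": "mid", "mid": "strong", "strong": "strong"}


def choose_tier(req: Request, min_confidence: float = 0.8) -> str:
    """Start from the workflow default and promote one tier when a trigger fires."""
    tier = DEFAULT_TIER.get(req.workflow, "mid")
    needs_promotion = (
        req.high_risk or not req.schema_valid or req.confidence < min_confidence
    )
    return PROMOTE[tier] if needs_promotion else tier
```

Keeping the triggers explicit makes promotion rates easy to log, which in turn shows whether a default tier is set too low for its workflow.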
Worked example: support workflow
Suppose a support product handles 100,000 tickets a month. The old route sends a 4,000-token policy block, 1,200 tokens of ticket context, and a 500-token draft, or 5,700 tokens per attempt. Eight percent fail validation and retry once with the same prompt. That is about 615.6 million prompt-and-output token units: 100,000 x 1.08 x 5,700.
After the routing pass, retrieval sends 900 tokens of cited policy instead of the full block, a 900-token instruction and tool prefix is cached where supported, drafts are capped at 300 output tokens, and only disputed refunds, legal threats, and enterprise accounts promote. Validation failures fall to 3% because the prompt now includes the exact policy evidence.
The live draft route now carries about 288.4 million token units before cache discounts: 100,000 x 1.03 x 2,800. The visible response did not get thinner; the model sees cleaner evidence and the product pays for fewer repeated and invalid tokens. If 30,000 nightly relabeling tasks move to a batch route with a documented 50% discount, that background line is cut in half without changing chat latency.[3][4][6]
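The arithmetic in this example can be reproduced with a few lines. The sketch takes the per-attempt totals of 5,700 and 2,800 tokens from the text as given and applies no cache or batch discounts, since those depend on current provider pricing; `monthly_token_units` is a hypothetical helper name.

```python
def monthly_token_units(tickets: int, retry_rate: float, tokens_per_attempt: int) -> float:
    """Tickets x (1 + share of tickets retried once) x tokens per attempt."""
    return tickets * (1 + retry_rate) * tokens_per_attempt


before = monthly_token_units(100_000, 0.08, 4_000 + 1_200 + 500)
after = monthly_token_units(100_000, 0.03, 2_800)
reduction = 1 - after / before

print(f"before: {before / 1e6:.1f}M, after: {after / 1e6:.1f}M, reduction: {reduction:.1%}")
# prints: before: 615.6M, after: 288.4M, reduction: 53.2%
```

Token units are not dollars: input, cached input, and output tokens are usually priced differently, so convert each component with current prices before quoting a savings figure.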
Validate before retrying blindly
5. Retry only after changing the request
Retries are expensive when the product repeats the same flawed request. If the same prompt failed once, the next call should add missing evidence, tighten the schema, lower ambiguity, raise the output cap, or use a different route. Provider schema and tool features can reduce retry loops when the task needs JSON or structured tool inputs.[13][14][15]
6. Set a retry budget by risk
For low-value background jobs, allow one repair attempt and then write the row to an error queue. For customer-facing answers, promote the request instead of looping. If the second output fails the same validation, stop and record the failure class.
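Steps 5 and 6 can be combined into one small control loop. The sketch below is one possible reading, not a prescribed implementation: `run_with_budget`, the `RETRY_BUDGET` values, and the `call_model`, `validate`, and `repair` callables are hypothetical, and treating customer-facing requests as having zero repair attempts is an assumption.

```python
from typing import Callable, Optional

# Repair attempts allowed per risk class (assumed values).
RETRY_BUDGET = {"background": 1, "customer_facing": 0}


def run_with_budget(
    request: dict,
    risk_class: str,
    call_model: Callable[[dict], dict],
    validate: Callable[[dict], Optional[str]],   # returns a failure class or None
    repair: Callable[[dict, str], dict],         # should return a changed request
) -> dict:
    attempts = 0
    last_failure = None
    while True:
        output = call_model(request)
        attempts += 1
        failure = validate(output)
        if failure is None:
            return {"status": "ok", "output": output, "attempts": attempts}
        out_of_budget = attempts > RETRY_BUDGET[risk_class]
        if out_of_budget or failure == last_failure:
            # Customer-facing requests promote instead of looping; background
            # rows go to an error queue with the failure class recorded.
            status = "promote" if risk_class == "customer_facing" else "error_queue"
            return {"status": status, "failure_class": failure, "attempts": attempts}
        last_failure = failure
        request = repair(request, failure)   # never resend the same request
```

The loop stops on a repeated failure class even when budget remains, so a larger budget cannot turn into the same flawed request sent several times.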
7. Track validation failures as a cost metric
A route with cheap tokens but many invalid outputs may cost more per successful task than a stronger model with fewer retries. Do not call the optimization finished until cost per successful task falls, synchronous p95 latency holds for user-facing routes, and eval pass rate stays flat or improves.
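A back-of-the-envelope model shows how this can happen. It assumes every attempt costs the same and fails independently at a fixed rate, in which case the expected cost per success is the per-attempt cost divided by the success rate. The prices and failure rates below are invented for illustration and are not provider figures.

```python
def cost_per_success(cost_per_attempt: float, failure_rate: float) -> float:
    """Expected spend per valid output when failed attempts are simply retried."""
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return cost_per_attempt / (1 - failure_rate)


# Illustrative numbers only.
cheap = cost_per_success(cost_per_attempt=0.010, failure_rate=0.40)
strong = cost_per_success(cost_per_attempt=0.015, failure_rate=0.02)

print(f"cheap route:  ${cheap:.4f} per valid output")
print(f"strong route: ${strong:.4f} per valid output")
```

With these made-up inputs the route with the higher per-attempt price comes out cheaper per valid output, which is why the comparison has to be run on your own measured failure rates.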
FAQ
Should we always start with the cheapest model? No. Start with the cheapest route that passes the workflow eval. A cheap model that fails extraction, cites the wrong policy, or needs repeated retries is not cheap once you measure cost per successful task.
When should batch be the default? Batch should be the default when the user is not waiting and the job can tolerate the provider completion window. Evals, offline labeling, embeddings, enrichment, and backfills are usually better candidates than chat replies because latency is less important than throughput and valid output.
Do public benchmark scores decide the route? No. Benchmarks help narrow candidates, but production routing should be based on your prompts, schemas, retrieved documents, latency needs, and failure costs. A model that wins a general preference test may still be the wrong choice for your extraction or code-repair workflow.
What is the first change to make tomorrow? Add model, endpoint, token, cache, retry, validation, and workflow labels to your logs. Without that ledger, model switching is guesswork and savings are hard to separate from hidden quality loss.
Sources
1. OpenAI pricing: https://platform.openai.com/docs/pricing/
2. Anthropic prompt caching: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
3. Google Vertex AI Gemini batch and caching behavior: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
4. OpenAI Batch API: https://platform.openai.com/docs/guides/batch
5. Anthropic Message Batches: https://docs.anthropic.com/en/api/creating-message-batches
6. Azure OpenAI batch: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/batch
7. Amazon Bedrock batch inference: https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html
8. LMArena leaderboard: https://lmarena.ai/leaderboard/
9. SWE-bench Verified: https://www.swebench.com/verified.html
10. GPQA benchmark paper: https://arxiv.org/abs/2311.12022
11. MMLU benchmark paper: https://arxiv.org/abs/2009.03300
12. HumanEval benchmark: https://github.com/openai/human-eval
13. OpenAI Structured Outputs: https://platform.openai.com/docs/guides/structured-outputs
14. OpenAI function calling: https://platform.openai.com/docs/guides/function-calling
15. Anthropic tool use: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use