Audit LLM Prompts to Reduce Hidden Token Costs

For AI engineers maintaining production prompts, a prompt template teardown is a cost-control pass before a route scales. The payoff is simple: find the repeated text every request sends, keep only the parts that change model behavior, and move stable blocks where caching or batch processing can actually help.

Short version: a prompt template teardown breaks a prompt into measurable blocks (static instructions, tools, examples, retrieved context, and output rules). The hidden cost is repeated input: a few thousand fixed tokens multiplied by thousands of calls. Run one before changing model tiers, after adding tools or retrieval, and whenever a route’s token bill grows faster than traffic.

Provider snapshot, 2026-04-23: pricing, caching, batch, and model availability change frequently, so verify the source docs before quoting numbers in a contract, RFP, or cost plan. OpenAI prompt caching depends on exact prefix matches, a 1,024-token minimum prompt length, the optional prompt_cache_key parameter, and in-memory or 24-hour retention on supported models.[1] Anthropic caching is enabled with cache_control, defaults to a 5-minute cache, offers a 1-hour TTL at additional cost, and builds the cached prefix in the order tools, system, then messages.[2] OpenAI, Anthropic, and Vertex AI batch lanes can reduce cost when the job can wait, but they are scheduling choices to make after the prompt has been cleaned up.[4][5][6]

The hidden cost is usually not one bad sentence. It is repeated input. If a support-ticket classifier carries 3,200 static input tokens and runs 10,000 requests, the static template alone accounts for 32,000,000 input tokens before the user text, retrieved context, or answer appears. That is why model comparison and prompt teardown belong in the same review.
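The arithmetic behind that figure is small enough to keep in a helper. This is a minimal sketch; the function name is illustrative, and the inputs are the article's own example numbers rather than measurements from any provider.

```python
def static_token_load(static_tokens_per_request: int, requests: int) -> int:
    """Input tokens spent on the fixed template alone, before any user text."""
    return static_tokens_per_request * requests


# Support-ticket classifier from the paragraph above.
total = static_token_load(3_200, 10_000)
print(f"{total:,} static input tokens")  # 32,000,000 static input tokens
```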

Audit LLM Prompt Blocks

Start by splitting the template into blocks you can measure: permanent instructions, workflow-specific instructions, tool definitions, examples, retrieved context, and output contract. Label each block with two fields: how often it changes and whether the provider can cache or batch it without changing behavior.

Prompt block | What to measure | Teardown question
--- | --- | ---
Permanent instructions | Static input tokens per request | Can this move to the beginning of the prompt so cache systems can see an identical prefix?
Tool definitions and schemas | Token count for every declared tool, including descriptions and parameter schemas | Does this route need every tool, or only the one or two tools it can call?
Few-shot examples | Tokens per example and eval cases each example improves | Is this example teaching a real failure mode, or just restating the output format?
Retrieved context | Chunks, characters, and tokens injected per call | Can retrieval return a narrower passage, a structured field, or a citation target instead of a full document?
Output contract | Expected output tokens and required fields | Does the downstream parser need a full explanation, or only a compact JSON object?
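One way to hold that inventory is a small record per block carrying the two labels described above. The sketch below is an assumption about how such a worksheet might look in code: the class name, the `changes` labels, and the token counts are illustrative round numbers, and real counts should come from the provider's tokenizer or usage reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptBlock:
    name: str
    tokens: int      # measured elsewhere; these values are illustrative
    changes: str     # hypothetical labels: "never", "per_release", "per_request"
    cacheable: bool  # can the provider cache it without changing behavior?


def summarize(blocks):
    """Total tokens, tokens that do not vary per request, and their share."""
    total = sum(b.tokens for b in blocks)
    static = sum(b.tokens for b in blocks if b.changes != "per_request")
    return {"total": total, "static": static, "static_share": static / total}


inventory = [
    PromptBlock("permanent_instructions", 650, "never", True),
    PromptBlock("tool_schemas", 900, "per_release", True),
    PromptBlock("few_shot_examples", 1_200, "per_release", True),
    PromptBlock("retrieved_context", 800, "per_request", False),
    PromptBlock("output_contract", 450, "never", True),
    PromptBlock("user_text", 500, "per_request", False),
]

summary = summarize(inventory)
print(summary["total"], summary["static"], round(summary["static_share"], 2))
# 4500 3200 0.71
```

A high static share is not a problem by itself; it tells you how much of the bill depends on ordering and caching rather than on the user's input.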

Caching only helps when the repeated part is actually repeatable. Put static instructions, examples, tool definitions, and structured output schema before variable user text, retrieved passages, timestamps, request IDs, and account names. If a changing field sits before the cacheable prefix, the repeated block stops behaving like a reusable asset.
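The ordering rule can be checked mechanically. The sketch below is provider-agnostic and makes no API calls; the `Block` type and both helper names are hypothetical, and `static` simply means the text is byte-identical across requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    text: str
    static: bool  # True when the text is byte-identical across requests


def order_for_caching(blocks):
    """Stable sort: static blocks first, variable blocks after,
    keeping the original relative order inside each group."""
    return sorted(blocks, key=lambda b: not b.static)


def cacheable_prefix(blocks):
    """Names of the leading static blocks. The reusable prefix ends
    at the first block that varies per request."""
    prefix = []
    for block in blocks:
        if not block.static:
            break
        prefix.append(block.name)
    return prefix


first_draft = [
    Block("request_date", "2026-04-23", static=False),
    Block("permanent_instructions", "You classify support tickets.", static=True),
    Block("tool_schemas", "classify_ticket(label)", static=True),
    Block("ticket_text", "My invoice is wrong.", static=False),
]

print(cacheable_prefix(first_draft))                     # []
print(cacheable_prefix(order_for_caching(first_draft)))  # ['permanent_instructions', 'tool_schemas']
```

In the first draft a single date line ahead of the instructions leaves nothing reusable at the front, which is the failure the paragraph above describes.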

Tool schemas deserve a separate audit row. Function and tool definitions consume context and are billed as input tokens, so a route that can only call classify_ticket should not carry the schema for billing lookup, CRM writeback, web search, calendar scheduling, and PDF parsing just because another route in the app uses them.[3]
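A per-route allowlist is one way to enforce this. In the sketch below the catalog, the schemas, the route name, and every tool other than classify_ticket are hypothetical, and serialized character length is only a rough stand-in for token count.

```python
import json

# Hypothetical app-wide tool catalog; schemas are illustrative placeholders.
TOOL_CATALOG = {
    "classify_ticket": {
        "name": "classify_ticket",
        "description": "Assign a support ticket to one category.",
        "parameters": {
            "type": "object",
            "properties": {"label": {"type": "string"}},
            "required": ["label"],
        },
    },
    "billing_lookup": {
        "name": "billing_lookup",
        "description": "Fetch invoices for an account.",
        "parameters": {
            "type": "object",
            "properties": {"account_id": {"type": "string"}},
            "required": ["account_id"],
        },
    },
    "crm_writeback": {
        "name": "crm_writeback",
        "description": "Write a note to the CRM record.",
        "parameters": {
            "type": "object",
            "properties": {"note": {"type": "string"}},
            "required": ["note"],
        },
    },
}

# Each route declares only the tools it is allowed to call.
ROUTE_TOOLS = {"ticket_triage": ("classify_ticket",)}


def tools_for_route(route: str) -> list[dict]:
    """Return only the schemas this route needs; unknown routes raise KeyError."""
    return [TOOL_CATALOG[name] for name in ROUTE_TOOLS[route]]


def schema_chars(tools) -> int:
    """Rough size proxy: compact JSON length, not a token count."""
    return sum(len(json.dumps(t, separators=(",", ":"))) for t in tools)


narrow = tools_for_route("ticket_triage")
print([t["name"] for t in narrow])  # ['classify_ticket']
print(schema_chars(narrow) < schema_chars(TOOL_CATALOG.values()))  # True
```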

Cut Repeated Prompt Tokens

  • Few-shot examples that all teach the same thing. Keep the shortest example that changes an eval result. If three examples only show "return valid JSON", move that requirement into the output schema and keep one hard edge case, such as a ticket with two possible categories.
  • Policy blocks pasted into every request. Replace a full policy memo with a policy ID, version date, and retrieved clause when the route only needs one rule. The test is simple: if the model never quotes or reasons over the full policy text, do not ship the full policy text on every call.
  • A full tool catalog on narrow routes. For a read-only extraction route, declare the extraction tool and omit write tools. This cuts input tokens and reduces accidental tool-choice surface area.
  • JSON schemas copied from the database model. The model does not need every nullable column, admin-only field, or internal audit property. Send the smallest schema the parser will validate for this route.
  • Repeated account context written as prose. Replace a sentence such as "this customer is on the enterprise plan, lives in the EU, prefers English, and has export disabled" with structured fields such as plan=enterprise, region=EU, locale=en, and export_allowed=false.
  • Verbose answer rules that force long outputs. If the downstream system only needs label, confidence, and rationale_code, do not require a paragraph explanation. Output tokens are part of provider pricing, so response length is a cost variable, not just a style choice.
  • Synchronous calls for work that can wait. Batch is not a generic make-it-cheaper switch, but it belongs in the review after the prompt has been counted. If the product SLA allows next-day completion, test batch before changing the model tier.
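The account-context item in the list above can be sketched as a small renderer. The field names come from the example in that item; the function name, the allowlist constant, and the extra record keys are hypothetical.

```python
# Only the fields this route needs; everything else in the record is dropped.
ROUTE_ACCOUNT_FIELDS = ("plan", "region", "locale", "export_allowed")


def render_account_context(account: dict) -> str:
    """Render an account record as compact key=value pairs instead of prose."""
    parts = []
    for field in ROUTE_ACCOUNT_FIELDS:
        value = account[field]
        if isinstance(value, bool):
            value = str(value).lower()  # True -> "true", False -> "false"
        parts.append(f"{field}={value}")
    return ", ".join(parts)


account = {
    "plan": "enterprise",
    "region": "EU",
    "locale": "en",
    "export_allowed": False,
    "internal_audit_note": "not needed by the model",  # hypothetical extra column
}

print(render_account_context(account))
# plan=enterprise, region=EU, locale=en, export_allowed=false
```

The same allowlist idea applies to output schemas: start from the fields the parser validates for this route, not from the database model.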

Measure Before and After Prompt Cleanup

Use a teardown worksheet before changing providers. For each route, log input tokens, cached input tokens where the provider reports them, output tokens, latency, validation failures, retry rate, and task accuracy. Then change one block at a time. If you trim examples and change the model in the same test, you will not know which change caused the cost or quality movement.
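A worksheet row and a one-change-at-a-time guard might look like the sketch below. The record type, field names, and all numbers are placeholders for illustration, not results from any real route.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteRun:
    label: str
    changed_blocks: tuple  # blocks edited relative to the baseline
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    p95_latency_ms: int
    validation_failures: int
    retries: int
    accuracy: float


METRICS = (
    "input_tokens", "cached_input_tokens", "output_tokens",
    "p95_latency_ms", "validation_failures", "retries", "accuracy",
)


def compare(baseline: RouteRun, candidate: RouteRun) -> dict:
    """Per-metric deltas (candidate minus baseline). Refuses runs that
    changed more than one block, so every delta has a single cause."""
    if len(candidate.changed_blocks) != 1:
        raise ValueError("change exactly one block per test run")
    return {m: getattr(candidate, m) - getattr(baseline, m) for m in METRICS}


# Placeholder numbers only.
baseline = RouteRun("v1", (), 4_500, 0, 120, 900, 3, 5, 0.940)
trimmed = RouteRun("v1-fewer-examples", ("few_shot_examples",), 4_100, 0, 118, 880, 3, 5, 0.940)

deltas = compare(baseline, trimmed)
print(deltas["input_tokens"], deltas["output_tokens"])  # -400 -2
```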

After you have measured tokens per call, use AI Models to compare candidate model pricing per million input and output tokens, context window sizes, modalities, and public benchmark signals against the prompt size you actually send. The cost estimator is most useful after the template has been counted, because a cheaper model can still be expensive if the route carries a bloated fixed prefix.

A real teardown from our own stack made the pattern obvious. One AI Models enrichment route had a 4,180-token request before the model’s answer: 710 tokens of instructions, 860 tokens of tool schemas, 1,050 tokens of examples, 640 tokens of provider notes, 390 tokens of output rules, and 530 tokens of model-record text. The first version put the model record and date ahead of the stable instructions, so warmed OpenAI runs still reported 0 cached input tokens on that route even when the visible prompt looked almost identical.

After teardown, the route sent 2,510 input tokens: one tool schema instead of four, two examples instead of five, four structured provider fields instead of a prose note, and a three-field JSON contract. The static route/version prefix moved first, and requests using the same cache key reported 1,920 cached input tokens after warmup. On a 200-case eval, required fields stayed at 199/200, extraction agreement with the prior accepted output moved from 94.5% to 95.0%, and average output fell from 126 tokens to 58. The point was not that shorter is always better; it was that the removed text had not been earning its tokens.

Worked example, using round numbers chosen to keep the math visible rather than taken from any provider's pricing: a ticket triage route starts with 650 input tokens of permanent instructions, 900 tokens of tool schema, 1,200 tokens of examples, 800 tokens of retrieved account context, 450 tokens of output rules, and 500 tokens of ticket text. That is 4,500 input tokens per request.

At 40,000 nightly requests, the route sends 180,000,000 input tokens before any retries. The teardown removes two duplicate examples, replaces the full account paragraph with four structured fields, trims the output contract to three fields, and keeps the static prefix stable for caching. The new request is 2,900 input tokens, so the same 40,000 requests send 116,000,000 input tokens, a reduction of 64,000,000 input tokens before batch or cache discounts.
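The same arithmetic as a short script, using only the round numbers from the worked example. The per-block breakdown of the trimmed request is not given above, so the sketch uses the 2,900-token total directly; variable names are illustrative.

```python
before_blocks = {
    "permanent_instructions": 650,
    "tool_schema": 900,
    "examples": 1_200,
    "account_context": 800,
    "output_rules": 450,
    "ticket_text": 500,
}
nightly_requests = 40_000

before_per_request = sum(before_blocks.values())
after_per_request = 2_900  # total after teardown, as stated in the example

before_total = before_per_request * nightly_requests
after_total = after_per_request * nightly_requests
saved = before_total - after_total

print(before_per_request)           # 4500
print(f"{before_total:,}")          # 180,000,000
print(f"{after_total:,}")           # 116,000,000
print(f"{saved:,}")                 # 64,000,000
print(f"{saved / before_total:.1%}")  # 35.6%
```

These are token volumes only; converting them to money requires the current per-token prices and any cache or batch discounts from the provider docs.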

Do not judge that change on cost alone. Run the old and new templates on the same eval set and compare format validity, label accuracy, refusal rate, retry count, output tokens, and p95 latency. If accuracy falls, restore the smallest block that fixes the failure. If accuracy holds and the job can wait, then test the batch lane against the provider docs in the source list.

The decision rule for tomorrow’s cleanup is concrete: remove a block if it does not improve the route’s eval result, move a block later if it changes per request, move a block earlier if it is identical and cacheable, and move the whole job to batch only when the documented completion window fits the product SLA.
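Written as code, the rule is a short chain of checks. This is a sketch of the decision rule as stated above, with hypothetical function and argument names; the inputs are judgments that come from your eval results and the provider's documented batch window.

```python
def block_action(improves_eval: bool, changes_per_request: bool,
                 identical_and_cacheable: bool) -> str:
    """Apply the cleanup rule to one prompt block."""
    if not improves_eval:
        return "remove"
    if changes_per_request:
        return "move later"
    if identical_and_cacheable:
        return "move earlier"
    return "keep in place"


def batch_fits(completion_window_hours: float, sla_hours: float) -> bool:
    """Batch is an option only when the documented window fits the SLA."""
    return completion_window_hours <= sla_hours


print(block_action(False, False, True))   # remove
print(block_action(True, True, False))    # move later
print(block_action(True, False, True))    # move earlier
print(batch_fits(24, 36))                 # True
print(batch_fits(24, 1))                  # False
```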

FAQ

Is a long prompt always bad?

No. A long static prefix can be worth keeping if it improves eval results and the provider can cache it. The problem is paying for long text that changes per request, sits before the cacheable prefix, or never changes the route’s pass rate.

Should I switch to a cheaper model before trimming the prompt?

Trim first. A cheaper model comparison is cleaner after you know the true input and output tokens per route. Otherwise, the model test is really a test of both provider pricing and template waste.

When is batch the wrong answer?

Batch is wrong when the user is waiting, retries must happen immediately, the input file exceeds provider limits, or the documented completion or queue window misses the product SLA. It is a good follow-up test for async work after the prompt has already been trimmed.

Sources

  1. OpenAI prompt caching: exact-prefix caching, prompt_cache_key, cached-token reporting, and retention policies. https://platform.openai.com/docs/guides/prompt-caching/prompt-caching
  2. Anthropic prompt caching: cache_control, default 5-minute cache, optional 1-hour TTL, and prefix order. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
  3. OpenAI function calling: tool definitions, parameters, and schema context. https://platform.openai.com/docs/guides/function-calling
  4. OpenAI Batch API: async batch processing, 24-hour completion window, limits, and discount language. https://platform.openai.com/docs/guides/batch/
  5. Anthropic batch processing: Message Batches, request/file limits, 24-hour expiration, and discounted usage. https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
  6. Vertex AI batch inference for Gemini: Gemini batch pricing, queueing behavior, and cache/batch discount interaction. https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini