Feature flags make AI launches reversible. Instead of asking whether a new model, prompt, retrieval flow, or tool-calling feature is "ready," use a flag to decide who gets it, what it is allowed to do, and how quickly you can turn it off when real production inputs expose a problem.
Last verified: April 23, 2026. Provider pricing, limits, and model availability change frequently; use the source table at the end as a check before quoting any number in a contract, RFP, or cost plan.
What This Means
- Put the flag around the AI behavior, not just the button that opens it.
- Start with draft-only or review-only output before the system writes data, sends messages, or changes account state.
- Roll out by segment, watch quality and cost together, and expand only after two clean measurement windows.
- Keep a kill switch that can return traffic to the previous route without a redeploy.
What to Put Behind an AI Feature Flag
A button flag is not enough for an AI launch. The risky part may be the model, prompt, endpoint choice, retrieval index, tool schema, or action policy that runs after the user clicks.
- Model route: choose a new model route through a provider API such as the Responses API[1], an alternate route, or the previous non-AI path.
- Prompt version: ship `support_summary.v2026_04_23` beside the prior prompt and log the prompt version on every output.
- Endpoint mode: flag synchronous calls separately from batch, offline inference, or evaluation-only jobs.
- Retrieval source: choose the help-center index, product-docs index, customer-specific index, or no retrieval for unsupported tasks.
- Tool mode: control whether the model can call zero, one, or multiple tools; function-calling guidance makes clear that your application executes the tool call, not the model.[2]
- Action level: start with draft-only output before user-confirmed, human-reviewed, or automatic actions that write data, send messages, or change account state.
- Fallback: keep the previous prompt, previous model route, deterministic search path, or human queue available until the new route has passed production checks.
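As a minimal sketch, the axes listed above can be carried as one immutable value per request. The class name, field names, and example values here are hypothetical illustrations, not part of any flag vendor's API.

```python
from dataclasses import dataclass
from enum import Enum


class ActionLevel(Enum):
    DRAFT_ONLY = "draft_only"
    USER_CONFIRMED = "user_confirmed"
    HUMAN_REVIEWED = "human_reviewed"
    AUTOMATIC = "automatic"


@dataclass(frozen=True)
class AIFlagSet:
    """One evaluated set of flag values for a single AI request (hypothetical shape)."""
    model_route: str        # e.g. "new", "alternate", or "previous"
    prompt_version: str     # e.g. "support_summary.v2026_04_23"
    endpoint_mode: str      # "sync", "batch", or "eval_only"
    retrieval_source: str   # e.g. "help_center", "product_docs", or "none"
    max_tool_calls: int     # 0 disables tool calling entirely
    action_level: ActionLevel

    def may_write(self) -> bool:
        """Only the automatic level may write without a person in the loop."""
        return self.action_level is ActionLevel.AUTOMATIC


# Assumed fallback values, kept available until the new route passes production checks.
FALLBACK = AIFlagSet(
    model_route="previous",
    prompt_version="support_summary.previous",
    endpoint_mode="sync",
    retrieval_source="help_center",
    max_tool_calls=0,
    action_level=ActionLevel.DRAFT_ONLY,
)
```

Keeping the set frozen means a request cannot change its own route or action level halfway through, which makes the logged values trustworthy during an incident.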
Function-calling guidance suggests keeping the exposed function set small for accuracy; one useful review threshold is 20 tools.[2] If one AI feature needs more than 20 exposed tools, split the workflow or hide lower-priority tools behind a separate allowed-tools flag.
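One way to enforce that review threshold is a small guard that runs before the tool list is sent to the model. This is a sketch under assumptions: `exposed_tools`, the priority-ordered `all_tools` list, and the allowed-tools flag value are all hypothetical names, and 20 is the review threshold from the text rather than a provider-enforced limit.

```python
REVIEW_THRESHOLD = 20  # review threshold from the text, not a provider-enforced limit


def exposed_tools(all_tools: list[str], allowed: set[str],
                  limit: int = REVIEW_THRESHOLD) -> list[str]:
    """Return the tools the model may see, preserving priority order.

    `all_tools` is assumed to be sorted from highest to lowest priority;
    `allowed` is the value of a hypothetical allowed-tools flag.
    """
    visible = [name for name in all_tools if name in allowed]
    if len(visible) > limit:
        raise ValueError(
            f"{len(visible)} tools exposed; split the workflow or trim the "
            f"allowed-tools flag to {limit} or fewer"
        )
    return visible
```

Raising instead of silently truncating is a deliberate choice in this sketch: a truncated tool list changes model behavior in ways that are hard to spot in rollout telemetry.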
Prompt caching is also behavior, not infrastructure trivia. Some providers price cache writes and cache reads differently from base input tokens, so cache policy should be visible in rollout telemetry when repeated long prompts drive cost.[3]
How to Roll Out AI Features Safely
The first rollout group should prove one specific thing: the route works for real production inputs without exposing the riskiest workflow first. For AI features, segment by user, workflow, model route, endpoint mode, and action level.
| Segment | Entry rule | Expansion signal |
|---|---|---|
| Internal users | Company accounts only, draft-only output, prompt and model route logged. | No unauthorized tool execution and no blocked fallback during one business day of use. |
| Beta customers | Opt-in accounts with support owner, rollback contact, and visible feedback channel. | Accepted-output rate beats the prior route and the support-ticket rate does not rise by more than 2 percentage points. |
| Low-risk workflows | Summaries, tags, or recommendations that a user can discard before saving. | Schema-valid output rate is at least 99% for two measurement windows. |
| Small traffic percentage | Start at 1% to 5% of eligible synchronous requests, not all requests. | Fallback rate and validation-failure rate stay within baseline plus 2 percentage points. |
| Batch or offline work | Only jobs whose results can wait for provider batch windows and can be checked before publication. | Completed outputs reconcile with input IDs, error files are reviewed, and no output is auto-published. |
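For the small-traffic-percentage segment in the table above, assignment has to be stable so the same account does not flip between routes from one request to the next. A common approach, sketched here with a hypothetical `in_rollout` helper, is to hash the flag name together with the user or account ID into a fixed bucket.

```python
import hashlib


def in_rollout(flag_name: str, unit_id: str, percent: float) -> bool:
    """Deterministically place `unit_id` in or out of a percentage rollout.

    Hashing the flag name together with the ID keeps assignment stable across
    requests and independent between flags.
    """
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # 5.0 percent -> buckets 0..499
```

Because the bucket for an ID never changes, raising the percentage from 1 to 5 only adds units to the rollout; nobody who already had the new route loses it.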
The endpoint choice belongs in the segment plan too. Batch and offline routes can reduce cost and remove live-user latency, but they also add queueing, larger files, reconciliation work, and delayed failures. Use them for jobs that can be checked before publication, not for interactions where the user expects an immediate answer.
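The reconciliation work mentioned above can be sketched as a set comparison between submitted input IDs, returned output IDs, and IDs found in the provider's error file. The function names and the report shape are assumptions for illustration; real batch files would need provider-specific parsing first.

```python
def reconcile(submitted_ids: set[str], output_ids: list[str],
              error_ids: set[str]) -> dict[str, set[str]]:
    """Compare batch outputs against submitted inputs before publication."""
    seen: set[str] = set()
    duplicates: set[str] = set()
    for output_id in output_ids:
        if output_id in seen:
            duplicates.add(output_id)
        seen.add(output_id)
    return {
        "missing": submitted_ids - seen - error_ids,  # no output and no error
        "unmatched": seen - submitted_ids,            # output for an unknown input
        "duplicates": duplicates,
        "failed": error_ids & submitted_ids,          # needs owner review
    }


def safe_to_publish(report: dict[str, set[str]]) -> bool:
    """Publish only when every category that needs review is empty."""
    return not any(report.values())
```

In this sketch any failed record blocks publication until someone reviews it, which matches the rule that no batch output is auto-published.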
Last-Verified Batch Planning Snapshot
| Route | Rollout decision it affects | Planning note | Source |
|---|---|---|---|
| OpenAI Batch | Offline evals, bulk classification, backlog tagging. | Discounted versus synchronous APIs, 24-hour turnaround target, and file/request limits. | [4] |
| Anthropic Message Batches | Large async jobs that can wait for completion. | Discounted batch pricing, request and file caps, and completion windows documented separately. | [5][6] |
| Vertex AI batch inference for Gemini | High-volume jobs already using Google Cloud storage and review workflows. | Discounted compared with real-time inference, queue expiration, SLA exclusions, and cache-discount precedence. | [7] |
| Azure OpenAI Batch | Enterprise Azure deployments with async workloads and storage controls. | Target turnaround, discounted pricing, and different file limits depending on storage setup. | [8] |
| Amazon Bedrock batch inference | S3-based input and output pipelines with model-specific quotas. | Not supported for provisioned models; pricing and quota details vary by selected foundation model. | [9][10][11] |
Worked Example: Support Summary Rollout
In one anonymized B2B support rollout, the first beta cohort exposed a failure staging missed: long enterprise tickets produced valid JSON with the wrong priority because the retrieved context included old SLA language. The flag caught it because beta traffic was capped at 5%, automatic writes were disabled, and expansion required priority-tag disagreement below 1.5% plus 99% schema validity across two measurement windows.
In a support-summary rollout, use this workflow before the flag moves beyond beta.
- Shortlist two model routes and one fallback path in the model compare sheet, then define flags for `model_route`, `prompt_version`, `endpoint_mode`, and `action_level`.
- Run the new prompt on a frozen evaluation set of real, redacted tickets and require at least 99% JSON schema validity before any customer sees output.
- Expose internal users to draft-only summaries first; block tool calls that update tickets, send replies, or change priority.
- Move beta accounts to 5% of eligible synchronous requests only if fallback rate, validation failures, priority disagreement, and support ticket labels stay within the release thresholds.
- For nightly backlog tagging, use batch only when the job fits the selected provider’s current batch limits and you can reconcile every output ID before publication.
- Expand traffic only after two measurement windows pass; if either window fails, set `action_level=draft_only` or `model_route=previous` before changing the prompt again.
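The expansion rule in this example can be sketched as a check over the two most recent measurement windows, using the thresholds quoted above (at least 99% schema validity, priority disagreement below 1.5%). The `Window` type and `next_step` function are hypothetical, and this sketch always rolls back to `draft_only`; choosing `model_route=previous` instead is left to the operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    schema_valid_rate: float           # fraction of outputs passing the JSON schema
    priority_disagreement_rate: float  # fraction where a reviewer changed the priority


def window_passes(w: Window) -> bool:
    return w.schema_valid_rate >= 0.99 and w.priority_disagreement_rate < 0.015


def next_step(windows: list[Window]) -> dict[str, str]:
    """Decide from the two most recent measurement windows."""
    recent = windows[-2:]
    if len(recent) < 2:
        return {"decision": "hold"}  # not enough evidence yet
    if all(window_passes(w) for w in recent):
        return {"decision": "expand"}
    return {"decision": "rollback", "action_level": "draft_only"}
```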
What an AI Kill Switch Should Control
Every customer-facing AI flag needs a fast disable path that changes behavior without a redeploy. The kill switch should control the model route, the action policy, and any queued work that can surface later.
- `ai_feature_enabled=false`: return the previous deterministic workflow.
- `model_route=previous`: switch from the new model route back to the last accepted route.
- `prompt_version=previous`: keep the new model but restore the prior prompt when the model is not the problem.
- `action_level=draft_only`: stop automatic writes while still collecting reviewer feedback.
- `human_review_required=true`: route high-risk outputs to a reviewer before the user sees them.
- `batch_submit_enabled=false`: stop new batch jobs and quarantine existing output files until an owner reviews them.
Test the kill switch before expanding past beta. A practical release check is to disable the feature, confirm that one synchronous request uses the non-AI path, confirm that one queued batch job is not published automatically, and record the operator, timestamp, old value, and new value in the rollout log within 5 minutes.
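A rehearsal of that release check might look like the sketch below, which applies a subset of the kill values and records operator, timestamp, old value, and new value for each change. The in-memory `flags` dict and `log` list stand in for whatever flag store and rollout log you actually use; `apply_kill_switch` is a hypothetical helper, not a vendor API.

```python
from datetime import datetime, timezone

# Assumed subset of kill values; which flags to flip depends on the incident.
KILL_VALUES = {
    "ai_feature_enabled": "false",
    "action_level": "draft_only",
    "batch_submit_enabled": "false",
}


def apply_kill_switch(flags: dict[str, str], operator: str,
                      log: list[dict[str, str]]) -> dict[str, str]:
    """Return a new flag dict with kill values applied, logging each change."""
    updated = dict(flags)
    for name, new_value in KILL_VALUES.items():
        old_value = flags.get(name, "<unset>")
        if old_value == new_value:
            continue  # already safe, nothing to record
        updated[name] = new_value
        log.append({
            "operator": operator,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "flag": name,
            "old_value": old_value,
            "new_value": new_value,
        })
    return updated
```

Returning a new dict rather than mutating the input keeps the pre-incident values available, so restoring them after the incident is a plain assignment.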
What to Measure Before Expanding an AI Feature
Do not expand because there are no crashes. AI quality can degrade through bad grounding, wrong tool arguments, hidden cost, long tail latency, or outputs that look fluent but fail a business rule.
- Usage by segment: requests, users, accounts, and workflow type for each flag value.
- Latency: p50, p95, timeout rate, and provider route for synchronous requests.
- Batch completion: submitted records, completed records, failed records, expired records, and unmatched output IDs.
- Cost per accepted task: input tokens, output tokens, cached input tokens, batch tier, and fallback attempts divided by accepted outputs.
- Validation failures: schema errors, citation errors, missing required fields, unsafe content blocks, and tool argument rejects.
- Human review: approval rate, edit rate, rejection rate, and top rejection reason.
- User feedback: thumbs-down rate, undo rate, regenerate rate, and support tickets tagged to the AI feature.
- Fallback frequency: provider errors, rate-limit retries, retrieval misses, and manual fallback actions.
The OpenAI pricing page notes that reasoning tokens are not visible through the API but still occupy the model context window and are billed as output tokens.[12] That is why rollout dashboards should track total provider-billed tokens and cost per accepted task, not only the visible answer length.
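Cost per accepted task can be sketched as total provider-billed spend, fallback attempts included, divided by accepted outputs. The types below are hypothetical, prices are placeholders rather than provider quotes, and the sketch assumes `input_tokens` excludes cached tokens and `output_tokens` already includes any billed reasoning tokens; check how your provider reports usage before reusing this arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int         # assumed to exclude cached input tokens
    cached_input_tokens: int
    output_tokens: int        # provider-billed output, including reasoning tokens


@dataclass(frozen=True)
class Prices:
    """Prices per million tokens; placeholders, not provider quotes."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def cost(usage: Usage, prices: Prices) -> float:
    return (
        usage.input_tokens * prices.input_per_m
        + usage.cached_input_tokens * prices.cached_input_per_m
        + usage.output_tokens * prices.output_per_m
    ) / 1_000_000


def cost_per_accepted_task(calls: list[Usage], prices: Prices, accepted: int) -> float:
    """Total spend across all calls, including fallbacks, per accepted output."""
    if accepted == 0:
        return float("inf")  # spend with nothing accepted
    return sum(cost(u, prices) for u in calls) / accepted
```

Dividing by accepted outputs instead of requests is what exposes a route that looks cheap per call but needs several attempts per usable result.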
How to Compare AI Model and Prompt Variants
Feature flags are useful for model and prompt comparisons only when the test has one changed variable, stable traffic assignment, and guardrail metrics that can stop the rollout before the winner is declared.
| Variant | Primary metric | Guardrail |
|---|---|---|
| Prompt A vs prompt B | Task success, schema validity, and reviewer edit rate. | No increase in unsafe output, missing citations, or malformed JSON. |
| New model route vs fallback route | Accepted-output rate and cost per accepted task. | p95 latency, fallback rate, timeout rate, and provider error rate. |
| Draft suggestion vs automatic action | User acceptance, undo rate, and support impact. | Automatic action disabled if review rejection rises above baseline plus 2 percentage points. |
| Retrieval source A vs retrieval source B | Grounded answer rate and citation accuracy. | Missing-source and wrong-source defects reviewed before expansion. |
| Batch vs synchronous route | Cost per completed offline task. | No user-facing request waits on a 24-hour batch window. |
Public benchmarks can screen model candidates, but they should not approve a rollout by themselves. Keep only benchmarks that map to the work: broad reasoning benchmarks such as MMLU[13] can help for knowledge-heavy tasks, while SWE-bench Verified[14] is more relevant to software-engineering agents. Treat both as shortlist inputs; the expansion decision still belongs to your eval set, production feedback, latency, cost, and fallback data.
Useful Planning Tool
When candidates get serious, compare routes in Deep Digital Ventures AI Models for pricing per million input and output tokens, context windows, modalities, public benchmark scores, an in-page compare sheet, and a cost estimator panel. Use it before the rollout plan is locked; the flag decision should still come from your own evals and production data.
How to Avoid AI Feature Flag Sprawl
AI flags become hard to debug when prompt versions, model routes, retrieval indexes, cache policies, and action modes can combine without ownership. Treat every AI flag as temporary unless it is a permanent product permission.
- Name the axis in the flag, such as `ai.support_summary.model_route`, `ai.support_summary.prompt_version`, or `ai.support_summary.action_level`.
- Record the owner, creation date, expected review date, source docs used, and rollback value.
- Review temporary rollout flags within 30 days after the route reaches 100% of eligible traffic.
- Delete losing prompt and model variants after the decision is made and logs no longer need live lookup.
- Limit one request path to two active routing flags unless an incident or experiment owner approves the combination.
- Store model route, prompt hash, retrieval index version, endpoint mode, and action level on each AI output for later audit.
A clean flag set lets an engineer answer the incident question quickly: which model route, prompt, retrieval source, cache policy, and action level produced this exact output?
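To keep the 30-day review from depending on memory, flag metadata can live in a small registry that is checked on a schedule. This is a sketch under assumptions: `FlagRecord` and `overdue_for_review` are hypothetical names, and the fields mirror only part of the checklist above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class FlagRecord:
    name: str                             # e.g. "ai.support_summary.model_route"
    owner: str
    created: date
    rollback_value: str
    full_rollout: Optional[date] = None   # date the route reached 100% of eligible traffic
    permanent: bool = False               # True only for permanent product permissions


def overdue_for_review(flags: list[FlagRecord], today: date,
                       window_days: int = 30) -> list[str]:
    """Names of temporary flags still present past the review window."""
    limit = timedelta(days=window_days)
    return [
        f.name for f in flags
        if not f.permanent
        and f.full_rollout is not None
        and today - f.full_rollout > limit
    ]
```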
How Feature Flags Support AI Governance
Feature flags are not only release tools for AI systems. They are the record of who was exposed, what model behavior ran, how much traffic used it, which evidence supported expansion, and who could stop it.
- Access: which accounts, users, tiers, regions, or beta cohorts can use the capability?
- Version: which model route, prompt version, retrieval index, tool schema, and cache policy are active?
- Exposure: what percentage of eligible traffic is routed to the new AI path?
- Evidence: which eval set, benchmark snapshot, cost report, and production metric window support expansion?
- Authority: who can disable the feature, change the route, or force human review during an incident?
A practical expansion rule is simple: move to the next segment only when task success beats the incumbent route, cost per accepted task is inside budget, p95 latency is inside the product target, and guardrail failures stay within baseline plus 2 percentage points for two consecutive measurement windows.
FAQ
Should the UI flag and AI route flag be the same flag?
Usually no. Keep the UI flag separate from the server-side route flag so you can hide the entry point, switch the model route, or force draft-only output independently during an incident.
When should an AI feature use batch instead of synchronous inference?
Use batch for offline evaluation, backlog tagging, bulk classification, embedding jobs, or nightly summaries where a provider’s 24-hour or longer completion window is acceptable. Do not route a live user request to batch unless the product explicitly tells the user the result will arrive later.
What should be disabled first during a risky AI incident?
Disable automatic actions first, then switch to draft-only or human review, then route to the previous prompt or model if quality remains poor. If batch jobs are involved, stop new submissions and quarantine output files before they enter customer-facing systems.
Sources
1. OpenAI Responses API reference: https://platform.openai.com/docs/api-reference/responses
2. OpenAI function calling guide: https://platform.openai.com/docs/guides/function-calling
3. Anthropic prompt caching documentation: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
4. OpenAI Batch API guide: https://platform.openai.com/docs/guides/batch
5. Anthropic Message Batches documentation: https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
6. Anthropic Create a Message Batch API reference: https://docs.anthropic.com/en/api/creating-message-batches
7. Google Vertex AI batch inference guide for Gemini: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
8. Microsoft Azure OpenAI Batch documentation: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/batch
9. Amazon Bedrock batch inference documentation: https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html
10. Amazon Bedrock pricing page: https://aws.amazon.com/bedrock/pricing/
11. AWS General Reference quotas for Bedrock: https://docs.aws.amazon.com/general/latest/gr/bedrock.html
12. OpenAI pricing page: https://platform.openai.com/docs/pricing/
13. MMLU paper: https://arxiv.org/abs/2009.03300
14. SWE-bench Verified benchmark page: https://www.swebench.com/verified.html