An AI vendor change is safest when it is treated as a product migration, not a model swap. The team should prove that the new route improves a defined customer outcome, keeps data controls intact, and can be rolled back without a code deploy.
Migration Checklist at a Glance
- Write one reason for the change and one primary metric before changing the adapter.
- Inventory prompts, schemas, tools, safety behavior, logging, billing, and data controls that depend on the current provider.
- Compare synchronous and batch routes by deadline, failure handling, file limits, and real cost per accepted output.
- Replay production-like examples side by side and require written pass/fail gates before any canary.
- Keep external data, batch files, dashboards, and logs inside the same security review as the production workflow.
- Roll out in phases, monitor the first production week, and keep rollback callable until support and billing evidence is clean.
As of 2026-04-23, the pricing, limits, and behaviors below are summarized from the provider sources listed at the end of this post. Provider pricing and model availability change frequently – verify those pages before quoting in a contract, RFP, or cost plan.
This checklist comes from migration reviews across support classification, summarization, agent/tool-call, evaluation, and enrichment workflows. The repeated failure patterns were practical: broken JSON, changed refusal behavior, missing request IDs, row-order joins in batch output, prompt caching assumptions, and cost estimates that ignored retries. Here, customer-safe means no unresolved increase in latency, unsupported refusals, parser failures, data exposure, or support traceability gaps.
A vendor change should begin with a source review, not a model ranking. For the routes in scope, capture the current batch, tool-use, safety, data-control, and pricing rules from OpenAI, Anthropic, Google Vertex AI, Amazon Bedrock, and Azure OpenAI before the rollout ticket is approved.[1][2][3][4][5]
1. Clarify Why the Vendor Is Changing
Write one migration reason and one primary metric before touching the adapter. A cost migration should be measured as cost per successful task, not only cost per token. A quality migration should be measured on accepted outputs, reviewer defects, or downstream task completion. A reliability migration should be measured on provider error rate, timeout rate, and fallback rate. A batch migration should be measured against the job deadline, not against chat latency.
- Cost example: move nightly ticket tagging to batch only if no user is waiting for the result and the completed-job cost per accepted ticket falls below the current route.
- Quality example: move a code explanation workflow from an OpenAI GPT-family route to a Claude Sonnet-tier or Gemini-family route only after internal examples improve, not because a public benchmark is higher.
- Context example: move long support threads only after measuring how often the current route truncates retrieved messages and whether the new route changes the truncation rule.
- Tool-calling example: move an agent only after every tool argument schema, retry rule, idempotency key, and failure message has been replayed against the new provider.
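As a rough illustration of the cost metric described above, the sketch below divides everything billed for a route by the number of outputs that were actually accepted. The `RouteUsage` type, its field names, and the dollar figures are hypothetical placeholders, not values from any provider.

```python
from dataclasses import dataclass


@dataclass
class RouteUsage:
    """Totals for one route over the same evaluation window (hypothetical fields)."""
    total_cost_usd: float  # everything billed, including retries and failed jobs
    accepted_tasks: int    # outputs that passed validation and review


def cost_per_accepted_task(usage: RouteUsage) -> float:
    """Cost per successful task; infinite when nothing was accepted."""
    if usage.accepted_tasks == 0:
        return float("inf")
    return usage.total_cost_usd / usage.accepted_tasks


# Illustrative numbers only: the target route bills less but also accepts fewer outputs.
current = RouteUsage(total_cost_usd=120.0, accepted_tasks=9_600)
target = RouteUsage(total_cost_usd=70.0, accepted_tasks=9_100)

for name, usage in (("current", current), ("target", target)):
    print(f"{name}: ${cost_per_accepted_task(usage):.4f} per accepted task")
```

Dividing by accepted tasks rather than by tokens means a cheaper route that fails more often is charged for its own retries and rejects.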
For async workloads, use provider limits as routing constraints, not as the migration argument. The table below keeps only the differences that change architecture, testing, or customer promises.
| Route or provider | Decision-relevant source signal | Migration question |
|---|---|---|
| OpenAI Batch | 50% lower cost is documented with a 24-hour completion window and input-file request and size limits.[1] | Can the workflow wait up to a day, and are records keyed by custom_id before export? |
| Anthropic Message Batches | Batch docs and pricing describe a 50% discount, a 24-hour expiration window, and large batch request limits.[2][6] | Does the job tolerate expiration, and are failed records requeued without duplicate customer actions? |
| Vertex AI Gemini batch | Gemini batch prediction documents a 50% batch rate, high request limits, Cloud Storage inputs, up to 72 hours of queue time before expiration, and exclusion from the SLO of any SLA.[3] | Does the customer deadline survive queue risk, and is the storage bucket governed like production data? |
| Azure OpenAI Batch | Azure describes lower cost than global standard with a 24-hour target turnaround.[5] | Are regional, content-filtering, and enterprise logging requirements the same as the current route? |
| Amazon Bedrock batch | Bedrock pricing says select foundation models support lower batch pricing, while the batch docs exclude provisioned models.[7][4] | Is the target model supported on the selected capacity mode, and does the output location meet retention rules? |
For shortlist math, the optional Deep Digital Ventures AI Models page can help compare token pricing, context windows, modalities, and public benchmark fields. Treat it as a planning aid, then confirm final rates and limits in the provider sources before the rollout ticket closes.
2. Inventory Current Vendor Dependencies
Build the inventory from code, prompts, logs, dashboards, and billing exports. A provider adapter is rarely just one HTTP client. It usually encodes role names, stop conditions, retry behavior, parser assumptions, safety responses, and token fields that downstream systems already depend on.
Use this table as a pass/fail checklist. A row passes only when the evidence is captured in the migration ticket and one named owner has accepted the risk.
| Dependency | What to capture before migration | Customer-safe migration check |
|---|---|---|
| Prompt hierarchy | System, developer, user, retrieval, and hidden guardrail text, plus ordering rules. | Replay the same examples and diff whether the new model obeys the highest-priority instruction. |
| Output schema | JSON Schema, enum values, nullable fields, markdown contracts, and parser retries. | Run validation before human review; block rollout if parser failures rise above the written gate. |
| Context and retrieval | Input token distribution, retrieved document count, truncation rule, and citation mapping. | Compare answers on long-context examples and record whether omitted evidence changes the answer. |
| Tool calls | Provider tool format, argument schema, retries, side effects, and timeout handling. | Replay against fake tools first; compare OpenAI function-calling flow with Anthropic tool-use semantics if both are in scope.[8][9] |
| Safety behavior | Refusal text, moderation route, escalation copy, and product policy mapping. | If Azure OpenAI is in scope, review its content-filter categories and severity levels before comparing refusal rates.[10] |
| Logging | Request IDs, model IDs, prompt hashes, output hashes, token fields, and user-visible error IDs. | Confirm support can trace one customer complaint from UI event to provider request without exposing raw sensitive data. |
| Billing | Input tokens, output tokens, cached tokens, batch labels, tool charges, and failed-request treatment. | Recalculate cost per successful task from provider usage records, not from estimates alone. |
| Data controls | Training use, retention, region, encryption, admin access, and batch-file storage. | Get security review signoff before canary traffic, especially if prompts include customer confidential data. |
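One way to satisfy the logging row without storing raw text is to log hashes next to the identifiers. The sketch below is a minimal example under that assumption; the function and field names are hypothetical and should be mapped to whatever the existing logging pipeline already uses.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Stable fingerprint that lets support match records without reading the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_trace_record(*, ui_event_id: str, provider: str, model_id: str,
                       provider_request_id: str, prompt: str, output: str,
                       input_tokens: int, output_tokens: int) -> dict:
    """Build one log row; the raw prompt and output are deliberately not included."""
    return {
        "ui_event_id": ui_event_id,
        "provider": provider,
        "model_id": model_id,
        "provider_request_id": provider_request_id,
        "prompt_sha256": sha256_hex(prompt),
        "output_sha256": sha256_hex(output),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
```

A plain hash only helps match records. It is not anonymization for short or guessable inputs, so the security review still decides where these rows may be stored.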
A practical gate is concrete. For example: schema failures stay within 0.1 percentage points of baseline, no P0 or P1 workflow adds a new unresolved refusal type, support traces five sampled outputs end to end, and finance approves cost per accepted result after retries and failed jobs are counted.
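The example gate above can be written as a small check so that the pass/fail result is not left to interpretation. This is a sketch only: the `GateInputs` fields are hypothetical names, and the thresholds are the ones quoted in the example.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    baseline_schema_failure_rate: float  # fraction of outputs, e.g. 0.004 means 0.4%
    target_schema_failure_rate: float
    unresolved_new_refusal_types: int    # counted on P0 and P1 workflows only
    traces_completed: int                # sampled outputs traced end to end by support
    finance_approved: bool               # cost per accepted result, retries included


def evaluate_gate(g: GateInputs, max_delta_pp: float = 0.1,
                  required_traces: int = 5) -> dict[str, bool]:
    """Return one pass/fail flag per gate; rollout proceeds only if all are True."""
    delta_pp = (g.target_schema_failure_rate - g.baseline_schema_failure_rate) * 100
    return {
        "schema": delta_pp <= max_delta_pp,
        "refusals": g.unresolved_new_refusal_types == 0,
        "traceability": g.traces_completed >= required_traces,
        "cost": g.finance_approved,
    }
```

Returning one flag per gate, rather than a single boolean, keeps the failing owner visible in the migration ticket.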
3. Run Side-by-Side Tests
Public benchmarks are useful for shortlisting, but they are not a migration test. Record a 2026-04-23 benchmark snapshot for any public score you cite, including LMArena, SWE-bench, MMLU, GPQA, and HumanEval.[11][12][13][14][15] Then let production-like examples decide the route.
Use this four-step mini-workflow for a support-ticket classifier migration from a synchronous route to a lower-cost batch route.
- Export 1,000 recent tickets: 700 common categories, 200 escalation or revenue-sensitive tickets, and 100 historical failures, policy-sensitive examples, or malformed inputs.
- Shortlist two candidate routes by model family, modality, context window, benchmark fields, and per-million-token pricing, then confirm batch limits in the sources table.
- Run the current and target routes on the same prompts, with fixed sampling settings where the provider supports them, and store model ID, input tokens, output tokens, validation result, refusal flag, reviewer grade, and request identifier.
- Approve the canary only if the schema-failure rate is no more than 0.1 percentage points above baseline, every new refusal in the top workflows has a written disposition, and cost per accepted ticket improves after retries and failed jobs are counted.
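The 700/200/100 export in the first step can be made reproducible with a seeded stratified sample. The sketch below assumes each ticket is a dict that already carries a hypothetical `stratum` label; how tickets are labeled is left to the team.

```python
import random

# Quotas from the workflow above; the stratum names are hypothetical labels.
QUOTAS = {"common": 700, "sensitive": 200, "hard": 100}


def build_replay_set(tickets: list[dict], quotas: dict[str, int] = QUOTAS,
                     seed: int = 0) -> list[dict]:
    """Draw a fixed number of tickets per stratum, reproducibly for a given seed."""
    rng = random.Random(seed)
    selected: list[dict] = []
    for stratum, quota in quotas.items():
        pool = [t for t in tickets if t["stratum"] == stratum]
        if len(pool) < quota:
            raise ValueError(f"need {quota} '{stratum}' tickets, found {len(pool)}")
        selected.extend(rng.sample(pool, quota))
    return selected
```

Fixing the seed means both routes, and any later rerun, see exactly the same 1,000 tickets.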
Blind review matters when tone, reasoning, and usefulness affect customers. Have reviewers compare outputs without seeing the vendor name, then label each pair as current wins, target wins, tie, or both fail. Automated checks should run first so reviewers spend time on valid outputs, not broken JSON.
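A minimal sketch of that blinding step follows. It randomizes which output appears as A or B, keeps the mapping in a private key, and converts the reviewer's verdict back into the four labels afterward. All names are hypothetical.

```python
import random
from collections import Counter


def blind_pair(example_id: str, current_output: str, target_output: str,
               rng: random.Random) -> tuple[dict, dict]:
    """Return (what the reviewer sees, a private key used later for unblinding)."""
    swap = rng.random() < 0.5
    a, b = (target_output, current_output) if swap else (current_output, target_output)
    shown = {"example_id": example_id, "A": a, "B": b}
    key = {"A": "target" if swap else "current",
           "B": "current" if swap else "target"}
    return shown, key


def unblind(verdict: str, key: dict) -> str:
    """verdict is 'A', 'B', 'tie', or 'both_fail' as entered by the reviewer."""
    if verdict in ("tie", "both_fail"):
        return verdict
    return f"{key[verdict]}_wins"


def tally(verdicts_with_keys: list[tuple[str, dict]]) -> Counter:
    """Count current_wins, target_wins, tie, and both_fail across all pairs."""
    return Counter(unblind(verdict, key) for verdict, key in verdicts_with_keys)
```

The key should be stored where reviewers cannot read it until grading is finished.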
The gate should name the approver: product for outcome changes, support for traceability and escalation copy, security for data movement, and finance for unit cost. If one owner cannot sign off, keep the workflow on the old route.
4. Protect Data Controls
A vendor migration is a fresh data review. The team should list exactly what leaves its systems, where files are stored, who can open dashboards, and which logs contain prompt or output text.
- OpenAI: review platform data controls, including training-use settings and default abuse-monitoring retention.[16]
- Google Vertex AI: review data governance, zero data retention, cache settings, abuse-monitoring logging, and Gemini cache retention notes.[17]
- Amazon Bedrock: review the data-protection statement on prompt and completion storage, logging, and model training use.[18]
- Azure OpenAI: review content filtering because filtering can change both prompt handling and completion behavior before your application sees the final response.[10]
The pass condition is concrete: the security reviewer can point to the storage location, retention window, access log, deletion process, and dashboard access list for the new route. Batch files deserve the same controls as production databases.
If a JSONL file contains customer text, store it in the right bucket or container, set expiration, log access, and delete failed input files after the retention window in your security review. Do not move a workflow to a cheaper async route if the file path creates a new data exposure.
5. Plan for Output Differences
Higher model quality can still break the product. A longer answer can overflow the UI. A more cautious safety policy can block valid customer work. A better tool caller can duplicate an action if your idempotency layer is weak. Decide which differences are acceptable before any customer sees them.
Use the table as a failure-pattern checklist during replay. Each row should have at least one saved example that passed on the old route and one example that historically failed.
| Output difference | Customer risk | Pre-rollout test |
|---|---|---|
| Longer or shorter answers | The UI can feel noisy, clipped, or under-explained. | Compare character count and token count distributions, then update display limits and summaries. |
| Different tone | Support, sales, or medical-adjacent copy can sound off-brand or overconfident. | Blind review examples from the highest-traffic workflows and write tone rules into the prompt. |
| More refusals | Valid users can be blocked from normal tasks. | Review every new refusal in the top workflows and map each to product policy. |
| Less strict formatting | Downstream parsers, automations, and exports can fail. | Use schema validation; if OpenAI is in scope, compare JSON mode assumptions with Structured Outputs.[19] |
| Different tool calls | External actions can be missed, duplicated, or called with incomplete arguments. | Replay with fake tools and require idempotency keys before live tool execution. |
| Different citations | Reviewers may trust an answer that is not tied to the retrieved source. | Require citations to map to stored retrieval document IDs, not only to generated text. |
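The formatting and citation rows can be checked automatically before any human review. The sketch below uses only the standard library and assumes a hypothetical output contract: a JSON object with a `category` enum and a `citations` list of retrieval document IDs. A real workflow would substitute its own schema.

```python
import json

ALLOWED_CATEGORIES = {"billing", "bug", "account", "other"}  # hypothetical enum


def validate_output(raw: str, retrieved_doc_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        problems.append("category missing or not in enum")

    citations = data.get("citations")
    if not isinstance(citations, list):
        problems.append("citations missing or not a list")
    else:
        # Citations must point at stored retrieval IDs, not at generated text.
        unknown = [c for c in citations
                   if not isinstance(c, str) or c not in retrieved_doc_ids]
        if unknown:
            problems.append(f"citations not in retrieval set: {unknown}")
    return problems
```

Running the same validator on both routes gives the parser-failure counts that the written gate compares.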
Prompt caching is another output-adjacent dependency because it changes cost and sometimes timing assumptions around long prompts. If the migration uses Claude, review Anthropic prompt caching, including the 5-minute and 1-hour cache durations.[20] If it uses Gemini on Vertex AI, review context caching before assuming cache and batch discounts combine.[21]
6. Use a Phased Rollout
A safe rollout makes the new route earn traffic. Rollback should be tested in staging and in a real canary, and the old route should remain callable until the cutover conditions below are met.
- Offline replay: run saved examples through both vendors and block obvious schema, safety, and cost regressions.
- Internal traffic: route employees or test accounts through the new provider with support-visible request IDs.
- Canary traffic: start with low-risk tasks, not the highest-revenue or regulated workflow.
- Expansion: increase traffic only after the metrics window clears the written gates.
- Cutover: remove the old route only after rollback is no longer needed for open support cases, invoices, or audit trails.
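The phases above can share one routing function whose inputs are flags read at request time, so that expanding traffic or rolling back is a config change rather than a deploy. The sketch below is one possible shape; the flag names are hypothetical and the flag store itself is out of scope.

```python
import hashlib


def bucket(key: str, buckets: int = 10_000) -> int:
    """Map a key to a stable bucket so the same account always gets the same route."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def choose_route(account_id: str, workflow: str, flags: dict) -> str:
    """Return 'old' or 'new'. `flags` is assumed to come from a runtime config store."""
    if flags.get("force_old_route", False):
        return "old"  # rollback switch: overrides everything else
    if workflow not in flags.get("canary_workflows", ()):
        return "old"  # only explicitly listed low-risk workflows are eligible
    percent = flags.get("canary_percent", 0.0)
    # 10,000 buckets give 0.01% granularity: 5.0 percent means buckets 0..499.
    return "new" if bucket(f"{workflow}:{account_id}") < percent * 100 else "old"
```

Hashing the workflow together with the account keeps assignment stable across requests, which makes support traces and billing comparisons easier to read.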
Before the canary, decide whether the workload belongs on an interactive or async route.
| Workload | Recommended route | Reason |
|---|---|---|
| User waiting in chat, checkout, support, or an agent session | Synchronous endpoint | Batch APIs are built for asynchronous jobs; the main providers document completion windows or target turnaround in hours, not interactive response times.[1][2][3][5] |
| Nightly classification, evaluation runs, enrichment, and backfills | Batch endpoint | The job can wait, and the documented discounts can matter more than immediate response time. |
| Provisioned-capacity Bedrock workload | Check platform support first | Bedrock batch docs state that batch inference is not supported for provisioned models.[4] |
| Gemini batch job with a customer deadline | Batch only if deadline survives queue risk | Vertex AI Gemini docs describe queue time of up to 72 hours before expiration and note that batch inference is excluded from the SLO of any SLA.[3] |
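The routing table can be reduced to a few ordered checks. In the sketch below, `Workload` and its fields are hypothetical, and the worst-case window is passed in by the caller (for example 24 or 72 hours, per the provider rows above) rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    user_is_waiting: bool
    deadline_hours: float            # how long the customer promise allows
    bedrock_provisioned: bool = False


def recommend_route(w: Workload, worst_case_batch_hours: float) -> str:
    """Apply the table's rows in order; the first matching rule wins."""
    if w.user_is_waiting:
        return "synchronous"
    if w.bedrock_provisioned:
        return "check platform support"
    if w.deadline_hours < worst_case_batch_hours:
        return "synchronous"  # the deadline does not survive the batch window
    return "batch"
```

Using the documented worst case instead of typical turnaround keeps the customer promise safe when a queue is slow.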
The rollback plan should be a tested control, not a paragraph in a launch doc. Keep the old provider credentials, old prompt template, old parser, feature flag, and last-known-good model route available until the canary has survived support review and billing review.
A practical rollback test is simple: replay one recent customer issue through the old route, switch the flag in staging, confirm the same parser and support trace appear, and document who can execute the switch after hours.
7. Monitor After the Switch
The migration is not done when traffic moves. Watch the first production week for issues that offline tests miss, especially rare prompt shapes, throttling, hidden parser failures, and cost changes from retries.
- Latency: track p50, p95, and p99 by workflow, provider, model family, and route type.
- Reliability: track timeout rate, provider error rate, retry count, and fallback count.
- Cost: track cost per successful task, not only token spend.
- Token accounting: separate input, output, cached, batch, and tool-related usage fields.
- Schema health: count validation failures, parser retries, and downstream rejected records.
- Safety: count refusal rate, moderation blocks, escalation messages, and user appeals.
- Customer signal: sample support tickets, user corrections, thumbs-down feedback, and manual overrides that mention AI behavior.
- Batch health: track created, completed, failed, canceled, and expired jobs, plus missing output files.
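For the latency item, percentiles can be computed per group with the standard library. The sketch below groups by workflow and provider only; the sample tuple shape is an assumption, and a real dashboard would add model family and route type as further keys.

```python
from collections import defaultdict
from statistics import quantiles


def latency_percentiles(samples) -> dict:
    """samples: iterable of (workflow, provider, latency_ms) tuples."""
    grouped = defaultdict(list)
    for workflow, provider, latency_ms in samples:
        grouped[(workflow, provider)].append(latency_ms)

    report = {}
    for key, values in grouped.items():
        if len(values) < 2:
            continue  # too few points for a meaningful percentile
        # n=100 yields 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
        cuts = quantiles(values, n=100, method="inclusive")
        report[key] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return report
```

Reporting per group avoids a fast, high-volume workflow hiding a slow tail in a smaller one.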
Do not join batch results by row order. OpenAI Batch uses custom_id for matching, Anthropic Message Batch results state that result order is not guaranteed and should be matched by custom_id, and Amazon Bedrock batch input uses recordId.[1][22][23] Store those identifiers in your migration logs before traffic moves.
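A minimal sketch of an identifier-based join is shown below. It assumes each result line is a JSON object with the identifier at the top level, and it leaves the rest of the payload untouched because that shape differs by provider. The `id_field` parameter can be set to `recordId` for Bedrock-style records.

```python
import json


def index_by_id(jsonl_text: str, id_field: str = "custom_id") -> dict[str, dict]:
    """Parse JSONL output into a dict keyed by the record identifier."""
    index: dict[str, dict] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        record_id = record[id_field]
        if record_id in index:
            raise ValueError(f"duplicate {id_field}: {record_id}")
        index[record_id] = record
    return index


def join_results(inputs: dict[str, dict], results_jsonl: str,
                 id_field: str = "custom_id"):
    """Match results to inputs by identifier and report gaps on both sides."""
    results = index_by_id(results_jsonl, id_field)
    joined = {rid: (inputs[rid], results[rid]) for rid in inputs if rid in results}
    missing = sorted(set(inputs) - set(results))      # submitted, no result yet
    unexpected = sorted(set(results) - set(inputs))   # result with no known input
    return joined, missing, unexpected
```

The `missing` list is what feeds requeue logic, and the `unexpected` list is worth alerting on because it usually means two jobs shared an output location.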
The first-week exit criteria should be written before launch: no unresolved severity-one support tickets tied to the new route, no unexplained cost increase above the agreed threshold, no schema-health regression beyond the gate, and sampled customer-facing outputs signed off by product and support.
A Customer-Safe Vendor Change
Ship the vendor change only when three gates are true: the new route passes the dependency inventory, the customer outcome improves on your saved examples, and rollback works without a code deploy. If any gate fails, keep the old route for that workflow and narrow the migration to the tasks where the evidence is already clean.
FAQ
How much regression is acceptable in an AI vendor migration?
Use a written threshold by workflow. For structured outputs, a common gate is no more than 0.1 percentage points above the current parser-failure rate. For customer-facing copy, require blind-review ties or target wins on the highest-traffic examples and written approval for every new refusal pattern.
Can batch inference be used for customer-facing workflows?
Only when the product experience is already asynchronous and the customer promise allows it. A dashboard enrichment job can wait; a checkout assistant, live support agent, or in-session tool call usually cannot.
What evidence belongs in the rollback ticket?
Include the feature flag, old route, last-known-good model ID, prompt template, parser version, data-control settings, support trace example, billing comparison, and the person allowed to execute rollback. The ticket should prove customer behavior can be restored, not just that an API key can be switched.
Sources
1. OpenAI Batch API – https://platform.openai.com/docs/guides/batch
2. Anthropic Message Batches API – https://docs.anthropic.com/en/api/creating-message-batches
3. Google Vertex AI batch inference for Gemini – https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
4. Amazon Bedrock batch inference – https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html
5. Azure OpenAI Batch API – https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/batch
6. Anthropic batch pricing – https://docs.anthropic.com/en/docs/about-claude/pricing
7. Amazon Bedrock pricing – https://aws.amazon.com/bedrock/pricing/
8. OpenAI function calling – https://platform.openai.com/docs/guides/function-calling
9. Anthropic tool use – https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use
10. Azure OpenAI content filtering – https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter
11. LMArena leaderboard – https://lmarena.ai/leaderboard/
12. SWE-bench benchmark – https://www.swebench.com/SWE-bench/
13. MMLU benchmark paper – https://arxiv.org/abs/2009.03300
14. GPQA benchmark paper – https://arxiv.org/abs/2311.12022
15. HumanEval benchmark paper – https://arxiv.org/abs/2107.03374
16. OpenAI platform data controls – https://platform.openai.com/docs/models/how-we-use-your-data
17. Vertex AI data governance and zero data retention – https://cloud.google.com/vertex-ai/generative-ai/docs/data-governance
18. Amazon Bedrock data protection – https://docs.aws.amazon.com/bedrock/latest/userguide/data-protection.html
19. OpenAI Structured Outputs – https://platform.openai.com/docs/guides/structured-outputs
20. Anthropic prompt caching – https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
21. Vertex AI context caching – https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview
22. Anthropic Message Batch results – https://docs.anthropic.com/en/api/retrieving-message-batch-results
23. Amazon Bedrock batch input format – https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference-data.html