OpenAI Assistants, Responses API, and LangChain: Which Abstraction Should Developers Use?

Use the Responses API for new OpenAI work. Keep Assistants only for migration. Reach for LangChain or LangGraph when the product needs orchestration above one provider: multiple model routes, retrieval pipelines, durable state, human review, evaluations, or recovery after partial failure.

Last reviewed: 2026-04-23. Pricing, limits, deprecation dates, and model availability change frequently. Treat the numbers below as architecture inputs, then verify them on the provider pages before quoting them in a contract, RFP, or cost plan.

Before choosing the abstraction, write down the workload: target model tier, input and output token budget, context window, modality, whether the user is waiting, and whether the job can run overnight. A compact model comparison table is enough; the Deep Digital Ventures AI Models comparison tools can help when you need prices, context windows, and benchmark notes in one place.

Best Choice by Scenario

Scenario | Best first choice | Main tradeoff
New GPT-family assistant, extraction flow, or multimodal feature | Responses API | Least framework weight and fastest access to OpenAI tools; your app owns state, retries, traces, and cost controls.
Existing product already built on Assistants, Threads, Runs, or Run Steps | Assistants migration plan | Preserves user history and tool behavior while moving off a retiring surface; new work should not deepen the dependency.
Nightly evals, classification, extraction, embeddings, or backfills | Provider batch mode around the chosen API | Lower async cost and higher throughput; not appropriate for user-blocking actions or dependent chains.
Workflow with retrieval, human approval, durable state, or multiple model families | LangChain or LangGraph | Less custom orchestration code; more framework surface to test, observe, and operate.
Cloud placement, IAM, data residency, S3, BigQuery, or Azure quota is the constraint | Cloud-native model path | Fits the platform boundary; portability usually drops.

The short version is practical: choose the smallest layer that protects the product guarantee. A single request with a narrow output contract does not need a graph. A long-running review process with human approval usually does. Batch APIs, prompt caching, and benchmark routing are delivery and cost decisions around that architecture, not replacements for it.

The Current OpenAI Direction

OpenAI’s migration guidance says the Responses API is recommended for all new projects and describes it as the future direction for agentic applications on OpenAI.[1] The same guidance says OpenAI deprecated the Assistants API on August 26, 2025, with a sunset date of August 26, 2026.

That date matters for roadmaps. A team starting a new support copilot in Q2 2026 should not create new Assistants objects unless it is extending a legacy system that must be retired before August 26, 2026. A team already using Assistants, Threads, Runs, and Run Steps should treat the next sprint as migration design, not greenfield architecture.

Delivery and Limits That Actually Affect Architecture

Provider limits matter when they decide whether a workload is synchronous, asynchronous, or split across files. The rest belongs in implementation notes, not the main abstraction decision.

Provider path | Decision numbers worth keeping | Architecture consequence
OpenAI Batch | 50% discount, 24-hour completion window, up to 50,000 requests, up to 200 MB input file.[2] | Good fit for independent OpenAI jobs that can wait; still use Responses as the programming surface.
Anthropic Message Batches | 50% of standard API prices, 24-hour expiration window, up to 100,000 requests or 256 MB per batch, whichever is reached first.[3] | Useful for Claude workloads when delay is acceptable and request independence is clear.
Vertex AI Gemini batch inference | 50% batch discount, up to 200,000 requests, 1 GB Cloud Storage input file, queue can last up to 72 hours, not covered by the Vertex AI SLA's service level objective (SLO).[4] | Strong cloud batch lane when Google Cloud storage, quota, or data placement is already part of the system.
Azure OpenAI Batch | Separate batch quota, 24-hour target turnaround, 50% lower cost than Global Standard.[5] | Useful when Azure tenant controls, networking, or quota separation are the main constraints.
Amazon Bedrock batch inference | Asynchronous jobs read from Amazon S3 and write outputs to S3; batch inference is not supported for provisioned models.[6] | Best when the data plane is already S3 and the workload does not depend on provisioned throughput.
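A quick pre-flight check can turn the table into code before a job is scheduled. The sketch below is a minimal illustration, assuming the limits copied from the table as of the review date and conservatively treating 1 MB as 10^6 bytes. `BATCH_LIMITS` and `batch_blockers` are hypothetical names, not part of any provider SDK, and Azure and Bedrock are left out because the table gives no request or file-size numbers for them.

```python
from dataclasses import dataclass

MB = 1_000_000  # assumption: decimal megabytes, the more conservative reading


@dataclass(frozen=True)
class BatchLimits:
    max_requests: int
    max_input_bytes: int
    worst_case_hours: int  # completion/expiration window or maximum queue time


# Hypothetical limits copied from the table above (last reviewed 2026-04-23).
# Re-check each provider page before relying on these numbers.
BATCH_LIMITS = {
    "openai": BatchLimits(50_000, 200 * MB, 24),
    "anthropic": BatchLimits(100_000, 256 * MB, 24),
    "vertex_gemini": BatchLimits(200_000, 1_000 * MB, 72),
}


def batch_blockers(provider: str, n_requests: int, input_bytes: int,
                   deadline_hours: int, user_waiting: bool) -> list[str]:
    """Return reasons the job does not fit the batch lane; an empty list means it fits."""
    limits = BATCH_LIMITS[provider]
    blockers = []
    if user_waiting:
        blockers.append("a user is waiting: use a synchronous call")
    if n_requests > limits.max_requests:
        blockers.append(f"{n_requests} requests exceed {limits.max_requests}: split the job")
    if input_bytes > limits.max_input_bytes:
        blockers.append(f"input of {input_bytes} bytes exceeds {limits.max_input_bytes}: split the file")
    if deadline_hours < limits.worst_case_hours:
        blockers.append(
            f"deadline of {deadline_hours}h is shorter than the {limits.worst_case_hours}h window"
        )
    return blockers
```

For example, `batch_blockers("openai", 60_000, 50 * MB, 48, user_waiting=False)` returns a single blocker (the request count), which usually means splitting the work into two batches rather than abandoning the batch lane.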

Prompt caching is another delivery lever. Anthropic prices 5-minute cache writes at 1.25 times base input, 1-hour cache writes at 2 times base input, and cache hits at 0.1 times base input.[7] If a long system prompt or policy pack repeats across thousands of requests, caching can matter as much as batch. If every request is unique, it will not rescue a poor architecture.
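The break-even point is easy to check with the multipliers above. This sketch prices only the repeated prefix (system prompt or policy pack) and assumes every request after the first lands inside the cache lifetime and hits the cache. The $3 per million input tokens in the usage lines is a made-up base price for illustration, not a quoted rate.

```python
from typing import Optional

# Multipliers on the base input price, from the Anthropic caching figures above.
CACHE_MULTIPLIERS = {"write_5m": 1.25, "write_1h": 2.0, "hit": 0.1}


def prefix_cost(prefix_tokens: int, requests: int, base_price_per_mtok: float,
                cache: Optional[str] = None) -> float:
    """Dollar cost of sending the same prefix `requests` times.

    cache=None prices every request at the base input rate; "write_5m" or
    "write_1h" prices one cache write followed by (requests - 1) cache hits.
    """
    if requests <= 0:
        return 0.0
    per_request = prefix_tokens / 1_000_000 * base_price_per_mtok
    if cache is None:
        return requests * per_request
    write = CACHE_MULTIPLIERS[cache]
    return per_request * (write + CACHE_MULTIPLIERS["hit"] * (requests - 1))


uncached = prefix_cost(10_000, 1_000, 3.0)
cached = prefix_cost(10_000, 1_000, 3.0, cache="write_5m")
single_cached = prefix_cost(10_000, 1, 3.0, cache="write_5m")
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, one request cached ${single_cached:.4f}")
```

With a 10,000-token prefix sent 1,000 times, the cached prefix costs roughly a tenth of the uncached one (about $3.03 versus $30). With a single unique request, the cache write makes it more expensive ($0.0375 versus $0.03), which is the "will not rescue a poor architecture" case in numbers.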

Benchmark leaderboards should inform model routing only after the product shape is clear. Preference rankings help for user-facing writing quality; coding benchmarks help for repair agents; knowledge tests help for research tasks. None of them decides whether the control loop belongs in direct API calls, a migration layer, or a graph.

When to Use the Responses API Directly

Use Responses directly when the workflow is centered on OpenAI’s models and your application can own the control loop. Typical examples are a support assistant that calls your ticket API, a document extraction job that returns structured JSON, a code-review helper that needs file search, or a multimodal workflow that combines text and image inputs without routing through several model vendors.

When to use it

  • Use Responses for a synchronous user action when the user is waiting, such as "summarize this uploaded policy PDF" or "draft a reply to this support ticket."
  • Use OpenAI Batch around Responses for offline jobs that can complete within the documented 24-hour window, such as nightly evals and large classification runs. Embedding backfills fit the same Batch lane, but they target the embeddings endpoint rather than Responses.
  • Use OpenAI function calling when the model needs a small set of app-owned tools, such as `get_invoice_status`, `create_refund_case`, or `search_internal_docs`.[8]
  • Keep the first version direct when the workflow is one model call plus one or two deterministic tool calls; the extra graph layer needs to earn its place with state, branching, or recovery requirements.
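For the function-calling bullet, here is a minimal sketch of the app-owned side of the loop. The tool definition follows the flat Responses-style function tool shape (`type`, `name`, `description`, `parameters`), and the dispatcher assumes the model output contains `function_call` items with `call_id`, `name`, and JSON-string `arguments`, which the app answers with `function_call_output` items; confirm both shapes against the current function-calling reference.[8] The invoice lookup is a hypothetical stub, and the SDK call that would send `TOOLS` and the returned items back to the model is omitted so the example runs offline.

```python
import json

# Hypothetical app-owned tool; the name comes from the article, the schema is illustrative.
TOOLS = [
    {
        "type": "function",
        "name": "get_invoice_status",
        "description": "Look up the payment status of one invoice by ID.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
            "additionalProperties": False,
        },
        "strict": True,
    },
]


def get_invoice_status(invoice_id: str) -> dict:
    # Stub: a real implementation would call your billing service.
    fake_db = {"INV-1001": "paid", "INV-1002": "overdue"}
    return {"invoice_id": invoice_id, "status": fake_db.get(invoice_id, "not_found")}


# Explicit allow-list: the model can only trigger functions registered here.
HANDLERS = {"get_invoice_status": get_invoice_status}


def run_tool_calls(output_items: list[dict]) -> list[dict]:
    """Execute each function_call item and build the matching function_call_output item."""
    results = []
    for item in output_items:
        if item.get("type") != "function_call":
            continue  # skip messages, reasoning items, and anything else
        handler = HANDLERS.get(item["name"])
        if handler is None:
            payload = {"error": f"unknown tool {item['name']}"}
        else:
            payload = handler(**json.loads(item["arguments"]))
        results.append({
            "type": "function_call_output",
            "call_id": item["call_id"],
            "output": json.dumps(payload),
        })
    return results
```

Keeping handlers in a dictionary makes the allowed tool set explicit, and an unknown tool name becomes an error payload the model can see instead of an exception in the request path.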

When not to use it

  • Do not make direct calls the only abstraction if product requirements already include provider fallback, human approval, or resumable multi-step state.
  • Do not hide a large workflow inside one controller method just because the first prototype worked; retry behavior and partial failure will become hard to reason about.
  • Do not use synchronous Responses for jobs where nobody is waiting and the provider’s batch lane fits the delay window.

Main tradeoff

The tradeoff is control versus scaffolding. Direct Responses calls reduce framework lock-in and keep the product code close to the provider surface, but your team must build the boring parts well: retries, logging, idempotency, eval fixtures, rate-limit handling, and cost alerts.
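Two of those boring parts, retries and idempotency, fit in a few lines. This is a sketch under stated assumptions: `TransientError` is a hypothetical stand-in for whatever rate-limit or server-error exceptions your provider SDK raises, and the idempotency key is meant for your own side-effecting tools (refund cases, CRM writes). Whether a given provider endpoint accepts such a key is something to check separately.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical stand-in for 429 and 5xx errors raised by a provider SDK."""


def call_with_retries(fn, *, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(idempotency_key), retrying transient failures with jittered exponential backoff.

    The same key is passed on every attempt, so a downstream write that
    already succeeded can recognize the retry instead of applying it twice.
    """
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(idempotency_key)
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so tests run without real delays, and non-transient errors (bad schema, auth failures) propagate immediately instead of being retried.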

The practical rule is simple: if the user is waiting, design for synchronous latency and clear retry boundaries; if nobody is waiting and each request is independent, price the same prompt through batch and caching. Then compare candidate models in the Deep Digital Ventures AI Models pricing and context window view before deciding which route gets the first production test.

When Assistants Still Matters

Assistants matters because production apps still contain Assistants objects, Threads, Runs, Run Steps, tool resources, vector stores, and user-facing history tied to those objects. OpenAI’s migration guidance does not describe Assistants as a new-project target; it describes how to move older concepts into the Responses model before the August 26, 2026 shutdown.[1]

When to use it

  • Use Assistants only to keep a legacy product stable while the migration is being designed and shipped.
  • Inventory Assistants by product surface, model, instructions, tools, stored files, and active users; migrate the highest-traffic assistant first.
  • Move new conversations first, then backfill old Threads only when the user returns, an audit requires it, or the business needs the history in the new system.
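The inventory and backfill bullets can be expressed as two small functions. The record fields below are hypothetical bookkeeping for your own migration tracker, not fields of OpenAI's Assistant or Thread objects, and weekly active users is one assumed proxy for traffic.

```python
from dataclasses import dataclass, field


@dataclass
class AssistantRecord:
    assistant_id: str
    product_surface: str
    model: str
    tools: list[str] = field(default_factory=list)
    stored_files: int = 0
    weekly_active_users: int = 0  # assumed traffic proxy


@dataclass
class ThreadRecord:
    thread_id: str
    user_returned: bool = False
    audit_required: bool = False
    business_needs_history: bool = False


def migration_order(inventory: list[AssistantRecord]) -> list[AssistantRecord]:
    """Highest-traffic assistant first."""
    return sorted(inventory, key=lambda a: a.weekly_active_users, reverse=True)


def backfill_queue(threads: list[ThreadRecord]) -> list[ThreadRecord]:
    """Backfill a Thread only when one of the three triggers applies; skip the rest."""
    return [t for t in threads
            if t.user_returned or t.audit_required or t.business_needs_history]
```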

When not to use it

  • Do not start a new application on Assistants to save a week of design work.
  • Do not postpone migration until the shutdown quarter; stored conversations, file search behavior, and tool-call audit trails are the parts that take time.
  • Do not backfill every historical Thread by default if retention policy, support risk, and active usage do not justify it.

A real migration scenario: a support copilot has one Assistant per product line, Threads linked to ticket IDs, file search over policy PDFs, and an admin dashboard that shows Run Steps for audit. I would not big-bang that. First recreate each Assistant as a versioned prompt configuration, put Conversations and Responses behind a route flag for new tickets, keep old Threads read-only for returned tickets, and write golden tests for ticket lookup, file search, structured escalation output, and cost. Only after the new path matches behavior should you backfill history that support or compliance actually needs.
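The route flag in that scenario can be a pure function of the ticket. This sketch assumes a percentage rollout keyed on a stable hash of the ticket ID (Python's built-in `hash()` is salted per process, so `hashlib` is used instead). It also reads "keep old Threads read-only" as: returned legacy tickets show their old Thread history, but new turns run on Responses. The `Route` names are hypothetical.

```python
import hashlib
from enum import Enum


class Route(Enum):
    ASSISTANTS = "assistants"  # legacy path, unchanged
    RESPONSES = "responses"  # new ticket on Conversations + Responses
    RESPONSES_WITH_READONLY_THREAD = "responses_readonly_thread"  # old Thread shown, not extended


def in_rollout(ticket_id: str, rollout_percent: int) -> bool:
    """Deterministically place a ticket in one of 100 buckets."""
    bucket = int(hashlib.sha256(ticket_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < rollout_percent


def route_ticket(ticket_id: str, has_legacy_thread: bool, rollout_percent: int) -> Route:
    if not in_rollout(ticket_id, rollout_percent):
        return Route.ASSISTANTS
    if has_legacy_thread:
        return Route.RESPONSES_WITH_READONLY_THREAD
    return Route.RESPONSES
```

Raising `rollout_percent` from 5 to 100 moves tickets over without sending already-migrated ones back, because a ticket's bucket never changes.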

Main tradeoff

The tradeoff is continuity versus calendar risk. Keeping the old surface alive reduces user disruption during the transition, but every new feature built there creates another object, test path, and support habit that must be unwound before August 26, 2026.

When LangChain or LangGraph Makes Sense

LangChain and LangGraph are not replacements for OpenAI, Anthropic, Google, Azure OpenAI, or Bedrock model APIs. They are application frameworks around those APIs. They start to make sense when the product has multiple model families, retrieval steps, custom tools, human approvals, durable state, or evaluation hooks that would otherwise become a private framework inside your codebase.

When to use it

  • Use LangGraph when a workflow has branches that must be resumed, such as "extract facts, search policy, ask human reviewer, then draft final answer." LangGraph describes itself as a low-level framework for long-running, stateful agents with durable execution and human-in-the-loop support.[9]
  • Use LangChain when provider adapters, loaders, retrievers, rerankers, model wrappers, and LangSmith traces remove repeated glue code across teams.
  • Use a framework when the workflow has enough shared conventions that not using one would create a private orchestration framework anyway.

When not to use it

  • Do not add LangGraph to a support chatbot that calls one model and one internal search endpoint.
  • Do not use a graph to compensate for an unclear product contract; state machines make bad requirements more durable.
  • Do not ignore cloud-native batch jobs when IAM, data residency, S3, BigQuery, or Azure quota is the real constraint.

When this goes wrong, the failure mode is usually not model quality. A team wraps a two-call classifier in a graph, stores partial state, and then retries a node that already wrote to the CRM. The graph resumes, but the side effect repeats and the customer record gets duplicate tags. Durable execution only helps if the checkpointer, thread identifier, and idempotent tool writes are designed together.[10]
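A minimal sketch of the idempotent write that failure story is missing. `FakeCRM` is a hypothetical in-memory stand-in; in production the key check and the write would need to happen in one transaction, or use the CRM's own idempotency or upsert feature if it has one. The key is derived from the workflow thread identifier (the same one the checkpointer uses) and the node name rather than generated randomly, so a node that is retried or resumed after a restart produces the same key.

```python
import hashlib


class FakeCRM:
    """In-memory stand-in for a CRM that remembers which write keys it has applied."""

    def __init__(self) -> None:
        self.tags: dict[str, list[str]] = {}
        self.applied_keys: set[str] = set()

    def add_tag(self, customer_id: str, tag: str, idempotency_key: str) -> bool:
        """Apply the tag once per key; return False when the key was already used."""
        if idempotency_key in self.applied_keys:
            return False
        self.tags.setdefault(customer_id, []).append(tag)
        self.applied_keys.add(idempotency_key)
        return True


def write_key(thread_id: str, node_name: str, customer_id: str, tag: str) -> str:
    """Stable key: the same workflow step always yields the same value."""
    raw = f"{thread_id}:{node_name}:{customer_id}:{tag}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


crm = FakeCRM()
key = write_key("thread-42", "tag_customer", "cust-7", "churn-risk")
crm.add_tag("cust-7", "churn-risk", key)  # first run of the node
crm.add_tag("cust-7", "churn-risk", key)  # resumed run replays the node
print(crm.tags)  # {'cust-7': ['churn-risk']}
```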

Tool calling is a good dividing line. OpenAI function calling connects models to app-defined functions by schema.[8] Anthropic’s tool-use docs distinguish client tools, where your app executes a tool request, from server tools, where Claude executes the tool.[11] LangGraph becomes attractive when those calls form a stateful workflow with retries, review gates, or model fallback.

Main tradeoff

The tradeoff is orchestration speed versus operational surface. LangChain and LangGraph can prevent teams from rebuilding adapters, state machines, and review gates. They also add versioning, tracing, dependency, and debugging questions that a simple direct API path would not have.

A Simple Decision Rule

Use this rule tomorrow: if the product is centered on OpenAI and the user waits for the answer, start with Responses; if the same work is offline and independent, evaluate the provider’s batch lane around that API; if the workload already runs on Assistants, migrate it; if the workflow needs cross-model orchestration, long-running state, human review, or retrieval pipelines, evaluate LangChain or LangGraph around the model APIs.
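The rule can also be written as a function, which makes its branches reviewable in an architecture ticket. This is one possible encoding under assumptions: the boolean fields are hypothetical, the clauses are treated as additive rather than exclusive, and orchestration needs take precedence over the "start with Responses" default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    openai_centered: bool
    user_waiting: bool
    requests_independent: bool
    on_assistants_today: bool
    needs_cross_model: bool = False
    needs_long_running_state: bool = False
    needs_human_review: bool = False
    needs_retrieval_pipeline: bool = False


def recommend(w: Workload) -> list[str]:
    """Apply the decision rule; several clauses can fire for one workload."""
    steps = []
    if w.on_assistants_today:
        steps.append("migrate from Assistants to Responses")
    needs_orchestration = (w.needs_cross_model or w.needs_long_running_state
                           or w.needs_human_review or w.needs_retrieval_pipeline)
    if needs_orchestration:
        steps.append("evaluate LangChain or LangGraph around the model APIs")
    elif w.openai_centered:
        steps.append("start with Responses")
    if not w.user_waiting and w.requests_independent:
        steps.append("evaluate the provider batch lane")
    return steps


nightly = Workload(openai_centered=True, user_waiting=False,
                   requests_independent=True, on_assistants_today=False)
print(recommend(nightly))  # ['start with Responses', 'evaluate the provider batch lane']
```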

Worked example: a nightly product-review classification job should not start with LangGraph just because it has many rows. Step 1: decide whether a user is waiting; for nightly classification, no. Step 2: check request count, file size, and completion window against the batch table. Step 3: compare candidate models for token price, context window, modality, and task fit. Step 4: run a small synchronous sample first so you can test schema quality, then move the full offline job to the provider’s batch path if the delay fits the business process.
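Step 4 is where a small gate helps. The sketch below assumes a hypothetical output contract (a JSON object with a `label` from a fixed set and a numeric `confidence`) and an illustrative 98% pass-rate threshold before the full job moves to batch; the right threshold depends on what a bad label costs your business process.

```python
import json
import random

ALLOWED_LABELS = {"positive", "negative", "neutral", "needs_review"}  # hypothetical label set


def is_valid_output(raw: str) -> bool:
    """True if raw is a JSON object with an allowed label and a numeric confidence."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    label = data.get("label")
    confidence = data.get("confidence")
    return (isinstance(label, str) and label in ALLOWED_LABELS
            and isinstance(confidence, (int, float))
            and not isinstance(confidence, bool))


def draw_sample(rows: list, k: int, seed: int = 0) -> list:
    """Reproducible random sample of rows for the synchronous dry run."""
    return random.Random(seed).sample(rows, min(k, len(rows)))


def ready_for_batch(sample_outputs: list[str], min_pass_rate: float = 0.98) -> bool:
    """Gate the full batch job on the schema pass rate of the sample."""
    if not sample_outputs:
        return False
    passed = sum(is_valid_output(raw) for raw in sample_outputs)
    return passed / len(sample_outputs) >= min_pass_rate
```

In practice: draw a few hundred rows with `draw_sample`, send them through the synchronous endpoint, and submit the full file to the batch lane only when `ready_for_batch` passes.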

The same job changes architecture if it becomes user-facing. A checkout fraud review, live support triage, or admin console action should use a synchronous endpoint because the user or transaction is waiting. A weekly eval run, policy backfill, review-labeling job, or embeddings refresh can use async processing when delayed completion is acceptable and provider limits fit the input.

Do not let a leaderboard pick the abstraction. Benchmark families are evidence for model routing only after the product shape is known; then test the exact prompt, tool schema, latency target, and cost envelope your product will run in production.

FAQ

Should developers still start new projects on the Assistants API?

No. Use Assistants only when an existing product already depends on it and the immediate work is migration or continuity. New feature work should target Responses; the only exception is a legacy constraint that goes away once the migration ships.

How much work is an Assistants-to-Responses migration?

The model call is rarely the hard part. The work is usually prompt versioning, Thread history policy, file search behavior, tool-call compatibility, audit records, and test fixtures. A small app can move quickly; a support or compliance workflow with stored history needs staged routing and backfill rules.

Does LangChain reduce lock-in?

It can reduce provider-specific glue code, especially for retrieval and model adapters, but it does not make prompts, tool schemas, evals, latency, or cost behavior portable for free. The lock-in moves from one API surface to a framework plus the conventions your team builds around it.

How should teams test and observe these systems?

Keep golden prompts, structured-output fixtures, tool-call contract tests, latency budgets, and cost alerts close to the application code. For graph workflows, also test resume behavior, duplicate side effects, human approval branches, and version upgrades.
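Cost alerts are simple arithmetic once token counts are logged per request. This sketch uses made-up per-million-token prices and budgets, and applies a flat 50% batch discount that matches the batch rows in the delivery table but should be confirmed per provider and model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    input_per_mtok: float   # dollars per million input tokens (illustrative)
    output_per_mtok: float  # dollars per million output tokens (illustrative)


def request_cost(input_tokens: int, output_tokens: int, prices: Prices, batch: bool = False) -> float:
    cost = (input_tokens * prices.input_per_mtok + output_tokens * prices.output_per_mtok) / 1_000_000
    return cost * 0.5 if batch else cost  # assumed flat 50% batch discount


def cost_alerts(usage: list[tuple[int, int]], prices: Prices,
                per_request_budget: float, daily_budget: float) -> list[str]:
    """Return alert messages for synchronous requests or daily totals that exceed budget."""
    alerts = []
    total = 0.0
    for i, (tokens_in, tokens_out) in enumerate(usage):
        cost = request_cost(tokens_in, tokens_out, prices)
        total += cost
        if cost > per_request_budget:
            alerts.append(f"request {i} cost ${cost:.4f} > ${per_request_budget:.4f}")
    if total > daily_budget:
        alerts.append(f"daily total ${total:.4f} > ${daily_budget:.4f}")
    return alerts
```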

Does batch processing change the architecture?

Usually no. Batch changes delivery mode, cost, quota behavior, and completion timing. It should not turn a one-step classifier into a graph, and it should not make a user-facing workflow wait for an overnight queue.

What should be in the first architecture ticket?

Write the user wait state, model candidates, token budget, tool list, state owner, retry policy, eval fixtures, batch eligibility, and rollback plan. That ticket will tell you whether Responses, a migration layer, or LangChain/LangGraph is the right starting point.
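The ticket fields translate directly into a template that can be checked before review. This is a hypothetical structure, not a standard format: `None` or an empty string means "not decided yet", while an explicit empty list (for example, no tools) counts as an answer.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ArchitectureTicket:
    """Hypothetical template for the first architecture ticket."""
    user_wait_state: str = ""                 # e.g. "user waiting" or "overnight is fine"
    model_candidates: Optional[list[str]] = None
    token_budget: str = ""                    # input and output tokens per request
    tool_list: Optional[list[str]] = None     # [] is a valid answer: no tools
    state_owner: str = ""                     # e.g. app database, Conversations, or a checkpointer
    retry_policy: str = ""
    eval_fixtures: Optional[list[str]] = None
    batch_eligibility: str = ""
    rollback_plan: str = ""

    def missing(self) -> list[str]:
        """Names of fields that are still undecided (None or empty string)."""
        return [f.name for f in fields(self) if getattr(self, f.name) in (None, "")]
```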

Sources

  [1] OpenAI Responses migration guidance and Assistants deprecation timeline: https://platform.openai.com/docs/guides/migrate-to-responses
  [2] OpenAI Batch API limits and discount: https://platform.openai.com/docs/guides/batch
  [3] Anthropic Message Batches pricing and limits: https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
  [4] Google Vertex AI Gemini batch inference limits: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
  [5] Microsoft Azure OpenAI Batch behavior and quota: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/batch
  [6] Amazon Bedrock batch inference workflow: https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html
  [7] Anthropic prompt caching pricing behavior: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
  [8] OpenAI function calling with Responses: https://platform.openai.com/docs/guides/function-calling?api-mode=responses
  [9] LangGraph overview and long-running agent positioning: https://docs.langchain.com/oss/python/langgraph/overview
  [10] LangGraph durable execution requirements: https://docs.langchain.com/oss/python/langgraph/durable-execution
  [11] Anthropic tool-use concepts for client and server tools: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use