AI memory is information an AI product deliberately keeps and reuses after the immediate turn, such as a preference, a project rule, a retrieval pointer, or a short-lived state value. For teams building AI products, the goal is not simply a more personalized assistant. The real decision is what should be remembered, who owns it, where it may be used, and when it must expire.
As of 2026-04-23, the pricing, limits, and behaviors below are summarized from the provider docs listed in Sources. Provider pricing and model availability change frequently, so verify those pages before quoting any figure in a contract, RFP, or cost plan. Before you store anything, check five things: source, scope, owner, expiry, and a user-visible edit or delete path.
In this article you’ll learn:
- What AI memory means in an API product, not only in a chat UI.
- How stored preferences, project context, session state, and tool memory differ.
- What should stay session-only because the risk outlives the benefit.
- How to route batch work from a fixed memory snapshot without leaking scope.
What is AI memory?
AI memory features make assistants more useful because the system does not have to start from zero every time. In an API product, that may mean remembering that a user wants CSV output, loading a project glossary before every coding review, passing a prior function result back into the next model call, or keeping a short conversation state inside the current session. The product risk is treating those four examples as if they shared one lifetime, scope, and deletion rule. They should not.
The source material behind this explainer is intentionally narrow: provider docs for response generation[1], batch processing[2][3], prompt caching[4], and tool use[8][9], plus public benchmark references where model choice comes up. Exact limits belong in those docs; this article focuses on the memory design rules they imply.
Stored preferences vs project context
Stored preferences are broad and reusable. A good stored preference says, "default to terse Markdown tables for model comparisons." It does not say, "remember this customer’s payment issue forever." Project context is narrower. It may include a repository name, a glossary, a product decision log, a retrieval collection, or a rule such as "use the API provider’s documented model ID, not a marketing name, in config files." Conversation context is temporary and should expire with the session. Tool memory belongs in the system of record, not in the model’s long-term memory.
- Preference memory: store output style, unit preferences, or evaluation format when the user expects the same behavior across future chats.
- Project memory: store project-specific terms, approved constraints, and source pointers for one workspace or repository.
- Session memory: keep the last error, command output, or retry state only while the current task is active.
- Tool memory: keep account status, billing tier, ticket state, or inventory in your database, then expose it through OpenAI function calling[8] or Anthropic tool use[9] when a request needs it.
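The four categories above can be sketched as a small routing table. This is only an illustration: the item kinds and storage target names are hypothetical, and a real product would classify items with its own rules.

```python
from enum import Enum


class MemoryType(Enum):
    PREFERENCE = "preference"  # reusable across future chats
    PROJECT = "project"        # one workspace or repository
    SESSION = "session"        # current task only
    TOOL = "tool"              # lives in the system of record


# Hypothetical storage targets, one per memory type.
STORAGE_TARGET = {
    MemoryType.PREFERENCE: "user_preference_store",
    MemoryType.PROJECT: "project_memory_store",
    MemoryType.SESSION: "in_session_context",
    MemoryType.TOOL: "source_database_via_tool_call",
}

# Illustrative item kinds, taken from the examples in the list above.
KIND_TO_TYPE = {
    "output_style": MemoryType.PREFERENCE,
    "unit_preference": MemoryType.PREFERENCE,
    "glossary_term": MemoryType.PROJECT,
    "approved_constraint": MemoryType.PROJECT,
    "last_error": MemoryType.SESSION,
    "retry_state": MemoryType.SESSION,
    "billing_tier": MemoryType.TOOL,
    "ticket_state": MemoryType.TOOL,
}


def classify(item_kind: str) -> MemoryType:
    """Map an item kind to a memory type; unknown kinds get the shortest lifetime."""
    return KIND_TO_TYPE.get(item_kind, MemoryType.SESSION)


def storage_for(item_kind: str) -> str:
    """Return the storage target for an item kind."""
    return STORAGE_TARGET[classify(item_kind)]
```

Defaulting unknown kinds to session memory is a design choice in this sketch: an unclassified item expires with the task instead of drifting into long-term storage.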
Concrete example: ticket labeling with memory
Suppose a support team has 18,000 noninteractive ticket-labeling prompts that must use project memory but do not need a live answer. The model should see the project taxonomy, redacted ticket text, allowed output schema, and a memory snapshot ID. It should not see a user’s full account history, a live billing lookup, or a mutable preference store that can change halfway through the job.
- Normalize the prompt into five fields: preference, project_context_id, retrieval_collection_id, tool_call_allowed, and retention_class.
- Remove user-specific details that are not needed for the label. Keep the project taxonomy and the allowed output schema.
- Snapshot project memory before the batch file is created, then write the memory_snapshot_id into each request.
- Route only independent, noninteractive requests through batch. Keep live authorization, streaming answers, and state-changing tool actions synchronous.
- After completion, compare a sample of outputs against known labels and a synchronous run on the candidate model. If failures cluster around stale project assumptions, fix the memory store before switching models.
The before-and-after change is small but important. Before routing, the model saw one blended prompt with preference, project rules, source text, and tool state mixed together. After routing, the model sees only the project taxonomy, redacted ticket text, output schema, and memory snapshot ID. The execution path can move to batch, but the privacy boundary stays with the project.
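A minimal sketch of the normalization, redaction, and snapshot steps might look like the following. The JSON shape is hypothetical and is not any provider's batch input format; it would still need to be wrapped in the request structure described in the provider docs. The email-only redaction is a placeholder for a real redaction step.

```python
import json
import re
from dataclasses import asdict, dataclass

# Placeholder redaction pattern: masks email addresses only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass(frozen=True)
class MemoryFields:
    """The five normalized fields named in the steps above."""
    preference: str
    project_context_id: str
    retrieval_collection_id: str
    tool_call_allowed: bool
    retention_class: str


def redact(text: str) -> str:
    """Remove user-specific details that the label does not need (emails only here)."""
    return EMAIL.sub("[email]", text)


def build_batch_line(ticket_id, ticket_text, fields, snapshot_id, taxonomy, schema):
    """Return one JSON line for a noninteractive labeling request."""
    if fields.tool_call_allowed:
        # Requests that may trigger tool actions stay on the synchronous path.
        raise ValueError("requests that may call tools stay synchronous")
    return json.dumps({
        "custom_id": ticket_id,
        "memory_snapshot_id": snapshot_id,  # frozen before the file is created
        "memory": asdict(fields),
        "taxonomy": taxonomy,
        "output_schema": schema,
        "ticket_text": redact(ticket_text),
    })
```

Because every line carries the same `memory_snapshot_id`, a later audit can tell which version of project memory produced each label.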
What should stay session-only?
A useful design rule is this: if a remembered item cannot be assigned one owner, one scope, one source, one expiry rule, and one user-visible edit path, keep it out of long-term memory. Put it in session context or retrieve it from a source at request time.
| Memory type | Scope | Owner | Expiry | Deletion path |
|---|---|---|---|---|
| Stored preference | Account or user | User | Until changed or idle-retention policy | Memory settings page |
| Project context | Workspace, repository, or client project | Project owner | When archived or superseded | Project memory review |
| Session state | Current task only | System | End of session or retry budget | Automatic expiration |
| Tool memory | Source application or database | Application owner | Source retention policy | Delete or update in the source app |
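The design rule can be expressed as a small gate that runs before anything is written to long-term memory. This is a sketch under the assumption that each check is a simple non-empty field; the class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """An item being considered for long-term memory."""
    text: str
    source: Optional[str] = None
    scope: Optional[str] = None
    owner: Optional[str] = None
    expiry: Optional[str] = None
    edit_path: Optional[str] = None


# The five checks from the design rule.
CHECKS = ("source", "scope", "owner", "expiry", "edit_path")


def missing_checks(candidate: Candidate) -> list[str]:
    """List the checks that the candidate cannot satisfy."""
    return [name for name in CHECKS if not getattr(candidate, name)]


def placement(candidate: Candidate) -> str:
    """Allow long-term storage only when all five checks pass."""
    return "long_term" if not missing_checks(candidate) else "session_only"
```

Returning the list of missing checks, not just a boolean, gives a reviewer something concrete to fix before the item is reconsidered.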
Memory should be visible and editable
If an AI system uses memory, users should be able to inspect and correct it. A memory record should show at least five fields: the remembered text, the source event, the scope, the creation date, and the last-used date. For team products, add owner and retention class. Without those fields, a stale preference can quietly change future answers, and a project assumption can outlive the decision that created it.
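One way to make those fields concrete is a record type plus a function that renders what the user sees. The field names and display labels below are assumptions for illustration, not a schema from any provider.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class MemoryRecord:
    text: str                      # the remembered text
    source_event: str              # where the memory came from
    scope: str                     # where it may be used
    created_on: date
    last_used_on: Optional[date]   # None if never used
    owner: Optional[str] = None            # team products only
    retention_class: Optional[str] = None  # team products only


def user_view(record: MemoryRecord) -> dict:
    """Build the inspectable view of a memory record."""
    view = {
        "Remembered": record.text,
        "Learned from": record.source_event,
        "Applies to": record.scope,
        "Created": record.created_on.isoformat(),
        "Last used": record.last_used_on.isoformat() if record.last_used_on else "never",
    }
    # Team fields are shown only when they are set.
    if record.owner:
        view["Owner"] = record.owner
    if record.retention_class:
        view["Retention"] = record.retention_class
    return view
```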
Provider data controls are separate from your product memory controls. OpenAI’s data controls page distinguishes abuse monitoring logs from application state, says API data is not used to train models unless the customer opts in, and documents default abuse-monitoring retention of up to 30 days for many API features.[10] That does not remove your obligation to show, edit, and delete memory inside your own product.
For API teams, the UI should not be the only control plane. Add memory review to the same operational path as prompt versions, model routing, and eval sets. A practical review screen has three filters: "used in the last 30 days," "contains user-supplied personal data," and "shared outside the originating project." The first filter catches stale assumptions. The second catches privacy risk. The third catches scope drift.
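The three filters could be computed in one pass over the memory store. The record shape here is hypothetical, and the personal-data flag is assumed to be set by an upstream classifier or by manual tagging.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ReviewItem:
    memory_id: str
    last_used_on: date
    has_personal_data: bool   # assumed to be set upstream
    origin_project: str
    shared_with: frozenset    # project IDs that can read this memory


def review_filters(items, today: date) -> dict:
    """Return memory IDs matching each of the three review filters."""
    cutoff = today - timedelta(days=30)
    return {
        "used_in_last_30_days": [
            i.memory_id for i in items if i.last_used_on >= cutoff
        ],
        "contains_personal_data": [
            i.memory_id for i in items if i.has_personal_data
        ],
        # Any reader other than the originating project counts as scope drift.
        "shared_outside_origin": [
            i.memory_id for i in items if i.shared_with - {i.origin_project}
        ],
    }
```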
How to set data boundaries
Memory should respect account, workspace, team, project, and session boundaries. A tone preference can often be account-wide. A client strategy, health note, payment dispute, unreleased roadmap item, or incident timeline usually cannot. If the same memory store serves several workspaces, the retrieval layer must enforce scope before the model sees the record.
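A minimal sketch of scope enforcement in the retrieval layer, assuming each record carries exactly one scope level and ID. The types are illustrative; the point is that filtering happens before any text reaches the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    level: str  # "account", "workspace", "team", "project", or "session"
    id: str


@dataclass(frozen=True)
class RequestContext:
    account_id: str
    workspace_id: str
    team_id: str
    project_id: str
    session_id: str


def visible(scope: Scope, ctx: RequestContext) -> bool:
    """A record is visible only if its scope ID matches the request at that level."""
    allowed = {
        "account": ctx.account_id,
        "workspace": ctx.workspace_id,
        "team": ctx.team_id,
        "project": ctx.project_id,
        "session": ctx.session_id,
    }
    # Unknown levels never match, so they fail closed.
    return allowed.get(scope.level) == scope.id


def retrieve(records, ctx: RequestContext) -> list:
    """Filter (Scope, text) pairs before anything is placed in the prompt."""
    return [text for scope, text in records if visible(scope, ctx)]
```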
Provider boundaries matter when memory leaves the live chat path. Anthropic’s Message Batches documentation says batches are scoped to a Workspace.[3] Amazon Bedrock batch inference runs through Amazon S3 and is constrained by Bedrock model support.[6] Google Vertex AI says Gemini batch inference through the global endpoint does not support data residency requirements, so regional endpoints are the safer default when residency is part of the contract.[5] Azure OpenAI batch support and quota choices depend on region, subscription, model, and deployment type.[7]
Batch jobs should use fixed memory snapshots
Batch is useful only when memory can be frozen before the input file is created. It is a poor fit for memory that depends on fresh account state, live authorization, or tool calls that require user confirmation. The operational question is not only cost; it is whether every request can safely carry the same versioned context boundary until the job completes.
| Provider path | Memory boundary to check | Operational note |
|---|---|---|
| OpenAI Batch API[2] | Attach a fixed memory snapshot, not live session state. | Use synchronous routing when the user is waiting in the UI or a tool action needs confirmation. |
| Anthropic Message Batches API[3] | Keep project context inside the Workspace boundary and snapshot it into each request. | Avoid jobs that depend on fresh tool state fetched during the interaction. |
| Vertex AI Gemini batch inference[5] | Choose regional endpoints when data residency is contractual. | Queued batch work can violate assumptions if memory changes before completion. |
| Amazon Bedrock batch inference[6] | Keep S3, IAM, and Bedrock model support aligned with the memory scope. | Do not assume provisioned model paths also support batch inference. |
| Azure OpenAI global batch[7] | Check deployment, region, quota class, and model availability before routing memory-heavy work. | Separate batch quota helps only when the target model and region support the workload. |
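Whatever the provider, the routing decision can be sketched as a list of blockers that force a request onto the synchronous path. The flags below are hypothetical and would come from your own request metadata; provider-specific limits are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    independent: bool                  # no dependency on other requests
    user_waiting: bool                 # someone is waiting in the UI
    needs_fresh_account_state: bool
    needs_live_authorization: bool
    needs_tool_confirmation: bool
    memory_snapshot_id: Optional[str]  # None means memory is not frozen


def blockers(job: Job) -> list[str]:
    """Return the reasons a job cannot run from a fixed memory snapshot."""
    reasons = []
    if not job.independent:
        reasons.append("depends on another request")
    if job.user_waiting:
        reasons.append("user is waiting for the answer")
    if job.needs_fresh_account_state:
        reasons.append("needs fresh account state")
    if job.needs_live_authorization:
        reasons.append("needs live authorization")
    if job.needs_tool_confirmation:
        reasons.append("tool call needs user confirmation")
    if not job.memory_snapshot_id:
        reasons.append("no frozen memory snapshot")
    return reasons


def route(job: Job) -> str:
    """Choose batch only when nothing blocks it."""
    return "batch" if not blockers(job) else "synchronous"
```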
Where model choice still matters
Memory is not only storage. The model has to decide when remembered information is relevant, when the current instruction overrides it, and when the memory is too weak to use. A cheap model can be the right choice for nightly classification if the memory is simple and the output schema is strict. A stronger reasoning or coding model may be worth the cost when project context includes conflicting constraints, source citations, or tool calls with side effects.
Benchmark snapshot date: 2026-04-23. Public benchmark pages such as LMArena, SWE-bench, HumanEval, GPQA, and MMLU can help shortlist models, but they are weak evidence for memory governance.[11][12][13][14][15] A model can rank well and still overuse stale context, ignore a project boundary, or treat a stored preference as stronger than the current instruction.
If you need pricing, context-window, modality, and public benchmark fields after designing the memory rules, AI Models can help shortlist candidates. Treat that comparison as a routing aid; the provider docs remain the contract check.
FAQ
Is memory the same as prompt caching?
No. Memory is a product decision about what to retain and reuse. Prompt caching is an execution feature for repeated prompt prefixes. Anthropic’s prompt caching docs describe short cache lifetimes and minimum cacheable prompt lengths by model family.[4] That is useful for repeated long context, but it is not a permission system.
Should stored preferences override the current user instruction?
No. Current user instructions should win unless they conflict with system policy or a project rule the user cannot change. A stored preference like "answer in bullets" should not override "give me the exact JSON object only" in the current request.
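That precedence can be sketched as an ordered merge, assuming settings are flat key-value pairs. The setting names are illustrative.

```python
def resolve(current: dict, stored: dict, locked: dict) -> dict:
    """Merge settings so that locked rules beat the current instruction,
    and the current instruction beats stored preferences."""
    merged = dict(stored)   # lowest priority: stored preferences
    merged.update(current)  # the current request overrides them
    merged.update(locked)   # system policy and locked project rules win
    return merged


# Example: a stored "bullets" preference yields to a JSON-only request,
# while a project rule the user cannot change stays in force.
settings = resolve(
    current={"format": "json_only", "model_id_style": "marketing_name"},
    stored={"format": "bullets", "units": "metric"},
    locked={"model_id_style": "documented_id"},
)
```

In this sketch `settings["format"]` ends up as `"json_only"`, `settings["units"]` remains `"metric"`, and `settings["model_id_style"]` stays `"documented_id"`.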
Can batch jobs use memory safely?
Yes, if the memory is snapshotted before the input file is created and each request carries the right project scope. Keep anything that depends on fresh account state, live authorization, or user-confirmed tool calls on the synchronous path.
What should be deleted first?
Delete raw personal details first when they are not needed for future behavior. Keep a narrow derived preference only when the user can see it, edit it, and understand where it will apply.
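A small sketch of that deletion order, assuming each remembered item stores the raw detail and the derived preference separately. All field names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Remembered:
    raw_detail: Optional[str]          # e.g. the original message text
    derived_preference: Optional[str]  # narrow behavior learned from it
    raw_needed: bool                   # is the raw text needed for future behavior?
    user_can_see: bool
    user_can_edit: bool
    scope_shown: bool                  # does the user see where it applies?


def minimize(item: Remembered) -> Remembered:
    """Drop raw details first; keep the derived preference only if the user controls it."""
    raw = item.raw_detail if item.raw_needed else None
    controllable = item.user_can_see and item.user_can_edit and item.scope_shown
    preference = item.derived_preference if controllable else None
    return replace(item, raw_detail=raw, derived_preference=preference)
```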
The takeaway
Treat AI memory as a scoped data system with routing consequences. If a remembered item cannot pass five checks – source, scope, owner, expiry, and user-visible edit path – keep it session-only or retrieve it from the source when needed. Choose batch only when the work is independent, noninteractive, inside provider limits, and safe to run from a fixed memory snapshot.
Sources
[1] OpenAI Responses API reference – https://platform.openai.com/docs/api-reference/responses
[2] OpenAI Batch API guide – https://platform.openai.com/docs/guides/batch
[3] Anthropic Message Batches API documentation – https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
[4] Anthropic prompt caching documentation – https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
[5] Google Vertex AI Gemini batch inference documentation – https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
[6] Amazon Bedrock batch inference documentation – https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html
[7] Azure OpenAI global batch documentation – https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/batch
[8] OpenAI function calling guide – https://platform.openai.com/docs/guides/function-calling
[9] Anthropic tool use documentation – https://docs.anthropic.com/en/docs/build-with-claude/tool-use
[10] OpenAI data controls documentation – https://platform.openai.com/docs/models/how-we-use-your-data
[11] LMArena leaderboard – https://lmarena.ai/leaderboard/
[12] SWE-bench benchmark – https://www.swebench.com/
[13] HumanEval benchmark repository – https://github.com/openai/human-eval
[14] GPQA paper – https://arxiv.org/abs/2311.12022
[15] MMLU paper – https://arxiv.org/abs/2009.03300