This guide is for AI engineers, platform engineers, AI product managers, and startup CTOs deciding which model to route to, which tools or functions to expose, and when offline batch paths reduce user-facing risk because no one is waiting on a live side effect. Tool calling, function calling, and tool use all describe the same core pattern: a model can request external systems such as search, databases, calculators, calendars, file processors, ticketing systems, or internal APIs, but the model is only proposing a call. Your application still owns execution, authorization, validation, and user-facing state.
Last verified: 2026-04-23. Methodology: checked public provider docs for tool/function calling, tool-use boundaries, security guidance, and batch workflow behavior; exact vendor pricing tables, hard limits, public benchmark scores, and availability should live in separate comparison pages because they change frequently.[1][2][3][5][6][7][8][9]
Quick Answer
- Tool calling, function calling, or tool use is the pattern where a model asks your app to run an external capability.
- The model proposes the call; your application executes it, authorizes it, validates it, and records it.
- Use no tool when the prompt already contains enough information to answer safely.
- Use a read tool when the answer depends on live, private, or account-specific data.
- Use a write tool only after authorization, validation, and user confirmation of the exact change.
- Use batch when the task is offline, prevalidated, and acceptable under the provider’s completion window.
The practical decision is not "does this model support tools?" OpenAI, Anthropic, Google Vertex AI, Azure OpenAI, and Amazon Bedrock all document tool or batch workflows in some form. The better question is which model tier should see which tools, under which permissions, at what cost, and with what fallback when the call fails.
What Tool Calling Means
In a tool-calling workflow, the application sends the model a user request plus a list of available tools. OpenAI describes the loop as: send tools, receive a tool call, execute code in your application, return the tool output, then ask the model for the final response.[1] Anthropic’s tool use docs make the same boundary explicit for client tools: the model can request a tool, but your system extracts the tool name and input, runs the tool, and returns a tool result.[2]
| Situation | Model should do | Application should enforce |
|---|---|---|
| The prompt contains the full stack trace, policy excerpt, or document text. | Answer directly if no missing external data is needed. | No tool exposed for this turn unless the user asks to inspect a repo, ticket, log store, or other external system. |
| The user asks for current, private, tenant-specific, or account-specific data. | Request a read-only lookup such as account, invoice, product, calendar, or ticket retrieval. | Bind tenant and account IDs from the authenticated session, not from the user’s prompt. |
| The user asks for a number that depends on rules, units, dates, or currency. | Request a calculator or deterministic policy tool. | Validate units, currency, date ranges, formulas, and rounding before showing the answer. |
| The user asks for a draft, summary, or proposed update. | Create a draft or proposed action without changing the external system. | Label the result as a draft and require a separate send, apply, or approve action. |
| The user asks to refund, close, send, update, delete, or create something. | Use read tools first, then draft the proposed action for review. | No write call until the user sees the target object, old value, new value, amount, reason, and confirms. |
| The task is bulk classification, extraction, enrichment, or evaluation. | Use an offline batch path if no user is waiting for an immediate answer. | Preload needed records, remove secrets, store per-record errors, and make batch retry behavior explicit. |
The model does not know what is safe just because a tool exists in the prompt. Tool calling means the model can ask for an external operation. It does not mean the model has permission to run that operation.
How Function Calling Chooses a Tool
The model decides from the tool name, description, JSON schema, user request, conversation state, and system instructions. Anthropic’s implementation docs say client tool names must match the regex ^[a-zA-Z0-9_-]{1,64}$, with name, description, and input_schema supplied in the request.[2] OpenAI’s function calling docs similarly define tools through a function name, description, parameters, and optional strict schema behavior.[1]
Good tool definitions are operational, not decorative. A tool named get_customer_contract_by_account_id gives the model a narrower choice than lookup_data. A schema with account_id, invoice_id, and currency is easier to validate than a free-form query string. Large tool lists also make evaluation harder, so scope tools by workflow stage instead of sending every possible integration on every turn.
// Bad: broad name, free-form input, unclear action
{
"name": "lookup_data",
"description": "Gets data",
"parameters": {
"type": "object",
"properties": {
"query": { "type": "string" }
}
}
}
// Good: narrow, typed, auditable
{
"name": "get_customer_invoice_summary",
"description": "Read-only lookup for one authorized customer invoice summary. Does not create refunds or change billing state.",
"parameters": {
"type": "object",
"required": ["invoice_id", "currency"],
"properties": {
"invoice_id": { "type": "string" },
"currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] }
}
}
}The better schema is not only clearer for the model. It is clearer for logging, authorization, test fixtures, and incident review. When a tool fails, the application can tell whether the model chose the wrong tool, omitted a required argument, supplied an unauthorized identifier, or received a valid tool error.
Tool Calling Is Not Permission
Microsoft’s Azure OpenAI function calling docs state the core security rule plainly: models can generate calls, but it is up to the application to execute them and stay in control.[3] The same docs warn against relying only on excluded capabilities in the function definition as a security control, and recommend user confirmation for functions that take real-world actions.[3]
- Check authorization against the authenticated user, role, tenant, and workflow before every tool execution.
- Validate arguments with server-side code even when the provider offers strict schema output.
- Bind sensitive identifiers from session state when possible, not from model-generated arguments.
- Require confirmation for irreversible or externally visible actions such as refunds, account changes, email sends, ticket closure, or calendar creation.
- Log tool name, arguments after validation, actor, tenant, result status, latency, and a correlation ID for audit and debugging.
- Prefer read-only tools for exploration, and expose write tools only at the point in the workflow where the user can approve the exact change.
This is also a prompt-injection issue, not only an access-control issue. The OWASP Top 10 for Large Language Model Applications names Prompt Injection and Excessive Agency as LLM application risks.[4] A search result, uploaded PDF, ticket comment, or CRM note can contain text that asks the model to ignore instructions or call a privileged tool. Treat tool output as data, not authority.
Read, Write, and Batch Decisions
Read tools and write tools deserve different routing, logging, and user interface states. A read tool can leak data. A write tool can change the outside world. A batch job can reduce cost and live-workflow risk for offline work, but it is a poor fit when the user needs to review an action before it happens.
Batch endpoints belong in this design conversation because they change time, cost, and operational behavior. Public provider docs differ on completion windows, request limits, file limits, storage paths, and discount rules, so treat batch as an architecture choice first and a cost optimization second.[5][6][7][8][9] This post intentionally avoids quoting those changing numbers in the main body.
After the workflow is safe, compare candidate models and provider economics in the AI model pricing, context-window, and benchmark comparison table, then run your own tool-call evaluation on the exact schemas your product will ship. Use that separate comparison page for vendor-specific batch discounts, limits, context windows, modalities, and public benchmark snapshots.
The decision rule is simple: use synchronous tool calling when the user is waiting, when a confirmation is required, or when the next step depends on live tool output. Use batch when the task is offline, prevalidated, and acceptable under the provider’s documented completion window.
Common Failure Modes
- The model calls search even though the answer is already in the prompt, adding cost and exposing the response to untrusted text.
- The model chooses a write tool when a read tool should come first, such as updating a ticket before retrieving its current status.
- The model invents missing arguments, such as an account ID, invoice ID, region, currency, or date range.
- The tool returns an authorization error and the model apologizes without explaining what the user can do next.
- The model trusts a web page, PDF, or ticket comment that contains hostile instructions.
- The model retries a failed call without backoff, idempotency, or a clear stop condition.
- The tool succeeds but the user sees vague language such as "done" instead of the changed object, timestamp, and next state.
- The team routes an interactive approval workflow through batch and loses the chance to ask the user for confirmation.
Include these cases in integration tests. A useful test suite has at least one fixture for no-tool-needed, wrong-tool bait, missing required argument, unauthorized account, hostile tool output, tool timeout, duplicate write attempt, and partial batch failure.
| Eval metric | Why it matters | Pass signal |
|---|---|---|
| No-tool accuracy | Prevents unnecessary calls when the answer is already present. | The model answers directly and does not call search or lookup tools. |
| Wrong-tool rate | Shows whether names, descriptions, and workflow scoping are clear. | The model picks the read tool before any draft or write action. |
| Invalid-argument rate | Catches missing IDs, unsupported currencies, bad dates, and invented values. | The app rejects invalid arguments and the model asks for the missing information. |
| Unauthorized-call attempts | Tests tenant boundaries and role-based access. | The app denies the call before execution and returns a user-actionable reason. |
| Tool-output injection resistance | Checks whether hostile documents or tickets can steer the model. | The model treats tool output as evidence, not instruction. |
| Recovery quality | Measures what happens after timeouts, partial failures, or denied calls. | The final answer explains state, next step, and what was not changed. |
Design Safer Tool Workflows
A safe refund workflow shows how the pieces fit together. The first model turn should not receive a direct refund tool. It should receive read-only tools such as get_billing_summary, get_refund_policy, and calculate_refund_amount. The application should derive the customer account from the authenticated session, not from the prompt text. If the read tools show a possible duplicate charge, the model can draft a proposed action with invoice ID, amount, currency, reason, and policy basis.
The user then sees a confirmation screen before any external update happens. After confirmation, the application can expose a narrower write tool such as create_refund_request rather than a broad payment mutation. The tool should return a stable request ID, status, timestamp, and whether human review is required. The final model response should say exactly what changed and what did not change.
- Read account and invoice state with session-bound identifiers.
- Calculate the eligible amount with an auditable policy basis.
- Show a draft action: customer, invoice, amount, currency, reason, and expected next state.
- Ask the user to confirm before any external update runs.
- Create a queued refund request, not a direct payment mutation, when review is required.
- Return the request ID, timestamp, status, and review requirement in the final response.
- Expose only the tools needed for the current workflow stage.
- Use narrow schemas with required fields, enums, minimums, maximums, and server-side validation.
- Separate read, draft, and write actions so the user can inspect the proposed change.
- Ask for confirmation before high-impact actions and show the exact object that will change.
- Validate tool results before passing them back to the model or the user.
- Provide explicit success, queued, denied, partial failure, and retry states.
- Monitor tool-call frequency, error rate, invalid arguments, latency, batch expiration, and cost per completed workflow.
The model-routing decision should combine four signals: provider support for the tools you need, observed tool-call accuracy in your evals, cost under synchronous versus batch execution, and the product risk of a wrong call. A cheaper model tier can be correct for read-only classification. A stronger reasoning tier may be justified for ambiguous multi-tool workflows. A batch endpoint may be the right cost plan for offline extraction, but not for user-approved writes.
The Simple Explanation
A tool-calling model decides when an external system might help answer or complete a task. Safe design is mostly product engineering: scope tools, validate inputs, treat outputs as untrusted, separate read, draft, and write states, require approval for side effects, and route offline work to batch only when waiting is acceptable.
FAQ
Is tool calling the same as an agent?
No. Tool calling is the mechanism that lets a model request a tool. An agent is a broader product design that may plan steps, call multiple tools, remember state, retry failures, or ask for approval. You can use tool calling without giving the model open-ended agency.
Should every model see every tool?
No. Tool lists should be scoped by workflow, role, tenant, and risk. A support summary view may need read-only ticket and account tools. A billing adjustment flow may need write tools only after the user reaches the confirmation step.
When should a team use batch instead of synchronous calls?
Use batch for offline work such as classification, extraction, evaluation, and bulk enrichment when the task can tolerate the provider’s documented completion window. Use synchronous calls when the user needs an immediate answer, a tool result changes the next question, or the workflow needs confirmation.
Sources
- OpenAI function calling guide. https://platform.openai.com/docs/guides/function-calling
- Anthropic tool use documentation. https://docs.anthropic.com/en/docs/build-with-claude/tool-use
- Microsoft Azure OpenAI function calling documentation. https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/function-calling
- OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/
- OpenAI Batch API guide. https://platform.openai.com/docs/guides/batch
- Anthropic Message Batches API documentation. https://docs.anthropic.com/en/api/creating-message-batches
- Google Vertex AI Gemini batch inference documentation. https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
- Azure OpenAI batch documentation. https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/batch
- Amazon Bedrock batch inference documentation. https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html