By Deep Digital Ventures AI Engineering Team
Published April 23, 2026; updated April 24, 2026.
The DDV AI Engineering Team builds model-evaluation, routing, and private-data controls for production AI systems.
This guide is for AI engineers, platform engineers, AI product managers, and startup CTOs building evaluation datasets from real user tasks without copying sensitive customer, employee, patient, account, or document data into a shared test set. It focuses on what to keep, what to rewrite, how to score, and how to govern the dataset before any model or provider sees it.
As of April 23, 2026, the provider limits and batch details in the appendix were summarized from linked provider docs. Pricing, limits, and model availability change frequently, so verify those pages before quoting them in a contract, RFP, or cost plan.
Public benchmarks such as MMLU[1], GPQA[2], SWE-bench[3], HumanEval[4], and LMArena[5] are useful for shortlisting candidate models, but they do not prove that a model can handle your support tickets, internal workflows, RAG answers, tool calls, or safety constraints. A private evaluation dataset is where you test your own product moments without turning raw user history into a permanent warehouse.
Quick Summary
- Decision rule: keep an evaluation case only if it can change a model route, prompt, tool contract, safety rule, cost path, or release decision.
- Mini-workflow: name the decision, extract the task pattern, replace identifiers, attach approved source material, write expected behavior, then choose synchronous or batch execution as metadata.
- Redaction checklist: remove direct identifiers, rewrite quasi-identifiers, inspect filenames and metadata, check tool-call JSON and cached prompts, and record the rewrite method without storing the original value.
- Restricted evidence: if the original sensitive case is needed for audit or domain review, keep it in a restricted evidence store and create a sanitized companion row for the shared regression set.
Start With Evaluation Goals
Do not collect user examples until every row has a job. Each test case should connect to a model-routing, prompt, safety, cost, or release decision; if a detail does not change one of those decisions, remove it before the example enters the dataset.
- Task success: for a support workflow, did the model answer the user’s actual refund-policy question from the supplied policy text?
- Instruction following: for an extraction workflow, did the model return the required JSON fields and avoid extra prose?
- Grounding: for RAG, did the answer use the cited source section instead of general model memory?
- Safety and privacy: did the model avoid exposing secrets, credentials, personal identifiers, or private account facts?
- Tool behavior: for function calling or tool use, did the model call the right tool, pass valid arguments, and use the tool result in the final answer?[6][7]
- Execution path: should this test run synchronously for launch gating, or in a batch job for overnight regression coverage?
A useful goal names the action it will trigger: keep the current default model, route a high-risk task to a stronger model, test a cheaper model for low-risk rows, block a release, or rewrite the prompt. Without that action, the example is measurement noise.
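This goal-to-action rule can be enforced mechanically at intake. The sketch below is a hypothetical filter, not a required schema: the `action` field name and the allowed-action values are assumptions chosen to mirror the actions listed above.

```python
# Hypothetical intake filter: reject candidate rows that do not name the
# decision they can change. Field names here are illustrative only.
ALLOWED_ACTIONS = {
    "keep_default_model",
    "route_to_stronger_model",
    "test_cheaper_model",
    "block_release",
    "rewrite_prompt",
}

def has_actionable_goal(row: dict) -> bool:
    """A row earns its place only if it names an action it can trigger."""
    return row.get("action") in ALLOWED_ACTIONS

candidates = [
    {"case_id": "support_refund_policy_042", "action": "block_release"},
    {"case_id": "copied_ticket_123", "action": None},  # measurement noise
]
kept = [r for r in candidates if has_actionable_goal(r)]
```

Running the filter at dataset-commit time keeps "interesting but inactionable" examples from accumulating in the shared set.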
Use Task Patterns Before Raw Examples
Real usage should shape the dataset, but raw logs are rarely the right first artifact. Convert production signals into task patterns that preserve the difficulty of the task while removing the customer, employee, patient, account, document, or vendor details that made the original example sensitive.
| Raw usage signal | Safer evaluation artifact |
|---|---|
| Customer support transcript with an email, order number, and refund complaint | Support scenario with synthetic order ID, redacted email, expected policy section, and a rubric for whether the answer offers the right next step |
| Contract PDF containing client names and negotiated terms | Synthetic contract excerpt that keeps headings, clause order, and ambiguity but replaces parties, dates, and commercial terms |
| User prompt containing private health, finance, or HR facts | Task pattern that states the risk class, allowed answer type, and required refusal or escalation behavior |
| Failed production answer from a RAG system | Error category, source snippets, recreated sanitized prompt, and expected correction |
| Internal workflow that calls CRM, billing, or ticketing tools | Permissioned tool-use test with fake IDs, fake tool outputs, and a rule that the model must not invent tool results |
In one DDV support-policy gate, a bad candidate row was a copied transcript with a real email, order number, exact refund amount, and agent note. The useful row kept the hard part: a late refund request, a policy exception path, a synthetic order ID, and a source excerpt that required escalation instead of a promise. That sanitized row caught a prompt bug where the model promised approval before checking the exception rule; after the prompt required a policy citation and escalation flag, the small gate moved from four failures in twelve cases to zero before the case entered the regression set.
Batch is an execution choice, not a dataset label. The same sanitized row can run synchronously for a release gate, then run in bulk for regression testing. Keep provider limits, discount assumptions, file-size limits, and queue behavior in a maintained appendix or runner configuration, not scattered through the task narrative.
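One way to keep execution mode out of the row narrative is to model it as run metadata that a runner reads at dispatch time. This is a minimal sketch under assumed field names (`execution`, `launch_gate`, `batch_regression`); the point is that one sanitized row appears in both runs without duplication.

```python
# Illustrative sketch: the same sanitized row participates in both the
# synchronous release gate and the overnight batch sweep; only metadata
# differs. All field names are assumptions, not a required schema.
rows = [
    {"case_id": "support_refund_policy_042",
     "execution": {"launch_gate": True, "batch_regression": True}},
    {"case_id": "billing_dispute_007",
     "execution": {"launch_gate": False, "batch_regression": True}},
]

def select_cases(rows: list[dict], mode: str) -> list[dict]:
    """Pick cases for a run; provider limits live in runner config, not here."""
    key = "launch_gate" if mode == "sync" else "batch_regression"
    return [r for r in rows if r["execution"].get(key)]
```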
Here is a safe workflow for turning a real task into an evaluation case:
- Name the product decision first: for example, route billing-dispute summaries to the cheapest model that preserves required fields.
- Extract the task pattern, not the raw conversation: user intent, input shape, required sources, expected output format, and failure mode.
- Replace identifiers before review: customer name becomes Customer A, account ID becomes a synthetic UUID, and exact dollar amounts become realistic placeholders unless the math itself is being tested.
- Attach source material that is already approved for evaluation, such as a sanitized policy excerpt or synthetic tool output.
- Write the expected behavior: fields required, claims forbidden, source IDs required, and scoring rubric.
- Choose execution mode last: run a small synchronous smoke test for release risk, then run the larger sanitized regression set through batch when the result is not needed during the user session.
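The workflow above can be sketched as a single record type that a review tool or runner consumes. This is one possible shape, not a prescribed schema; every field name below is an assumption chosen to match the steps in the list.

```python
# A minimal sketch of the workflow as a record: decision first,
# execution mode last. Field names are illustrative assumptions.
from dataclasses import dataclass
import uuid

@dataclass
class EvalCase:
    decision: str                  # the product decision this row can change
    task_pattern: str              # intent and input shape, not the raw transcript
    sanitized_input: str
    approved_sources: list[str]
    expected_behavior: dict
    execution_hint: str = "batch"  # chosen last; sync only for release gating

def synthetic_account_id() -> str:
    """Replace real account IDs before review; a UUID keeps the shape realistic."""
    return str(uuid.uuid4())

case = EvalCase(
    decision="route billing-dispute summaries to the cheapest safe model",
    task_pattern="summarize dispute; preserve required fields",
    sanitized_input=f"Customer A (account {synthetic_account_id()}) disputes a charge.",
    approved_sources=["sanitized_billing_policy_3_1"],
    expected_behavior={"required_fields": ["amount_class", "next_step"]},
)
```

Because `decision` is a required field with no default, a row cannot be constructed without naming the decision it serves.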
Redact and Rewrite Sensitive Data
Redaction should happen before examples become broadly visible. NIST SP 800-122, Guide to Protecting the Confidentiality of Personally Identifiable Information, recommends context-based handling of PII because risk depends on the data, the system, and the likely impact of disclosure.[8] That is the right posture for AI evals: a name may be harmless in a public press release, but sensitive when attached to an account dispute, medical question, or HR complaint.
- Remove direct identifiers such as names, emails, phone numbers, street addresses, account numbers, API keys, and employee IDs.
- Replace company, customer, patient, or employee details with realistic placeholders that preserve role and relationship, such as regional reseller or former employee.
- Mask exact financial figures when the task is summarization or classification; keep synthetic but mathematically consistent figures when the task tests arithmetic.
- Preserve formatting when it affects model behavior, such as malformed tables, nested bullets, OCR errors, contradictory clauses, or broken JSON.
- Inspect filenames, screenshot text, PDF metadata, spreadsheet tabs, vector-store metadata, retrieved chunks, tool-call JSON, error traces, and cached prompts.
- Record the rewrite method without storing the original value, for example: all customer names replaced with neutral labels; all account IDs replaced with synthetic UUIDs.
Do not redact only the visible prompt. A dataset review should inspect the full payload that will be sent to the model and the full output that will be stored after the run. That includes traces and intermediate files created by the evaluation harness, not just the prompt shown in a spreadsheet.
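A rewrite pass that returns the sanitized text plus a method log, without ever storing the original value, can look like the sketch below. The email and account-ID patterns are assumptions; real deployments would use their own identifier formats and a broader detector set.

```python
# Hedged sketch: pattern-based redaction that records the rewrite method,
# not the original value. The ACCT- format is an assumed in-house ID shape.
import re
import uuid

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
ACCOUNT = re.compile(r"\bACCT-\d{6,}\b")

def redact(text: str) -> tuple[str, list[str]]:
    """Return sanitized text plus a rewrite log that never stores originals."""
    methods = []
    if EMAIL.search(text):
        text = EMAIL.sub("customer.a@example.com", text)
        methods.append("emails replaced with example.com placeholder")
    if ACCOUNT.search(text):
        text = ACCOUNT.sub(f"ACCT-{uuid.uuid4().hex[:10]}", text)
        methods.append("account IDs replaced with synthetic values")
    return text, methods
```

Pattern-based rules are a floor, not a ceiling: they will miss quasi-identifiers in free text, which is why the checklist above still calls for human inspection of the full payload.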
Decide When Synthetic Data Is Better
Synthetic examples reduce privacy risk when the product task depends on structure rather than exact history. A billing email, insurance-style form, software bug report, or HR-policy question can often be recreated from observed patterns while avoiding the original person, employer, account, or document.
- Use synthetic examples for common workflows where the hard part is format, policy retrieval, or instruction following.
- Base the example on observed production patterns, such as a user asking for an exception after a deadline, instead of inventing tidy prompts that users never write.
- Recreate known failures with fake entities: truncation, contradictory source passages, malformed tables, missing attachments, wrong tool selection, and unsafe disclosure requests.
- Keep domain review for high-impact areas. If a test checks tax, medical, legal, hiring, or financial content, have the rubric reviewed by someone qualified to judge that domain.
- Separate public benchmark evidence from private task evidence. Benchmarks can indicate broad capability, but they do not say whether your product should choose batch, cache, tool use, fallback routing, or a stricter refusal policy for your own users.[1][2][3][4][5]
The strongest evaluation set usually mixes sanitized real patterns, synthetic examples, and recreated incident cases. Keep the source label on each row so reviewers know whether the example came from production, an incident review, a domain expert, or a benchmark-inspired pattern.
Control Access to the Dataset
Evaluation data is a product asset and a security asset. NIST AI Risk Management Framework 1.0 organizes AI risk work around Govern, Map, Measure, and Manage.[9] For eval datasets, that means ownership, provenance, access control, measurement history, and retirement rules need to be visible before a model decision is made.
| Control | Practical rule for AI model evaluation |
|---|---|
| Owner | Assign a named product or platform owner who approves new task families and retires stale examples. |
| Role-based access | Let most engineers see sanitized rows; restrict any row that still needs permissioned source text or regulated content. |
| Versioning | Record dataset version, prompt version, model family, provider endpoint, and run date for each comparison. |
| Retention | Keep raw source artifacts out of the shared dataset; set a deletion review for old sanitized examples that no longer match the product. |
| Audit logging | Track who changed test inputs, rubrics, expected outputs, scoring scripts, and judge prompts. |
| Provider boundary | Document whether the row is allowed for approved external providers, only specific regions, or only an internal environment. |
Provider-specific storage paths also affect governance. Some batch systems write input, output, logs, and error files to separate storage locations; others use deployment or region boundaries that affect who can access the data. Those details belong in the dataset policy because they determine who can see evaluation payloads after the run, not just during inference.
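A provider-boundary check can be a small gate that runs before any row leaves the environment. The policy table and destination names below are hypothetical; the pattern is what matters: the boundary is data attached to the row, enforced at dispatch, not tribal knowledge.

```python
# Hypothetical policy gate: which destinations a row may be sent to.
# Case IDs and destination labels are illustrative assumptions.
ROW_POLICY = {
    "support_refund_policy_042": {"destinations": {"approved_external", "internal"}},
    "restricted_hr_case_007": {"destinations": {"internal"}},
}

def allowed(case_id: str, destination: str) -> bool:
    """Check the provider boundary before a row leaves the environment."""
    return destination in ROW_POLICY.get(case_id, {}).get("destinations", set())
```

An unknown case ID defaults to denied, which is the safe failure mode for a shared dataset.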
Track Expected Behavior
An input alone is not an evaluation. Each row needs an expected behavior contract that a reviewer, scoring script, or judge model can apply consistently across providers.
| Dataset field | Example value for a support-policy eval |
|---|---|
| Case ID | support_refund_policy_042, not a production ticket ID. |
| Sanitized input | User asks whether a late refund exception is possible after a missed deadline. |
| Task pattern | RAG answer with policy citation and escalation rule. |
| Approved source | Sanitized refund-policy excerpt, section 3.2. |
| Expected output | Answer must state the standard rule, mention exception review, and cite section 3.2. |
| Disallowed behavior | No invented refund amount, no promise of approval, no request for full card number. |
| Scoring rubric | Pass only if policy source is used, escalation is correct, and no private data is requested. |
| Execution hint | Run synchronously for launch smoke test; include in overnight batch regression. |
| Reason retained | Production failures showed models often promised refunds without source support. |
Tool-use evals need an additional contract. The row should specify the allowed tools, valid argument shape, fake tool output, required final-answer behavior, and failure condition. A model that calls the right billing lookup tool but ignores the returned status still fails the product task.
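The tool-use contract described above lends itself to a deterministic scorer. This sketch assumes a simplified transcript shape (one tool call, one tool result); the key check is the last one: calling the right tool is not enough if the returned status never reaches the final answer.

```python
# Illustrative deterministic scorer for a tool-use row. The transcript
# and contract shapes are simplified assumptions, not a provider format.
def score_tool_case(transcript: dict, contract: dict) -> bool:
    """Pass only if the right tool was called with valid arguments
    and the tool result actually appears in the final answer."""
    calls = transcript["tool_calls"]
    if not calls or calls[0]["name"] not in contract["allowed_tools"]:
        return False
    if any(k not in calls[0]["args"] for k in contract["required_args"]):
        return False
    # The model must use the returned status, not just invoke the tool.
    status = str(transcript["tool_result"].get(contract["result_field_in_answer"], ""))
    return bool(status) and status in transcript["final_answer"]
```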
Build the Dataset Gradually
The first evaluation dataset does not need thousands of rows. It needs coverage of the task families that will change routing, cost, safety, or release decisions. Add examples in layers so the dataset grows from real product pressure rather than from whatever logs are easiest to copy.
- Start with high-volume user tasks, such as support answers, extraction, summarization, code assistance, or document Q&A.
- Add known failures from incident reviews, especially hallucinated citations, wrong tool calls, unsafe disclosure, and format drift.
- Add edge cases before launch: empty source results, conflicting sources, missing fields, long context, malformed input, and user attempts to override instructions.
- Add provider-comparison rows only when they test the same sanitized product case and can change a real route, fallback, or release decision.
- Retire examples when the product, source policy, schema, or model route changes enough that the row no longer tests a real decision.
When model choice or cost becomes part of the retention reason, keep that outside the row payload. Use Deep Digital Ventures AI Models as a separate planning aid for pricing, context windows, modalities, and model notes, then snapshot only the decision evidence that actually justified the dataset row.
Provider and Benchmark Appendix
Last verified: April 23, 2026. Keep volatile limits, discounts, quotas, and source URLs in one maintained table so the main dataset guidance does not become stale when a provider changes a batch policy.
| Reference area | Volatile detail to maintain outside the eval row | Dataset implication |
|---|---|---|
| Anthropic Message Batches[10] | Batch discount, maximum requests, file size, and completion window. | Store batch eligibility and provider boundary as metadata. |
| OpenAI Batch API[11] | Batch discount, completion window, request count, and input-file limit. | Keep file-size constraints in the runner, not the row text. |
| Vertex AI batch inference for Gemini[12] | Batch discount, request count, Cloud Storage file limit, queue behavior, SLA treatment, and cache interaction. | Record whether a cached or batch path was used for comparison. |
| Amazon Bedrock batch inference[13] | S3 input and output locations plus permissions for the identity creating the job and the service role running it. | Document which buckets, roles, logs, and error files can see eval payloads. |
| Azure OpenAI Batch[14] | Deployment type, enqueued-token quota, and batch-specific limits. | Record data-zone and quota boundaries before sending restricted rows. |
| Public benchmark snapshot[1][2][3][4][5] | Benchmark pages, leaderboards, and model lists can change faster than release cycles. | Use benchmark evidence for candidate selection only, then rely on private task rows for product decisions. |
A Useful Dataset Protects Users and Improves Quality
Use this decision rule tomorrow: keep an evaluation case only if it can change a model route, prompt, tool contract, safety rule, cost path, or release decision. If the case requires live sensitive data to be meaningful, keep the original under restricted review and create a sanitized companion case for the shared regression set.
The dataset should prove quality without broadening exposure. Build from task patterns, redact before sharing, use synthetic examples where structure matters more than identity, control provider boundaries, version every run, and make the expected behavior explicit enough to compare results across models and execution paths.
FAQ
What about screenshots, PDFs, and spreadsheet metadata?
Treat them as part of the payload, not as attachments outside the eval. Check OCR text, embedded comments, image alt text, EXIF data, PDF properties, filenames, sheet names, hidden columns, and vector-store metadata before the row enters the shared dataset.
Can a judge model score these rows?
Yes, but only after the row has a clear expected behavior contract. Keep deterministic checks for required fields, citations, tool arguments, and forbidden claims; use a judge model for qualitative grading only when the judge prompt, model version, and review sample are logged.
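Logging a judge-model grading pass can be as simple as recording the judge model, a hash of the judge prompt, and the reviewed sample. This is a minimal sketch; the function name and record shape are assumptions, and hashing the prompt avoids copying potentially sensitive prompt text into run logs.

```python
# Hedged sketch: reproducibility record for a judge-model grading pass.
import datetime
import hashlib

def log_judge_run(judge_model: str, judge_prompt: str, sample_ids: list[str]) -> dict:
    """Record enough metadata to audit or rerun a judge-model grading pass."""
    return {
        "judge_model": judge_model,
        "judge_prompt_sha256": hashlib.sha256(judge_prompt.encode()).hexdigest(),
        "review_sample": sorted(sample_ids),
        "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```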
How should regulated data be approved?
Classify the data first, get domain or compliance approval, minimize the fields, restrict provider destinations, encrypt the evidence store, and set a deletion review. The broad regression dataset should normally contain sanitized companion rows, not regulated originals.
How do we retain restricted originals safely?
Store originals in a separate evidence system with least-privilege access, audit logs, short retention, and a case ID that links to the sanitized row. Engineers should be able to run the shared eval without opening the original person, account, document, or conversation.
Sources
1. MMLU benchmark paper: https://arxiv.org/abs/2009.03300
2. GPQA benchmark paper: https://arxiv.org/abs/2311.12022
3. SWE-bench benchmark site: https://www.swebench.com/
4. OpenAI HumanEval dataset page: https://huggingface.co/datasets/openai/openai_humaneval
5. LMArena leaderboard: https://lmarena.ai/leaderboard/
6. OpenAI function calling guide: https://platform.openai.com/docs/guides/function-calling
7. Anthropic tool use documentation: https://docs.anthropic.com/en/docs/build-with-claude/tool-use
8. NIST SP 800-122, Guide to Protecting the Confidentiality of Personally Identifiable Information: https://csrc.nist.gov/pubs/sp/800/122/final
9. NIST AI Risk Management Framework 1.0: https://www.nist.gov/itl/ai-risk-management-framework
10. Anthropic Message Batches documentation: https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
11. OpenAI Batch API guide: https://platform.openai.com/docs/guides/batch
12. Google Vertex AI batch inference for Gemini: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
13. Amazon Bedrock batch inference documentation: https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html
14. Azure OpenAI Batch documentation: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/batch