Knowledge base cleanup usually starts with a support pattern: customers ask the same question, search lands on the wrong page, or an article gives instructions that no longer match the product. AI models are useful for finding those problems at scale. The goal is not to let the model rewrite your help center by itself; the goal is to turn messy docs into a review queue an editor can trust.
A knowledge base gets messy in small commits: Zendesk Guide articles that were copied instead of redirected, Salesforce Knowledge entries that still mention retired plan names, Intercom Articles with old screenshots, chatbot answers that still repeat a workaround from last quarter, and support macros that contradict the public help center. Each of these is a candidate for the review queue, and each needs evidence attached before anyone changes a page in production.
As of 2026-04-23, the pricing, limits, behaviors, and public benchmark references below are summarized from the provider docs and benchmark pages listed in Sources. Provider pricing and model availability change frequently, so verify those pages before quoting any figure in a contract, RFP, or cost plan.
When To Use Synchronous, Batch, Or Tool Calling
Use synchronous model calls when an editor is waiting in a CMS or review tool. Use batch jobs when the cleanup run can wait, such as nightly duplicate scans, gap clustering, stale-content checks, and private evals. Use tool or function calling when the model must check a current source of truth, such as plan names, API fields, release status, or permission rules.
The vendor mechanics matter, but only after the cleanup shape is clear. OpenAI, Anthropic, Google Vertex AI, Azure OpenAI, and Amazon Bedrock all document asynchronous batch paths with different request limits, file limits, completion windows, and pricing rules; those paths fit overnight review queues better than editor-facing buttons.[1][2][3][4][5]
Before you size a run, verify current dollar math on the provider pricing pages.[6][7][8] For model shortlisting, a broader comparison of AI models can help narrow which model families to test, but the final choice should come from private evals over your own article pairs, ticket clusters, and outdated-answer examples.
| Cleanup job | Route it to | Why |
|---|---|---|
| Editor clicks “analyze this article” | Synchronous endpoint | The editor is waiting, so a batch window is the wrong fit. |
| Nightly duplicate and conflict scan | Batch endpoint | The job can wait, and provider batch docs publish the request and file limits that apply to larger asynchronous runs. |
| Check current plan names, API fields, or release status | Tool or function calling | The model should query approved product metadata instead of guessing from training data. |
| Final rewrite of a high-traffic article | Human editor plus source evidence | The model can draft, but product facts need an owner. |
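The routing table can be sketched as a small function. This is a minimal illustration, not a vendor API: the `Route` enum, the three boolean flags, and the priority order between them are all assumptions made for the example.

```python
from enum import Enum


class Route(Enum):
    SYNCHRONOUS = "synchronous"
    BATCH = "batch"
    TOOL_CALLING = "tool_calling"
    HUMAN_EDITOR = "human_editor"


def route_job(editor_waiting: bool, needs_current_facts: bool, final_rewrite: bool) -> Route:
    """Pick a route for one cleanup job (hypothetical flags, assumed priority order)."""
    if final_rewrite:
        # Final rewrites of high-traffic articles stay with a human owner.
        return Route.HUMAN_EDITOR
    if needs_current_facts:
        # Current plan names, API fields, or release status come from tools, not memory.
        return Route.TOOL_CALLING
    if editor_waiting:
        # Someone is waiting in the CMS, so a batch window is the wrong fit.
        return Route.SYNCHRONOUS
    # Everything that can wait goes to the overnight batch run.
    return Route.BATCH
```

In practice tool calling can be combined with either a synchronous or a batch call; the sketch treats it as a separate route only because the table does.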
Find Duplicates And Conflicts
Duplicates are not only pages with the same title. In a SaaS help center, “Reset SSO for a teammate,” “Recover access after SAML lockout,” and “Admin cannot sign in” may overlap enough to merge, while “Change workspace owner” and “Transfer billing ownership” may look similar but require different controls. The model should compare article pairs against the same customer intent, product area, permissions, and current UI labels.
A practical duplicate run uses two passes. First, generate candidate pairs from deterministic signals such as normalized title text, URL slug, product category, article tags, shared search queries, and embedding similarity. Second, send only the candidate pairs to the model and ask for a structured triage record: article IDs, URLs, quoted evidence from both pages, conflict type, recommended action, and the reason for the model's confidence.
Keep the candidate stage aggressive, but do not let broad vocabulary create noise. A pair should usually need at least two signals before it reaches the model, such as shared search queries plus the same product category, or high embedding similarity plus overlapping ticket clusters. Pages that merely share words like “billing,” “admin,” or “workspace” should stay out unless the customer intent also matches.
- Candidate generation: group pages that share a search intent, such as “cancel subscription,” “close account,” and “delete workspace,” before you spend model tokens on full article comparison.
- False-positive guard: suppress pairs when the roles, product area, or task outcome differ, even if the titles sound similar.
- Model review: ask the model to choose one action: keep both, merge, redirect, update, retire, or escalate to product owner.
- Evidence check: require quote spans from each article, because a label like “duplicate” is not enough for an editor to trust the recommendation.
- Owner queue: cap each owner’s daily review list to the number of items they can actually approve, such as 20 pairs per product area, instead of creating a giant backlog that nobody clears.
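The candidate filter, false-positive guard, and owner cap above could look roughly like the sketch below. The `PairSignals` shape, the signal names, and the use of product area as the queue key are assumptions for illustration; the thresholds mirror the "at least two signals" and "20 pairs" examples in the text.

```python
from dataclasses import dataclass, field


@dataclass
class PairSignals:
    pair: tuple[str, str]      # hypothetical article IDs
    product_area: str
    same_role: bool            # False triggers the false-positive guard
    signals: set[str] = field(default_factory=set)


MIN_SIGNALS = 2    # a pair needs at least two signals before model review
DAILY_CAP = 20     # example daily cap per product area


def select_pairs(candidates: list[PairSignals]) -> dict[str, list[tuple[str, str]]]:
    """Return per-area review lists, filtered by signal count and role, then capped."""
    queues: dict[str, list[tuple[str, str]]] = {}
    # Strongest pairs first, so the cap keeps the best-supported candidates.
    for c in sorted(candidates, key=lambda c: len(c.signals), reverse=True):
        if len(c.signals) < MIN_SIGNALS or not c.same_role:
            continue
        queue = queues.setdefault(c.product_area, [])
        if len(queue) < DAILY_CAP:
            queue.append(c.pair)
    return queues
```

Only the pairs that survive this stage would be sent to the model, which keeps token spend tied to pairs with real overlap.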
Do not accept “these two pages are similar” as a final answer. A useful flag says, “Article A tells admins to remove a user from Billing Settings; Article B tells them to remove the same user from Workspace Members; both answer failed-seat-removal tickets; confirm which UI path is current, then merge or redirect.” That gives the editor the pages involved, the reason for the flag, and the next decision.
| Flag | Required evidence | Likely editor action |
|---|---|---|
| Duplicate intent | Same customer question, same role, same product area | Merge one page into the stronger canonical article and redirect the weaker page |
| Conflict | Two quoted instructions disagree on a setting, permission, plan, API field, or policy | Escalate to product owner before rewriting |
| Near duplicate | Same intent but one page covers an exception such as Enterprise SSO or API-only setup | Keep both, but cross-link and clarify scope |
| Terminology drift | Old and new feature names appear across related pages | Update the older language and add redirect or glossary handling if search traffic depends on the old term |
Example review record: a nightly scan pairs “Remove a paid seat from Billing Settings” with “Remove a teammate from Workspace Members.” Both pages match tickets where admins wrote, “I removed the seat but the user can still log in.” The model flags the pair as a conflict, not a pure duplicate, because one page is about billing capacity and the other is about account access.
| Field | Example output |
|---|---|
| Evidence from Article A | “Owners can remove paid seats from Billing Settings > Seats.” |
| Evidence from Article B | “Admins remove teammates from Workspace Members > People.” |
| Recommended action | Keep both articles, add scope notes, cross-link them, and update search synonyms for “remove paid user.” |
| Human decision | Accepted with one change: no redirect, because billing-seat removal and access removal are separate tasks. Product confirmed the current labels and the editor added a warning to the billing article. |
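A review record like the one above is easier to trust if the quoted evidence is checked mechanically before it reaches an editor. The sketch below assumes a hypothetical `TriageRecord` shape and action labels derived from the list of model actions; it only verifies that each quote is a verbatim span of its article.

```python
from dataclasses import dataclass

# Action labels assumed from the text: keep both, merge, redirect, update, retire, escalate.
ACTIONS = {"keep_both", "merge", "redirect", "update", "retire", "escalate"}


@dataclass
class TriageRecord:
    article_a_id: str
    article_b_id: str
    evidence_a: str      # quote the model claims appears in article A
    evidence_b: str      # quote the model claims appears in article B
    conflict_type: str
    action: str


def validate(record: TriageRecord, text_a: str, text_b: str) -> list[str]:
    """Return a list of problems; an empty list means the record can be queued."""
    problems = []
    if record.action not in ACTIONS:
        problems.append(f"unknown action: {record.action}")
    # Quotes must be verbatim spans, otherwise the editor cannot verify them.
    if not record.evidence_a or record.evidence_a not in text_a:
        problems.append("evidence_a is not a verbatim quote from article A")
    if not record.evidence_b or record.evidence_b not in text_b:
        problems.append("evidence_b is not a verbatim quote from article B")
    return problems
```

Records that fail validation can be dropped or re-run instead of landing in the owner queue with unverifiable quotes.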
Identify Gaps From Real Questions
Gap detection should start from customer language, not from a brainstorm. Use solved support tickets, failed help-center searches, chatbot fallback events, thumbs-down feedback, and repeated support macros. If three different channels show the same unanswered question, that is stronger evidence than a model suggesting a topic because it sounds plausible.
- Support tickets: cluster solved tickets where agents pasted the same workaround, then ask whether the answer belongs in an existing article or a new one.
- Search logs: group queries with no click, short dwell time, or immediate ticket creation, then map each cluster to a current article URL or a missing topic.
- Chatbot failures: treat fallback prompts as a separate input source, because bot failures often reveal wording that docs teams never use.
- Feedback forms: send only the comment, article URL, product area, and current article excerpt to the model, then ask for the smallest useful update.
The output should distinguish “missing article” from “bad article.” A gap is missing when no canonical page answers the question. A bad article exists but fails because the title is unclear, the first paragraph skips the user’s symptom, the screenshots are stale, or the article assumes admin permissions the reader may not have.
A useful mini-workflow is simple: export one week of support tickets and help-center searches, remove personal data, cluster by customer intent, map each cluster to current URLs, then ask the model to produce a queue with new_article, update_existing, merge, or no_action. The reviewer should see the ticket count, search terms, related article URLs, and one sample customer phrase for each recommendation.
False positives often come from temporary incidents, bugs, or account-specific states. If every ticket in a cluster came from the same outage window, route it to release notes or status follow-up instead of creating a permanent help article. If the cluster contains only one enterprise account with a custom configuration, mark it as account-specific and keep it out of the public backlog.
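The mini-workflow and its false-positive guards can be sketched as a rule-based pre-triage step that runs before or alongside the model. The `Cluster` fields and the mapping from matched URLs to actions are assumptions for the example; the four action labels come from the workflow described above.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    intent: str
    ticket_count: int
    matched_urls: list[str]       # current articles that answer this intent
    account_ids: set[str]         # accounts that raised the tickets
    all_in_outage_window: bool    # every ticket falls inside a known incident window


def triage_cluster(c: Cluster) -> tuple[str, str]:
    """Return (action, reason) using the four queue labels."""
    if c.all_in_outage_window:
        return "no_action", "incident: route to release notes or status follow-up"
    if len(c.account_ids) == 1:
        # Simplification: any single-account cluster is treated as account-specific.
        return "no_action", "account-specific: keep out of the public backlog"
    if not c.matched_urls:
        return "new_article", "no canonical page answers this intent"
    if len(c.matched_urls) == 1:
        return "update_existing", "one article exists but did not resolve the question"
    return "merge", "several articles cover the same intent"
```

The reason string gives the reviewer the same kind of evidence the text asks for, next to the ticket count and sample phrases.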
Refresh Outdated Content
Outdated content is usually a source-of-truth problem. The model may notice that an article mentions a retired plan, deprecated API field, old UI label, or screenshot filename from a previous release, but it should not invent the replacement. Feed it approved product facts from release notes, a plan matrix, an OpenAPI schema, a feature-flag catalog, or a CMS field such as last reviewed date.
When the model must check live or internal data, use controlled tool paths. OpenAI, Anthropic, and Vertex AI all document patterns where the application supplies functions or tools and returns the results to the model.[9][10][11] For knowledge base cleanup, those tools should be read-only and narrow: fetch article metadata, fetch a product string, fetch a plan entitlement, or fetch the current API field list.
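One way to keep those tools read-only and narrow is an allowlist on the application side. The sketch below is provider-agnostic and does not use any vendor SDK; the tool names, the in-memory `PLAN_MATRIX` and `API_FIELDS` data, and the `dispatch` helper are all hypothetical.

```python
from typing import Callable

# Hypothetical in-memory sources of truth; a real system would query approved stores.
PLAN_MATRIX = {"Team": {"sso": False}, "Enterprise": {"sso": True}}
API_FIELDS = ["id", "email", "role"]


def fetch_plan_entitlement(plan: str, feature: str) -> bool:
    """Read one entitlement from the plan matrix."""
    return PLAN_MATRIX[plan][feature]


def fetch_api_fields() -> list[str]:
    """Return a copy of the current API field list."""
    return list(API_FIELDS)


# Allowlist of read-only tools; anything not listed is refused.
READ_ONLY_TOOLS: dict[str, Callable[..., object]] = {
    "fetch_plan_entitlement": fetch_plan_entitlement,
    "fetch_api_fields": fetch_api_fields,
}


def dispatch(tool_name: str, arguments: dict) -> object:
    """Run a model-requested tool call only if it is on the allowlist."""
    if tool_name not in READ_ONLY_TOOLS:
        raise PermissionError(f"tool not allowed: {tool_name}")
    return READ_ONLY_TOOLS[tool_name](**arguments)
```

The application would call `dispatch` with the tool name and arguments the model requested, then return the result to the model in whatever format the provider documents.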
- Deprecated names: flag articles where the old feature name appears in the title, H2s, screenshot alt text, or first 100 words.
- Plan and policy drift: compare article claims against the current plan matrix before rewriting pricing, retention, permissions, or security statements.
- API drift: compare code snippets and field names against the current OpenAPI schema, then send suspected mismatches to the API owner.
- Screenshot drift: flag screenshots whose visible navigation, button label, or filename no longer matches the current UI copy.
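The deprecated-name check in the list above is mostly string matching against an approved rename table, so it can run before any model call. The sketch assumes a hypothetical article dictionary with `title`, `headings`, `alt_texts`, and `body` keys, and a `renamed` mapping supplied by product, not inferred by the model.

```python
import re


def first_words(text: str, n: int = 100) -> str:
    """Return the first n whitespace-separated words of a text."""
    return " ".join(text.split()[:n])


def find_deprecated(article: dict, renamed: dict[str, str]) -> list[dict]:
    """Flag old names in the title, headings, alt text, or first 100 words."""
    zones = {
        "title": article["title"],
        "headings": " ".join(article["headings"]),
        "alt_text": " ".join(article["alt_texts"]),
        "first_100_words": first_words(article["body"]),
    }
    flags = []
    for old, new in renamed.items():
        # Whole-word, case-insensitive match on the retired name.
        pattern = re.compile(rf"\b{re.escape(old)}\b", re.IGNORECASE)
        for zone, text in zones.items():
            if pattern.search(text):
                flags.append({"zone": zone, "old": old, "approved_new": new})
    return flags
```

Each flag carries the approved replacement from the rename table, so the model is never asked to invent one.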
A stale-content recommendation should carry one acceptance packet: the source record, the exact text being changed, the proposed replacement, and the owner who can approve that fact. That packet keeps the review focused on whether the fact is current, not on whether the model sounds confident.
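As a rough sketch, the acceptance packet can be a small record with a completeness check. The field names and the `ready_for_review` helper are assumptions for illustration; the four fields follow the packet described above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AcceptancePacket:
    source_record: str    # where the approved fact lives, e.g. a release note ID
    current_text: str     # exact text being changed
    proposed_text: str    # proposed replacement
    fact_owner: str       # person or team who can approve the fact


def ready_for_review(packet: AcceptancePacket, article_body: str) -> bool:
    """True only if every field is filled and the quoted text is really in the article."""
    fields_filled = all(value.strip() for value in asdict(packet).values())
    return (
        fields_filled
        and packet.current_text in article_body
        and packet.current_text != packet.proposed_text
    )
```

Packets that fail the check never reach an owner, which keeps the queue limited to changes that name both a source and an approver.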
Measure Support Impact
Measure cleanup like a product change. Take a baseline before the cleanup run, then compare the same metrics after publication using the same time window. A 28-day window is often long enough to smooth weekday support patterns without waiting a full quarter.
| Metric | Definition | Decision it supports |
|---|---|---|
| Search success | Searches that lead to an article click and no immediate repeat search | Whether gap articles use customer language |
| Ticket repeat rate | Tickets in the same intent cluster after the article update, compared with the count for that cluster in the baseline window | Whether the cleanup reduced support volume |
| Helpful vote rate | Positive article feedback divided by total article feedback | Whether the answer is clear enough for self-service |
| Editor acceptance rate | AI recommendations accepted by a content owner without major correction | Whether the routing, prompt, and evidence rules are working |
| Rollback rate | Published changes reverted because the model or reviewer used wrong facts | Whether source-of-truth checks are strong enough |
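A before-and-after comparison of these metrics might be computed as below. Only the helpful vote rate has an explicit denominator in the table; the other denominators, the dictionary keys, and the helper names are assumptions for the sketch.

```python
from typing import Optional


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return a ratio, or None when the window has no data."""
    return numerator / denominator if denominator else None


def window_metrics(w: dict) -> dict:
    """Compute the table's metrics for one measurement window."""
    return {
        "search_success": rate(w["successful_searches"], w["searches"]),
        "helpful_vote_rate": rate(w["helpful_votes"], w["total_votes"]),
        "editor_acceptance_rate": rate(w["accepted_recs"], w["reviewed_recs"]),
        "rollback_rate": rate(w["rollbacks"], w["published_changes"]),
        "tickets_in_cluster": w["tickets_in_cluster"],
    }


def compare(baseline: dict, after: dict) -> dict:
    """Change per metric between two equal-length windows (after minus baseline)."""
    b, a = window_metrics(baseline), window_metrics(after)
    return {k: None if a[k] is None or b[k] is None else a[k] - b[k] for k in b}
```

Both windows should cover the same number of days, such as the 28-day window mentioned above, or the deltas are not comparable.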
Use public benchmark pages only as a coarse screen for general model behavior.[12][13][14][15][16] They do not tell you whether a model will avoid false duplicate flags in your knowledge base. Run a private eval with at least 100 duplicate pairs, 100 gap clusters, and 100 outdated-answer examples from your own library. Score the model on evidence quality, correct action, false positives, and whether the recommendation can be reviewed in under two minutes.
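A private eval of this kind can be scored with a few lines of code. The sketch assumes each example is a dictionary with hypothetical `expected` and `predicted` action labels, where `no_action` means nothing should be flagged; evidence quality and review time still need a human rating and are not covered here.

```python
MIN_EXAMPLES = 100  # minimum set size suggested in the text, per eval set


def score_eval(examples: list[dict]) -> dict:
    """Score one eval set on correct action and on flags that should not have been raised."""
    if not examples:
        raise ValueError("eval set is empty")
    correct = sum(e["expected"] == e["predicted"] for e in examples)
    flagged = [e for e in examples if e["predicted"] != "no_action"]
    false_flags = sum(e["expected"] == "no_action" for e in flagged)
    return {
        "n": len(examples),
        "correct_action_rate": correct / len(examples),
        # Share of raised flags where the labeled answer was "no_action".
        "false_flag_share": false_flags / len(flagged) if flagged else 0.0,
        "meets_minimum_size": len(examples) >= MIN_EXAMPLES,
    }
```

Running the same scorer separately over the duplicate, gap, and outdated-answer sets shows which job a given model handles worst.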
The routing rule is the part teams can apply tomorrow: send overnight scans to batch when the provider limits fit the job, keep editor-facing actions synchronous, require tool calls for current product facts, and publish only recommendations that include evidence a human owner can verify.
FAQ
What are the most common failure modes? The model may over-merge pages that share vocabulary but serve different roles, treat outage tickets as missing docs, update facts from stale articles, or produce too many low-confidence flags. Reduce those errors with stricter candidate thresholds, source records, and reviewer rejection codes.
What should not be automated? Do not automate deleting articles, creating redirects, changing pricing or security claims, altering compliance language, or publishing high-traffic rewrites. The model can prepare the evidence bundle, but those changes need a content owner and, often, a product or legal reviewer.
How should teams measure false positives? Track why reviewers reject recommendations: no real issue, wrong action, weak evidence, wrong source, or duplicate of an existing ticket. Set a precision target before scaling the job; for example, require that at least 80% of reviewed flags lead to an accepted update, merge, redirect, or documented no-action decision.
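The rejection codes and precision target in the previous answer can be tallied as follows. The code labels are hypothetical spellings of the reasons listed there, and `accepted` stands for any accepted update, merge, redirect, or documented no-action decision.

```python
from collections import Counter

REJECTION_CODES = {"no_real_issue", "wrong_action", "weak_evidence", "wrong_source", "duplicate_ticket"}
PRECISION_TARGET = 0.80  # example target from the text


def review_summary(decisions: list[str]) -> dict:
    """Summarize reviewer decisions; each item is 'accepted' or one rejection code."""
    if not decisions:
        raise ValueError("no reviewed flags")
    unknown = set(decisions) - REJECTION_CODES - {"accepted"}
    if unknown:
        raise ValueError(f"unknown decision codes: {sorted(unknown)}")
    counts = Counter(decisions)
    precision = counts["accepted"] / len(decisions)
    return {
        "precision": precision,
        "meets_target": precision >= PRECISION_TARGET,
        "rejections": {code: counts[code] for code in sorted(REJECTION_CODES) if counts[code]},
    }
```

The rejection breakdown shows which fix to try first: tighter candidate thresholds for `no_real_issue`, better source records for `wrong_source`, and so on.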
How often should cleanup audits run? Run lightweight scans weekly for new searches, tickets, and feedback. Run deeper monthly or quarterly audits after major launches, pricing changes, plan changes, API releases, and UI redesigns. Stale-content checks should also run immediately after release notes mention renamed features or deprecated fields.
Sources
1. OpenAI Batch API docs – batch windows, request limits, and cost notes: https://platform.openai.com/docs/guides/batch
2. Anthropic Message Batches docs – batch pricing, request limits, and expiration window: https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
3. Google Vertex AI Gemini batch prediction docs – batch discounts, request limits, queueing, and cache discount rule: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
4. Azure OpenAI Global Batch docs – target turnaround, request file limits, and cost notes: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/batch
5. Amazon Bedrock batch inference docs – asynchronous S3 jobs and provisioned model limitation: https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html
6. OpenAI pricing page – current model pricing reference: https://platform.openai.com/docs/pricing/
7. Anthropic pricing page – current Claude pricing reference: https://docs.claude.com/en/docs/about-claude/pricing
8. Vertex AI pricing page – current Gemini and Vertex AI pricing reference: https://cloud.google.com/vertex-ai/generative-ai/pricing
9. OpenAI function calling docs – tool and function calling patterns: https://platform.openai.com/docs/guides/function-calling
10. Anthropic tool use docs – Claude tool use patterns: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/overview
11. Vertex AI function calling docs – Gemini function calling patterns: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/function-calling
12. MMLU paper – broad multitask language understanding benchmark: https://arxiv.org/abs/2009.03300
13. GPQA paper – graduate-level question answering benchmark: https://arxiv.org/abs/2311.12022
14. HumanEval repository – code generation benchmark: https://github.com/openai/human-eval
15. SWE-bench site – software engineering benchmark: https://www.swebench.com/SWE-bench/
16. LMArena leaderboard – crowd-sourced model comparison leaderboard: https://lmarena.ai/leaderboard/