This comparison is about which multimodal models should review PDFs, charts, and screenshots in production, not which provider has the longest spec sheet. The useful question is whether a model can read the evidence, preserve the number or UI state, and return a result your product can route safely.
As of 2026-04-23, the pricing, limits, and behaviors below are summarized from provider docs and Deep Digital Ventures (DDV) spot tests. Provider pricing and model availability change frequently, so verify the linked sources before quoting them in a contract, RFP, or cost plan.
The comparison covers three jobs: PDF review, chart extraction, and screenshot review. You should leave with a routing rule: use PDF-aware models when page structure matters, pair chart vision with source data when exact numbers matter, keep live screenshot questions synchronous, and move only offline homogeneous work to batch.
Before building the test harness, use Deep Digital Ventures AI Models to filter candidates by modality, context window, public benchmark fields, pricing per million input and output tokens, and cost-estimator scenarios. The article can explain the decision; the tool lets you change token volumes, compare candidates side by side, and shortlist models for your own files.
PDFs, Charts, and Screenshots Fail Differently
A SEC Form 10-K, an ASC 606 revenue-recognition footnote, a Power BI screenshot, and a scanned invoice can all enter the same “document review” queue, but they do not fail the same way. The model that summarizes a clean text PDF well may still misread a chart axis, miss a disabled UI control, or invent a number when the scan is low quality.
| Task | Recognizable input | Model must handle | Failure risk | Routing hint |
|---|---|---|---|---|
| PDF review | SEC Form 10-K annual report, contract exhibit, invoice PDF | Text layer, reading order, headings, footnotes, citations, cross-page context | Missing a key clause, losing a footnote, or mixing two page sections | Use a PDF-aware model only when the file fits provider page and size limits; otherwise split with stable page IDs. |
| Chart extraction | ASC 606 revenue table, ARR chart, board-deck dashboard | Axes, labels, trend direction, units, legends, visible data labels | Returning the right trend but the wrong number, unit, or time period | Use vision for interpretation, but preserve the source table, CSV, XBRL, or spreadsheet when exact values matter. |
| Screenshot review | Power BI dashboard, billing screen, support ticket screenshot | Visible text, UI state, selected filters, disabled controls, error banners, spatial relationships | Misidentifying a button, missing a selected filter, or treating hidden state as visible evidence | Use synchronous review for user-facing support and batch only for offline QA or backlog labeling. |
What Our Spot Tests Showed
DDV ran a compact April 2026 evaluation before writing this post: 50 files total, with 10 clean PDFs, 10 scanned PDFs, 10 chart images, 10 dashboard screenshots, and 10 cross-page documents. Each example had a written answer key with the expected answer, evidence location, page or region, unit, and whether a human review flag should be returned.
The exact tasks were narrow on purpose: identify the revenue row and fiscal year in a filing excerpt, return “not readable” for poor scans, extract a chart value with its unit, name a selected dashboard filter and disabled control, and compare renewal terms across two pages. That made the failures easier to separate from prompt style.
| Model route tested | Best result | Common miss | Observed tradeoff |
|---|---|---|---|
| OpenAI vision route through Responses | Strong screenshot reasoning when the prompt required visible evidence and a structured JSON answer. | Occasionally described the right UI state but failed to name the exact selected filter when the dashboard was crowded. | Best fit where the visual answer needs tool calls or policy checks in the same workflow. |
| Claude PDF route | Strong on long-form PDF questions where headings, footnotes, and charts appeared inside the same document. | Low-quality scans still needed OCR fallback; otherwise the model sometimes answered from plausible context instead of readable evidence. | Best fit for PDF-native review, with page and file limits checked before routing. |
| Gemini document route on Vertex AI | Handled larger PDF envelopes and offline document batches cleanly in the test plan. | Chart answers still needed source data validation when the visual label was small or the unit changed across panels. | Best fit when long PDFs and cloud batch operations are already part of the pipeline. |
| Extraction plus model review | Most reliable for CFO charts, revenue tables, and invoices where exact values mattered. | Required more engineering work: PDF parsing, OCR, table extraction, and deterministic checks before the model call. | Higher pipeline complexity, but fewer numeric and unit errors than vision-only review. |
The failures were more useful than the wins. Vision-only review was usually good at trend language and screenshots with clear labels, but it was weaker when the answer depended on a tiny axis label, a negative sign, a disabled control, or a number copied from a dense table. The cost and latency tradeoff was straightforward: batch made sense only when delay was acceptable and the files were homogeneous enough to audit by custom ID.
Best Model Traits for PDF Review
For PDF review, compare models on reading order, table handling, page citations, footnotes, scan quality, and file-fit limits. Claude deserves a separate PDF test because Anthropic documents PDF support that covers text, pictures, charts, and tables, with a 32 MB maximum request size and 100 pages per request.[1] Gemini on Vertex AI matters when long PDFs are common, because the Vertex document-understanding page lists larger model-specific PDF envelopes for Gemini 2.5 Pro and Gemini 2.5 Flash.[2]
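The file-fit check can be sketched in a few lines of Python. This is a minimal sketch, assuming the 32 MB and 100-page figures quoted above are still current; `PageChunk`, `fits_directly`, `plan_chunks`, and the chunk ID format are hypothetical names for illustration, not part of any provider SDK.

```python
from dataclasses import dataclass

# Limits quoted in this article for Anthropic PDF support (assumption: still
# current). The 32 MB figure is a request-size limit, so comparing it to the
# raw file size is only a rough first filter.
MAX_REQUEST_BYTES = 32 * 1024 * 1024
MAX_PAGES_PER_REQUEST = 100


@dataclass(frozen=True)
class PageChunk:
    chunk_id: str    # hypothetical stable ID, e.g. "10k:p101-120"
    first_page: int  # 1-indexed, inclusive
    last_page: int   # 1-indexed, inclusive


def fits_directly(size_bytes: int, page_count: int) -> bool:
    """Return True when the file is within both quoted limits."""
    return size_bytes <= MAX_REQUEST_BYTES and page_count <= MAX_PAGES_PER_REQUEST


def plan_chunks(doc_id: str, page_count: int,
                pages_per_chunk: int = MAX_PAGES_PER_REQUEST) -> list[PageChunk]:
    """Plan page ranges with IDs that stay the same across reruns."""
    if page_count < 1 or pages_per_chunk < 1:
        raise ValueError("page_count and pages_per_chunk must be positive")
    chunks = []
    for first in range(1, page_count + 1, pages_per_chunk):
        last = min(first + pages_per_chunk - 1, page_count)
        chunks.append(PageChunk(f"{doc_id}:p{first}-{last}", first, last))
    return chunks
```

For a 120-page file, `plan_chunks("10k", 120)` plans two ranges, pages 1 to 100 and pages 101 to 120. The sketch only plans page ranges; the byte size of each split file still has to be checked after the PDF is actually cut.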
The OpenAI path fits application workflows where the visual answer may need a tool call. The Responses API supports text and image inputs with text or JSON outputs, built-in tools, and function calling.[3] Image inputs are metered in tokens, and OpenAI’s function calling guide shows how an application can give the model JSON-schema tools.[4][5] If a screenshot review must call a policy checker, fetch customer metadata, or open an internal runbook, test that tool flow, not only the visual answer.
What Matters for Chart Extraction
For charts and tables, vision should not be the only source of truth when precision matters. A stronger workflow extracts the text layer, OCR output, table cells, CSV, XBRL value, or spreadsheet first, then asks the model to compare that source data with the rendered page or screenshot.
The practical pipeline is simple: parse the PDF or spreadsheet, attach the rendered page image, ask the model to separate “quoted evidence,” “calculation,” and “inference,” and use tool calling for deterministic checks such as totals, percentages, and date comparisons. OpenAI documents this pattern through function calling, and Anthropic documents it through Claude tool use.[5][6] The model can explain the result, but the arithmetic should come from code or the original data.
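One way to keep the arithmetic in code is a small deterministic check that recomputes a figure from the source values and compares it with what the model claimed. The sketch below is illustrative only: `percent_change`, `check_claimed_change`, and the 0.1-point tolerance are assumptions, and the wiring that would expose this function as a provider tool is not shown.

```python
from decimal import Decimal


def percent_change(previous: Decimal, current: Decimal) -> Decimal:
    """Percentage change from previous to current, computed from source data."""
    if previous == 0:
        raise ValueError("percent change is undefined when the previous value is zero")
    return (current - previous) / abs(previous) * 100


def check_claimed_change(previous: str, current: str, claimed_percent: str,
                         tolerance: str = "0.1") -> dict:
    """Compare the model's claimed percentage with the recomputed one.

    Values arrive as strings (for example from a parsed table cell) and are
    converted to Decimal so no binary floating-point rounding is introduced.
    """
    computed = percent_change(Decimal(previous), Decimal(current))
    difference = abs(computed - Decimal(claimed_percent))
    return {
        "computed_percent": computed.quantize(Decimal("0.01")),
        "matches_claim": difference <= Decimal(tolerance),
    }
```

With made-up source values of 1,200,000 and 1,350,000, a claimed change of 12.5 percent passes and a claimed change of 15 percent is flagged. The model still writes the explanation, but the number that ships is the recomputed one.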
When Batch Beats Synchronous Review
Batch is a cost and throughput tool, not a default architecture. Use it for nightly invoice extraction, backfill labeling, regression review, and benchmark runs. Keep live screenshot support, interactive contract review, and user-facing error diagnosis on synchronous endpoints unless the product explicitly tells the user the answer will arrive later.
| Provider route | Decision-critical dated limits and behavior as of 2026-04-23 | Best use |
|---|---|---|
| OpenAI Responses and Batch | Responses supports text and image inputs, JSON outputs, built-in tools, and function calling. Batch supports offline request files; verify current request and file-size envelopes before submitting. | Interactive workflows where the visual answer may need a tool call, plus offline jobs that fit the batch file envelope. |
| Anthropic Claude PDF and Message Batches | PDF support page lists 32 MB and 100 pages per request. A Message Batch may contain up to 100,000 requests or 256 MB, whichever is reached first, can take up to 24 hours, and Anthropic documents batch usage at 50% of standard API prices.[1][7][8] | PDF-heavy review and offline document jobs where 24-hour completion is acceptable. |
| Google Vertex AI Gemini | Document understanding lists model-specific PDF limits, including larger file and page envelopes for Gemini 2.5 models. Batch inference docs state a 50% discounted rate, up to 200,000 requests in a single batch job, a 1 GB Cloud Storage input file limit, possible queueing for up to 72 hours under high traffic, and exclusion from the Service Level Objective of any SLA. Cache-hit discounts do not stack with batch; the cache hit discount takes precedence.[2][9] | Long PDFs, large offline jobs, and cloud-native pipelines already running on Vertex AI. |
| Amazon Bedrock batch | Batch inference uses S3 input and output for asynchronous jobs and is not supported for provisioned models. Model IDs, input modalities, streaming support, and quotas are exposed through Bedrock foundation model information docs.[10][11] | AWS-controlled pipelines where governance, storage, and model access are centered on Bedrock. |
| Azure OpenAI Batch | Azure OpenAI batch docs describe a 24-hour target turnaround and 50% lower cost than global standard. The quotas page lists batch limits such as 100,000 requests per file and 200 MB maximum input file size.[12][13] | Azure-controlled workloads where discounted offline processing matters more than immediate response. |
| Workload | Endpoint choice | Reason |
|---|---|---|
| User asks “what does this error screenshot mean?” in a live support flow | Synchronous vision request | The answer is needed now, and batch completion windows are too slow. |
| Nightly review of 8,000 invoice PDFs | Provider batch endpoint if file size and request count fit | The work is offline, repeatable, and easier to audit by custom ID. |
| Chart extraction from a CFO board deck | Table extraction plus vision review | The model can explain the trend, but source data should decide the numbers. |
| Regulated contract review | Synchronous or batch plus mandatory human review | The model output should be evidence, not the final legal decision. |
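For the nightly invoice row, the "fits the envelope" condition can be checked before anything is submitted. The sketch below groups requests into batch files that respect both a request cap and a byte cap. The request shape (`custom_id`, `file`) and the `plan_batches` helper are hypothetical, and the caps are parameters because each provider publishes its own.

```python
import json


def plan_batches(requests: list[dict], max_requests: int,
                 max_bytes: int) -> list[list[str]]:
    """Group requests into JSONL batches that respect both caps.

    Each request is serialized to one JSON line; a new batch starts when
    adding the line would exceed either the request cap or the byte cap.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for request in requests:
        line = json.dumps(request, sort_keys=True)
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        if line_bytes > max_bytes:
            raise ValueError(
                f"request {request.get('custom_id')!r} exceeds the byte cap on its own"
            )
        if current and (len(current) >= max_requests
                        or current_bytes + line_bytes > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += line_bytes
    if current:
        batches.append(current)
    return batches


# Hypothetical nightly run: 8,000 invoices, each with an auditable custom ID.
invoices = [
    {"custom_id": f"invoice-{i:05d}", "file": f"invoice_{i:05d}.pdf"}
    for i in range(8000)
]
nightly = plan_batches(invoices, max_requests=3000, max_bytes=10**9)
```

With the illustrative cap of 3,000 requests per file, the 8,000 invoices split into three batches. Because every line carries its own `custom_id`, a failed or suspicious result can be traced back to one file without rerunning the whole job.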
Evaluation Test Set and Scoring
Build the test set before choosing a model. Write the expected answer first, including page number, table row, chart label, visible UI element, confidence, and the human-review rule. If you need public filing shapes without client data, use public-company examples from SEC EDGAR.[14]
- A clean text PDF with tables, such as a 10-K excerpt where the answer depends on a named row and fiscal year.
- A scanned or image-heavy PDF where the correct result should trigger OCR fallback or “not readable” uncertainty.
- A dashboard screenshot with filters, visible labels, and at least one disabled or selected control.
- A chart with small axis labels, a legend, and units that change the meaning of the answer.
- A document that requires cross-page comparison, such as a customer name on page 2 and renewal terms on page 9.
- A noisy file with logos, sidebars, comments, or irrelevant screenshots that should not enter the answer.
Run each example synchronously through each candidate model with the same prompt and the same requested JSON fields: answer, evidence, page_or_region, confidence, and needs_human_review. Grade against the answer key before looking at cost. A cheaper model that loses units or source location should not get production traffic for that class.
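A schema check like the following can run before any grading, so malformed responses are counted as failures rather than silently repaired. This is a sketch under stated assumptions: the five field names come from the paragraph above, but the field types, the 0 to 1 confidence range, and the `validate_review` helper are illustrative choices.

```python
import json
from typing import Optional

# Field names are the ones requested in the prompt; the types are assumptions.
REQUIRED_FIELDS = {
    "answer": str,
    "evidence": str,
    "page_or_region": str,
    "confidence": (int, float),
    "needs_human_review": bool,
}


def validate_review(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse a model response and list every schema problem found."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
            continue
        value = data[name]
        is_bool = isinstance(value, bool)
        # bool is a subclass of int in Python, so reject it for numeric fields.
        ok = is_bool if expected is bool else (isinstance(value, expected) and not is_bool)
        if not ok:
            problems.append(f"wrong type: {name}")

    if not problems and not 0 <= data["confidence"] <= 1:
        problems.append("confidence must be between 0 and 1")
    return (data if not problems else None), problems
```

A response that fails validation can be routed to human review the same way a low-confidence answer would be, which keeps the product path identical for both cases.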
- Evidence location: require page number, section title, table row, or visible screen region. For a 10-K, “revenue increased” is not enough; the answer should point to the MD&A table or footnote it used.
- Numeric fidelity: preserve currency, percentage signs, date ranges, negative signs, and units. A model that reads “$1.2 million” as “1.2 billion” fails the chart or table class.
- Layout understanding: verify that headers, footnotes, and sidebars are not merged into the wrong section.
- Uncertainty behavior: mark the answer wrong if the model guesses when the axis label, scan, or UI state is unreadable.
- Structured output: require the same schema every time so the product can route low-confidence answers to review.
- File fit: compare real files against provider limits before choosing a model. A 120-page PDF exceeds Anthropic’s direct PDF support limit, while Vertex AI document understanding lists larger per-file page limits for several Gemini models.[1][2]
- Benchmark fit: use public benchmark fields only as screeners. A general reasoning score is not a chart-reading score unless it predicts success on your own PDFs, charts, and screenshots.
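Several of these criteria can be graded mechanically against the answer key. The sketch below covers numeric fidelity, evidence location, and the review flag. It assumes amounts are written in a simple form such as "$1.2 million" or "-4.5%", and `normalize_amount`, `answers_match`, and `grade` are hypothetical helpers rather than an established scoring library.

```python
import re
from decimal import Decimal
from typing import Optional

SCALES = {
    "thousand": Decimal(10**3),
    "million": Decimal(10**6),
    "billion": Decimal(10**9),
}
AMOUNT = re.compile(
    r"^\s*(-?)\s*(\$?)\s*([\d,]+(?:\.\d+)?)\s*(thousand|million|billion|%)?\s*$",
    re.IGNORECASE,
)


def normalize_amount(text: str) -> Optional[tuple[Decimal, str]]:
    """Return (value, unit) for simple amounts, or None if the text is not one."""
    match = AMOUNT.match(text)
    if match is None:
        return None
    sign, currency, digits, suffix = match.groups()
    value = Decimal(digits.replace(",", ""))
    suffix = (suffix or "").lower()
    if suffix in SCALES:
        value *= SCALES[suffix]
    if sign:
        value = -value
    unit = "%" if suffix == "%" else currency  # "$", "%", or "" when no unit is given
    return value, unit


def answers_match(expected: str, actual: str) -> bool:
    """Compare amounts by value and unit; compare other answers as text."""
    exp, act = normalize_amount(expected), normalize_amount(actual)
    if exp is not None or act is not None:
        return exp == act  # a guess where "not readable" was expected fails here
    return expected.strip().casefold() == actual.strip().casefold()


def grade(expected: dict, actual: dict) -> dict:
    """Score one response against its answer-key entry."""
    checks = {
        "answer": answers_match(expected["answer"], actual["answer"]),
        "evidence_location": expected["page_or_region"].strip().casefold()
        == actual["page_or_region"].strip().casefold(),
        "review_flag": expected["needs_human_review"] == actual["needs_human_review"],
    }
    checks["passed"] = all(checks.values())
    return checks
```

Under this sketch, "$1,200,000" matches an expected "$1.2 million", while "1.2 billion" fails on both value and unit. An answer key entry of "not readable" only passes when the model also returns "not readable", which is how the uncertainty criterion is enforced.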
Use this decision rule tomorrow: route live screenshot questions synchronously, route offline homogeneous backlogs to batch if they fit the provider envelope, and route exact chart or table answers through extraction plus model review. If a model cannot cite the source area and preserve the unit, it is not ready for that review class.
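The decision rule can also be written down as a small routing function so it is testable rather than tribal knowledge. This is a sketch, not a definitive policy: the `Route` names, the four boolean inputs, and the precedence order (exact numbers first, then live traffic, then batch) are assumptions layered on top of the rule above.

```python
from enum import Enum


class Route(Enum):
    SYNC_VISION = "synchronous vision request"
    BATCH = "provider batch endpoint"
    EXTRACT_THEN_REVIEW = "extraction plus model review"


def choose_route(*, user_is_waiting: bool, needs_exact_numbers: bool,
                 homogeneous: bool, fits_batch_envelope: bool) -> Route:
    """Apply the routing rule; precedence order is an illustrative assumption."""
    if needs_exact_numbers:
        # Source data decides the numbers; the model explains them.
        return Route.EXTRACT_THEN_REVIEW
    if user_is_waiting:
        # Live questions cannot wait for a batch completion window.
        return Route.SYNC_VISION
    if homogeneous and fits_batch_envelope:
        return Route.BATCH
    # Offline work that is mixed or too large for the envelope stays synchronous.
    return Route.SYNC_VISION
```

A live error-screenshot question maps to `Route.SYNC_VISION`, a nightly invoice backlog that fits the provider envelope maps to `Route.BATCH`, and any chart or table question that needs exact values maps to `Route.EXTRACT_THEN_REVIEW`.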
FAQ
When should you not trust vision-only extraction?
Do not trust vision-only extraction when the answer affects billing, revenue recognition, compliance review, contract terms, or financial reporting. Keep OCR, PDF parsing, table extraction, or source-data retrieval in the pipeline and ask the model to reason over extracted evidence.
What is the most common evaluation mistake?
The most common mistake is grading only the final answer. For multimodal review, also grade evidence location, unit preservation, selected filters, disabled controls, uncertainty behavior, and whether the model returned the schema your product needs.
Which provider edge case changes routing fastest?
File fit changes routing fastest. A model can be the best reader and still be the wrong route if the PDF exceeds page limits, the batch file exceeds request caps, or the completion window is too slow for the user workflow.
Sources
- [1] Anthropic PDF support: https://docs.anthropic.com/en/docs/build-with-claude/pdf-support
- [2] Vertex AI document understanding: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/document-understanding
- [3] OpenAI Responses API reference: https://platform.openai.com/docs/api-reference/responses
- [4] OpenAI images and vision guide: https://platform.openai.com/docs/guides/images-vision
- [5] OpenAI function calling guide: https://platform.openai.com/docs/guides/function-calling
- [6] Anthropic Claude tool use overview: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/overview
- [7] Anthropic Message Batches API reference: https://docs.anthropic.com/en/api/creating-message-batches
- [8] Anthropic batch processing docs: https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
- [9] Vertex AI batch inference for Gemini: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
- [10] Amazon Bedrock batch inference: https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html
- [11] Amazon Bedrock foundation model information: https://docs.aws.amazon.com/bedrock/latest/userguide/models-get-info.html
- [12] Azure OpenAI batch docs: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/batch
- [13] Azure OpenAI quotas and limits: https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
- [14] SEC EDGAR search: https://www.sec.gov/edgar/search/