Multimodal Models Compared for PDF, Chart, and Screenshot Review

By Deep Digital Ventures Editorial Team · May 2, 2026

Deep Digital Ventures publishes product education, research explainers, and data-driven articles related to its software tools. This article was prepared by our editorial team using the sources listed below and reviewed for factual accuracy before publication.

This comparison is about which multimodal models should review PDFs, charts, and screenshots in production, not which provider has the longest spec sheet. The useful question is whether a model can read the evidence, preserve the number or UI state, and return a result your product can route safely.

As of 2026-04-23, the pricing, limits, and behaviors below are summarized from provider docs and DDV spot tests; provider pricing and model availability change frequently, so verify the linked sources before quoting in a contract, RFP, or cost plan.

The comparison covers three jobs: PDF review, chart extraction, and screenshot review. You should leave with a routing rule: use PDF-aware models when page structure matters, pair chart vision with source data when exact numbers matter, keep live screenshot questions synchronous, and move only offline homogeneous work to batch.

Before building the test harness, use Deep Digital Ventures AI Models to filter candidates by modality, context window, public benchmark fields, pricing per million input and output tokens, and cost-estimator scenarios. The article can explain the decision; the tool lets you change token volumes, compare candidates side by side, and shortlist models for your own files.

PDFs, Charts, and Screenshots Fail Differently

An SEC Form 10-K, an ASC 606 revenue-recognition footnote, a Power BI screenshot, and a scanned invoice can all enter the same “document review” queue, but they do not fail the same way. The model that summarizes a clean text PDF well may still misread a chart axis, miss a disabled UI control, or invent a number when the scan is low quality.

Task	Recognizable input	Model must handle	Failure risk	Routing hint
PDF review	SEC Form 10-K annual report, contract exhibit, invoice PDF	Text layer, reading order, headings, footnotes, citations, cross-page context	Missing a key clause, losing a footnote, or mixing two page sections	Use a PDF-aware model only when the file fits provider page and size limits; otherwise split with stable page IDs.
Chart extraction	ASC 606 revenue table, ARR chart, board-deck dashboard	Axes, labels, trend direction, units, legends, visible data labels	Returning the right trend but the wrong number, unit, or time period	Use vision for interpretation, but preserve the source table, CSV, XBRL, or spreadsheet when exact values matter.
Screenshot review	Power BI dashboard, billing screen, support ticket screenshot	Visible text, UI state, selected filters, disabled controls, error banners, spatial relationships	Misidentifying a button, missing a selected filter, or treating hidden state as visible evidence	Use synchronous review for user-facing support and batch only for offline QA or backlog labeling.

What Our Spot Tests Showed

DDV ran a compact April 2026 evaluation before writing this post: 50 files total, with 10 clean PDFs, 10 scanned PDFs, 10 chart images, 10 dashboard screenshots, and 10 cross-page documents. Each example had a written answer key with the expected answer, evidence location, page or region, unit, and whether a human review flag should be returned.

The exact tasks were narrow on purpose: identify the revenue row and fiscal year in a filing excerpt, return “not readable” for poor scans, extract a chart value with its unit, name a selected dashboard filter and disabled control, and compare renewal terms across two pages. That made the failures easier to separate from prompt style.

Model route tested	Best result	Common miss	Observed tradeoff
OpenAI vision route through Responses	Strong screenshot reasoning when the prompt required visible evidence and a structured JSON answer.	Occasionally described the right UI state but failed to name the exact selected filter when the dashboard was crowded.	Best fit where the visual answer needs tool calls or policy checks in the same workflow.
Claude PDF route	Strong on long-form PDF questions where headings, footnotes, and charts appeared inside the same document.	Low-quality scans still needed OCR fallback; otherwise the model sometimes answered from plausible context instead of readable evidence.	Best fit for PDF-native review, with page and file limits checked before routing.
Gemini document route on Vertex AI	Handled larger PDF envelopes and offline document batches cleanly in the test plan.	Chart answers still needed source data validation when the visual label was small or the unit changed across panels.	Best fit when long PDFs and cloud batch operations are already part of the pipeline.
Extraction plus model review	Most reliable for CFO charts, revenue tables, and invoices where exact values mattered.	Required more engineering work: PDF parsing, OCR, table extraction, and deterministic checks before the model call.	Higher pipeline complexity, but fewer numeric and unit errors than vision-only review.

The failures were more useful than the wins. Vision-only review was usually good at trend language and screenshots with clear labels, but it was weaker when the answer depended on a tiny axis label, a negative sign, a disabled control, or a number copied from a dense table. The cost and latency tradeoff was straightforward: batch made sense only when delay was acceptable and the files were homogeneous enough to audit by custom ID.

Best Model Traits for PDF Review

For PDF review, compare models on reading order, table handling, page citations, footnotes, scan quality, and file-fit limits. Claude deserves a separate PDF test because Anthropic documents support for text, pictures, charts, and tables inside PDFs, with a 32 MB maximum request size and a limit of 100 pages per request.^[1] Gemini on Vertex AI becomes important when long PDFs are common because the Vertex document-understanding page lists larger model-specific PDF envelopes for Gemini 2.5 Pro and Gemini 2.5 Flash.^[2]
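The file-fit check can be sketched as a pre-routing step. This is a minimal sketch, assuming the 100-page and 32 MB figures cited above and assuming that page count and file size come from whatever PDF parser the pipeline already uses. The names `fits_directly`, `split_pages`, and `PageChunk`, and the page-ID format, are hypothetical.

```python
from dataclasses import dataclass

# Limits as cited in this article for Anthropic PDF support; verify before use.
MAX_PAGES_PER_REQUEST = 100
MAX_REQUEST_BYTES = 32 * 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class PageChunk:
    chunk_id: str    # stable ID, e.g. "contract-7:p101-120"
    first_page: int  # 1-indexed, inclusive
    last_page: int   # 1-indexed, inclusive


def fits_directly(page_count: int, size_bytes: int) -> bool:
    """True when the whole file fits in one request under both limits."""
    return page_count <= MAX_PAGES_PER_REQUEST and size_bytes <= MAX_REQUEST_BYTES


def split_pages(doc_id: str, page_count: int,
                max_pages: int = MAX_PAGES_PER_REQUEST) -> list[PageChunk]:
    """Split a document into consecutive page ranges with stable IDs."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    chunks = []
    for first in range(1, page_count + 1, max_pages):
        last = min(first + max_pages - 1, page_count)
        chunks.append(PageChunk(f"{doc_id}:p{first}-{last}", first, last))
    return chunks
```

Splitting by page count does not guarantee that each chunk is under the size limit, so the byte size of every chunk still needs to be checked after the split. Keeping the original page numbers in the chunk ID lets a later answer cite "page 112" rather than "page 12 of chunk 2".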

The OpenAI path fits application workflows where the visual answer may need a tool call. The Responses API supports text and image inputs with text or JSON outputs, built-in tools, and function calling.^[3] Image inputs are metered in tokens, and OpenAI’s function calling guide shows how an application can give the model JSON-schema tools.^[4]^[5] If a screenshot review must call a policy checker, fetch customer metadata, or open an internal runbook, test that tool flow, not only the visual answer.

What Matters for Chart Extraction

For charts and tables, vision should not be the only source of truth when precision matters. A stronger workflow extracts the text layer, OCR output, table cells, CSV, XBRL value, or spreadsheet first, then asks the model to compare that source data with the rendered page or screenshot.

The practical pipeline is simple: parse the PDF or spreadsheet, attach the rendered page image, ask the model to separate “quoted evidence,” “calculation,” and “inference,” and use tool calling for deterministic checks such as totals, percentages, and date comparisons. OpenAI documents this pattern through function calling, and Anthropic documents it through Claude tool use.^[5]^[6] The model can explain the result, but the arithmetic should come from code or the original data.
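The deterministic-check step can be sketched as plain functions that an application might expose to the model as tools. This is a sketch under stated assumptions: the function names, return shape, and tolerances are hypothetical, and no provider SDK call is shown. Values are passed as strings so that `Decimal` avoids float rounding.

```python
from decimal import Decimal


def check_total(cells: list[str], claimed_total: str,
                tolerance: str = "0.01") -> dict:
    """Recompute a total from extracted table cells and compare it to the model's claim."""
    computed = sum((Decimal(c) for c in cells), Decimal("0"))
    claimed = Decimal(claimed_total)
    return {
        "computed": str(computed),
        "claimed": str(claimed),
        "match": abs(computed - claimed) <= Decimal(tolerance),
    }


def check_percent_change(previous: str, current: str, claimed_pct: str,
                         tolerance: str = "0.1") -> dict:
    """Recompute a period-over-period percentage change from source values."""
    prev, curr = Decimal(previous), Decimal(current)
    if prev == 0:
        # Percentage change from zero is undefined; never confirm the claim.
        return {"computed": None, "claimed": claimed_pct, "match": False}
    computed = (curr - prev) / abs(prev) * Decimal("100")
    return {
        "computed": str(computed.quantize(Decimal("0.01"))),
        "claimed": claimed_pct,
        "match": abs(computed - Decimal(claimed_pct)) <= Decimal(tolerance),
    }
```

With this split, the model supplies the claim and the explanation, while the comparison itself runs in code over the extracted cells. A `match` of `False` is a natural trigger for the human-review flag.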

When Batch Beats Synchronous Review

Batch is a cost and throughput tool, not a default architecture. Use it for nightly invoice extraction, backfill labeling, regression review, and benchmark runs. Keep live screenshot support, interactive contract review, and user-facing error diagnosis on synchronous endpoints unless the product explicitly tells the user the answer will arrive later.

Provider route	Key limits and behavior (as of 2026-04-23)	Best use
OpenAI Responses and Batch	Responses supports text and image inputs, JSON outputs, built-in tools, and function calling. Batch supports offline request files; verify current request and file-size envelopes before submitting.	Interactive workflows where the visual answer may need a tool call, plus offline jobs that fit the batch file envelope.
Anthropic Claude PDF and Message Batches	The PDF support page lists a 32 MB request size and 100 pages per request. A Message Batch may contain up to 100,000 requests or 256 MB, whichever limit is reached first, and can take up to 24 hours to complete; Anthropic documents batch usage at 50% of standard API prices.^[1]^[7]^[8]	PDF-heavy review and offline document jobs where 24-hour completion is acceptable.
Google Vertex AI Gemini	Document understanding lists model-specific PDF limits, including larger file and page envelopes for Gemini 2.5 models. Batch inference docs state a 50% discounted rate, up to 200,000 requests in a single batch job, a 1 GB Cloud Storage input file limit, and possible queueing for up to 72 hours under high traffic. Batch jobs are excluded from the Service Level Objective (SLO) of any SLA. Cache-hit discounts do not stack with batch; the cache-hit discount takes precedence.^[2]^[9]	Long PDFs, large offline jobs, and cloud-native pipelines already running on Vertex AI.
Amazon Bedrock batch	Batch inference uses S3 input and output for asynchronous jobs and is not supported for provisioned models. Model IDs, input modalities, streaming support, and quotas are exposed through Bedrock foundation model information docs.^[10]^[11]	AWS-controlled pipelines where governance, storage, and model access are centered on Bedrock.
Azure OpenAI Batch	Azure OpenAI batch docs describe a 24-hour target turnaround and 50% lower cost than global standard. The quotas page lists batch limits such as 100,000 requests per file and 200 MB maximum input file size.^[12]^[13]	Azure-controlled workloads where discounted offline processing matters more than immediate response.
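The request and file-size figures in the table above can be turned into a pre-submission check. The sketch below copies only the limits stated in the table (Anthropic, Vertex AI, and Azure OpenAI). The route keys, `BatchEnvelope`, and `jobs_needed` are hypothetical names, and the estimate assumes requests are roughly uniform in size.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary megabytes; confirm how each provider counts


@dataclass(frozen=True)
class BatchEnvelope:
    max_requests: int
    max_input_bytes: int


# Figures copied from the table above (dated 2026-04-23); re-verify before use.
ENVELOPES = {
    "anthropic_message_batches": BatchEnvelope(100_000, 256 * MB),
    "vertex_gemini_batch": BatchEnvelope(200_000, 1024 * MB),
    "azure_openai_batch": BatchEnvelope(100_000, 200 * MB),
}


def jobs_needed(route: str, request_count: int, total_input_bytes: int) -> int:
    """Estimate how many batch jobs a backlog needs to stay inside the envelope."""
    env = ENVELOPES[route]
    by_requests = -(-request_count // env.max_requests)      # ceiling division
    by_bytes = -(-total_input_bytes // env.max_input_bytes)  # ceiling division
    return max(1, by_requests, by_bytes)
```

A result above 1 means the backlog has to be split into several jobs, each with its own custom IDs, before it can be submitted on that route.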

Workload	Endpoint choice	Reason
User asks “what does this error screenshot mean?” in a live support flow	Synchronous vision request	The answer is needed now, and batch completion windows are too slow.
Nightly review of 8,000 invoice PDFs	Provider batch endpoint if file size and request count fit	The work is offline, repeatable, and easier to audit by custom ID.
Chart extraction from a CFO board deck	Table extraction plus vision review	The model can explain the trend, but source data should decide the numbers.
Regulated contract review	Synchronous or batch plus mandatory human review	The model output should be evidence, not the final legal decision.

Evaluation Test Set and Scoring

Build the test set before choosing a model. Write the expected answer first, including page number, table row, chart label, visible UI element, confidence, and the human-review rule. If you need public filing shapes without client data, use public-company examples from SEC EDGAR.^[14] Cover at least these six cases:

A clean text PDF with tables, such as a 10-K excerpt where the answer depends on a named row and fiscal year.
A scanned or image-heavy PDF where the correct result should trigger OCR fallback or “not readable” uncertainty.
A dashboard screenshot with filters, visible labels, and at least one disabled or selected control.
A chart with small axis labels, a legend, and units that change the meaning of the answer.
A document that requires cross-page comparison, such as a customer name on page 2 and renewal terms on page 9.
A noisy file with logos, sidebars, comments, or irrelevant screenshots that should not enter the answer.

Run each example synchronously through each candidate model with the same prompt and the same requested JSON fields: answer, evidence, page_or_region, confidence, and needs_human_review. Grade against the answer key before looking at cost. A cheaper model that loses units or source location should not get production traffic for that class.
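Checking that every candidate returns the same five fields can be sketched as a small validator that runs before grading. The field names come from the paragraph above; the field types are assumptions (the article does not define how `confidence` is encoded), and `validate_response` is a hypothetical helper rather than part of any provider SDK.

```python
import json
from typing import Optional

# Field names from the article; the types are assumptions for this sketch.
REQUIRED_FIELDS = {
    "answer": str,
    "evidence": str,
    "page_or_region": str,
    "confidence": (int, float),
    "needs_human_review": bool,
}


def validate_response(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse a model reply and list every schema problem found."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
            continue
        value = data[name]
        # bool is a subclass of int in Python, so reject True/False as a confidence.
        if isinstance(value, bool) and expected is not bool:
            problems.append(f"wrong type for {name}")
        elif not isinstance(value, expected):
            problems.append(f"wrong type for {name}")
    return data, problems
```

A response with a non-empty problem list can be counted as a schema failure for that model and example, separately from whether the answer itself was right.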

Evidence location: require page number, section title, table row, or visible screen region. For a 10-K, “revenue increased” is not enough; the answer should point to the MD&A table or footnote it used.
Numeric fidelity: preserve currency, percentage signs, date ranges, negative signs, and units. A model that reads “$1.2 million” as “1.2 billion” fails the chart or table class.
Layout understanding: verify that headers, footnotes, and sidebars are not merged into the wrong section.
Uncertainty behavior: mark the answer wrong if the model guesses when the axis label, scan, or UI state is unreadable.
Structured output: require the same schema every time so the product can route low-confidence answers to review.
File fit: compare real files against provider limits before choosing a model. A 120-page PDF exceeds the 100-page-per-request limit Anthropic documents for PDF support, while Vertex AI document understanding lists larger per-file page limits for several Gemini models.^[1]^[2]
Benchmark fit: use public benchmark fields only as screeners. A general reasoning score is not a chart-reading score unless it predicts success on your own PDFs, charts, and screenshots.
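Several of the criteria above can be scored separately rather than as one pass/fail. The sketch below is a simplification under stated assumptions: it uses normalized exact string matching, which a real grader would likely relax, and the `AnswerKey` fields are a hypothetical shape based on the answer-key contents described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerKey:
    expected_answer: str      # e.g. "$1.2 million" or "not readable"
    page_or_region: str       # e.g. "page 9"
    unit: str                 # e.g. "million"; "" when no unit applies
    needs_human_review: bool


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail a match."""
    return " ".join(text.lower().split())


def grade(response: dict, key: AnswerKey) -> dict:
    """Score one response on separate criteria, then combine them."""
    answer = _norm(response.get("answer", ""))
    scores = {
        "answer": answer == _norm(key.expected_answer),
        "evidence_location": _norm(response.get("page_or_region", ""))
        == _norm(key.page_or_region),
        "unit": key.unit == "" or _norm(key.unit) in answer,
        "review_flag": response.get("needs_human_review") is key.needs_human_review,
    }
    scores["passed"] = all(scores.values())
    return scores
```

Because an unreadable scan has "not readable" as its expected answer, a confident guess fails the `answer` criterion even when the guessed number looks plausible, which matches the uncertainty rule above.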

Use this decision rule tomorrow: route live screenshot questions synchronously, route offline homogeneous backlogs to batch if they fit the provider envelope, and route exact chart or table answers through extraction plus model review. If a model cannot cite the source area and preserve the unit, it is not ready for that review class.
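The decision rule can be sketched as a small routing function. Treating the endpoint choice and the extraction step as independent decisions is an interpretation for this sketch, not something the rule states outright; `RoutingDecision`, `choose_route`, and the flag names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    endpoint: str              # "synchronous" or "batch"
    extract_source_data: bool  # run parsing, OCR, or table extraction before the model call
    human_review: bool         # mandatory review, e.g. regulated contract work


def choose_route(*, live_user_waiting: bool, exact_numbers_required: bool,
                 homogeneous_backlog: bool, fits_batch_envelope: bool,
                 regulated: bool = False) -> RoutingDecision:
    """Apply the article's routing rule to one workload."""
    # Batch only when nobody is waiting, the work is homogeneous, and it fits the envelope.
    use_batch = (not live_user_waiting) and homogeneous_backlog and fits_batch_envelope
    return RoutingDecision(
        endpoint="batch" if use_batch else "synchronous",
        extract_source_data=exact_numbers_required,
        human_review=regulated,
    )
```

For example, a nightly invoice backlog that fits the provider envelope would come back as `batch` with `extract_source_data=True`, while a live error-screenshot question would come back as `synchronous` with no extraction step.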

FAQ

When should you not trust vision-only extraction?

Do not trust vision-only extraction when the answer affects billing, revenue recognition, compliance review, contract terms, or financial reporting. Keep OCR, PDF parsing, table extraction, or source-data retrieval in the pipeline and ask the model to reason over extracted evidence.

What is the most common evaluation mistake?

The most common mistake is grading only the final answer. For multimodal review, also grade evidence location, unit preservation, selected filters, disabled controls, uncertainty behavior, and whether the model returned the schema your product needs.

Which provider edge case changes routing fastest?

File fit changes routing fastest. A model can be the best reader and still be the wrong route if the PDF exceeds page limits, the batch file exceeds request caps, or the completion window is too slow for the user workflow.

Sources

[1] Anthropic PDF support: https://docs.anthropic.com/en/docs/build-with-claude/pdf-support
[2] Vertex AI document understanding: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/document-understanding
[3] OpenAI Responses API reference: https://platform.openai.com/docs/api-reference/responses
[4] OpenAI images and vision guide: https://platform.openai.com/docs/guides/images-vision
[5] OpenAI function calling guide: https://platform.openai.com/docs/guides/function-calling
[6] Anthropic Claude tool use overview: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/overview
[7] Anthropic Message Batches API reference: https://docs.anthropic.com/en/api/creating-message-batches
[8] Anthropic batch processing docs: https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
[9] Vertex AI batch inference for Gemini: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
[10] Amazon Bedrock batch inference: https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html
[11] Amazon Bedrock foundation model information: https://docs.aws.amazon.com/bedrock/latest/userguide/models-get-info.html
[12] Azure OpenAI batch docs: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/batch
[13] Azure OpenAI quotas and limits: https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
[14] SEC EDGAR search: https://www.sec.gov/edgar/search/