Benchmarks — Ordalis

Latest results

70 fixtures across 6 regulated document classes, run against the production API on the current extraction engine. Overall field-level F1: 96.5%. Required-fields F1: 98.9%. 0 of 70 conversions failed.

Document class	Fixtures	Required-fields F1	Overall F1
Invoice	15	100.0%	100.0%
Schedule K-1	10	100.0%	100.0%
Bank statement	10	100.0%	98.8%
Contract	15	100.0%	100.0%
Audit report	10	92.5%	88.6%
Engagement letter	10	100.0%	85.7%*

*Engagement-letter misses on this run are strict-comparison mismatches on one long free-text field (the scope-of-engagement paragraph): the model's output differs from the gold text in capitalization and punctuation, but no data is missing or wrong. The prior run's known gap (nested sub-fields such as currency codes on rate tables) is fixed as of 2026-07-06: the extractor now receives the full nested schema shape. Audit-report misses are concentrated in two fields: a going-concern flag that the model leaves null instead of false, and opinion-type wording that doesn't match the enum exactly.
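To illustrate why a strict comparison penalizes the scope-of-engagement field even when the content is right, here is a minimal sketch. Both function names and the sample sentences are hypothetical; the relaxed comparator is shown only for contrast and is not claimed to be part of the scoring harness.

```python
import re
import string


def strict_match(predicted: str, gold: str) -> bool:
    """Exact comparison after collapsing whitespace only (hypothetical helper)."""
    def collapse(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()
    return collapse(predicted) == collapse(gold)


def relaxed_match(predicted: str, gold: str) -> bool:
    """Illustrative comparison that also ignores case and ASCII punctuation."""
    def canon(s: str) -> str:
        s = s.casefold()
        s = s.translate(str.maketrans("", "", string.punctuation))
        return re.sub(r"\s+", " ", s).strip()
    return canon(predicted) == canon(gold)


# Made-up example text, not taken from the fixture set.
gold = "The Firm will prepare the Client's federal tax return."
predicted = "the firm will prepare the client's federal tax return"

strict_result = strict_match(predicted, gold)    # counted as a miss
relaxed_result = relaxed_match(predicted, gold)  # same words, different casing
```

Under these assumptions the strict check reports a mismatch while the relaxed one does not, which is the kind of miss described in the footnote above.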

Methodology

F1 is computed at the field level. A field is "correct" if it matches the gold-standard value (after normalization for dates, currency, and whitespace). Required-fields F1 covers the fields marked required in each document class's schema — the ones downstream workflows depend on.
Fixtures are deterministic synthetic documents for six regulated templates (invoice, bank statement, contract, engagement letter, Schedule K-1, audit report), generated and scored by the harness in our repo. Synthetic means we own the ground truth exactly. It also means these are clean, single-page text documents; real scans and long multi-page documents are harder, so in production, per-field confidence is the signal to branch on.
Latency: end-to-end (upload → completed result) averaged 17.5 s per document on this run, with a range of 10–33 s. That is about 7 s slower than the prior run's 10.4 s average: the accuracy gains above come from a richer extraction schema, which costs some latency. The dashboard shows live progress and an ETA per conversion.
Fixtures and the scoring harness are versioned. Re-runs against historical fixtures land in the changelog so quality regressions are visible.
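The field-level scoring described above can be sketched roughly as follows. This is an assumption-laden illustration, not the harness itself: the function names, the whitespace-only normalization, and the way true positives, false positives, and false negatives are counted are all hypothetical choices made for the example.

```python
import re


def normalize(value):
    """Hypothetical normalization: collapse whitespace in strings.
    The page says the real scoring also normalizes dates and currency."""
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value).strip()
    return value


def field_f1(predicted: dict, gold: dict, fields=None) -> float:
    """Field-level F1 for one document.

    Pass `fields` (for example, the schema's required fields) to restrict
    scoring to that subset; by default every field in either dict is scored.
    """
    names = set(fields) if fields is not None else set(predicted) | set(gold)
    tp = fp = fn = 0
    for name in names:
        p = normalize(predicted.get(name))
        g = normalize(gold.get(name))
        if p is None and g is None:
            continue  # nothing expected, nothing extracted
        if p is not None and g is not None and p == g:
            tp += 1
            continue
        if p is not None:
            fp += 1  # extracted a value that is wrong or unexpected
        if g is not None:
            fn += 1  # expected value was missed or wrong
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up document: the flag is left null where the gold value is False.
gold_doc = {"invoice_number": "INV-1", "total": "100.00", "going_concern": False}
pred_doc = {"invoice_number": "INV-1 ", "total": "100.00", "going_concern": None}

overall = field_f1(pred_doc, gold_doc)
required_only = field_f1(pred_doc, gold_doc, fields={"invoice_number", "total"})
```

In this sketch a null where the gold value is false counts as a missed field, which lowers recall without affecting precision.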

Reproducibility

The fixture set, gold-standard answers, and scoring harness are versioned internally and available to customers on request. Email sales@ordalis.io for access — we share the harness so you can re-run the suite against your own API key and confirm the numbers.

Caveats

Numbers are document-class averages — your specific workload may differ.
F1 doesn't capture every quality signal. Per-field confidence is the production-time signal we recommend you branch on.
We re-run benchmarks on every model update so you can see the rolling history. The changelog is the canonical record.
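As a sketch of what branching on per-field confidence might look like in a downstream workflow: the result shape (a dict of field name to `value` and `confidence`), the function name, and the 0.90 threshold are all assumptions for illustration, not the documented API response or a recommended cutoff.

```python
REVIEW_THRESHOLD = 0.90  # hypothetical cutoff; tune it on your own documents


def route_fields(fields: dict, threshold: float = REVIEW_THRESHOLD):
    """Split extracted fields into auto-accepted values and values for human review.

    `fields` is assumed to map field name -> {"value": ..., "confidence": float}.
    """
    accepted, needs_review = {}, {}
    for name, field in fields.items():
        value = field.get("value")
        confidence = field.get("confidence")
        confident = confidence is not None and confidence >= threshold
        if value is not None and confident:
            accepted[name] = value
        else:
            # Null values and low-confidence values both go to review.
            needs_review[name] = value
    return accepted, needs_review


# Made-up extraction result for illustration.
result = {
    "total": {"value": "100.00", "confidence": 0.99},
    "opinion_type": {"value": "Unqualified", "confidence": 0.62},
    "going_concern": {"value": None, "confidence": 0.95},
}
accepted, needs_review = route_fields(result)
```

Sending null values to review regardless of confidence is a deliberate choice here, since a null flag can be ambiguous between "false" and "not found".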

Get this as a 1-pager

The benchmark numbers and methodology, summarized for sharing — sent to your inbox.
