Latest results
| Document class | Fixture size | F1 | p50 latency | Refusal rate |
|---|---|---|---|---|
| Invoice (US, English) | 500 | 96.2% | 1.4 s | 0.4% |
| K-1 / K-3 partnership tax | 200 | 94.5% | 2.1 s | 0.6% |
| Bank statement (multi-page) | 250 | 93.1% | 2.8 s | 1.0% |
| Commercial lease (60+ pages) | 150 | 91.7% | 4.2 s | 1.3% |
| MSA / Master Services Agreement | 180 | 92.8% | 3.6 s | 0.9% |
| Superbill / EOB | 220 | 93.4% | 2.0 s | 0.7% |
Methodology
- F1 is computed at the field level. A field is "correct" if it matches the human-verified gold-standard value (after normalization for dates, currency, and whitespace).
- Latency is the p50 (median) end-to-end time, from multipart upload to parsed JSON response, measured at the public API endpoint.
- Refusal rate is the fraction of fixture documents the model declines to extract from (e.g., severely degraded scans). Refusals return a structured error, not a hallucinated record.
- Fixtures and the scoring harness are versioned. Re-runs against historical fixtures land in the changelog so quality regressions are visible.
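
To make the field-level scoring concrete, here is a minimal sketch of how such an F1 could be computed. It is not the actual scoring harness: the function names, the specific date and currency normalization rules, and the way missing or extra fields are counted are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Assumed date formats; the real harness may accept a different set.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def normalize(value: str) -> str:
    """Collapse whitespace, then canonicalize dates and currency (assumed rules)."""
    text = " ".join(value.split())
    # Dates: emit ISO 8601 if any known format parses.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    # Currency: drop "$" and thousands separators, keep two decimals.
    if re.fullmatch(r"\$?\d[\d,]*(\.\d+)?", text):
        return f"{float(text.replace('$', '').replace(',', '')):.2f}"
    return text


def field_f1(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """F1 over fields: a field is correct if its normalized value equals gold."""
    correct = sum(
        1
        for name, value in predicted.items()
        if name in gold and normalize(value) == normalize(gold[name])
    )
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)  # correct / fields the model returned
    recall = correct / len(gold)          # correct / fields in the gold record
    return 2 * precision * recall / (precision + recall)


# Hypothetical fixture: two fields match after normalization, one does not.
gold = {"invoice_date": "2024-03-05", "total": "1200.00", "vendor": "Acme Corp"}
predicted = {"invoice_date": "03/05/2024", "total": "$1,200.00", "vendor": "Acme  Corp."}
print(f"{field_f1(predicted, gold):.3f}")  # prints 0.667
```

In this example the date and total differ only in formatting, so they count as correct after normalization, while the vendor name differs by a trailing period and counts as an error.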
Reproducibility
The fixture set, gold-standard answers, and scoring harness are versioned internally and available to customers on request. Email sales@ordalis.io for access — we share the harness so you can re-run the suite against your own API key and confirm the numbers.
Caveats
- Numbers are averages over each document class, so results on your specific workload may differ.
- F1 doesn't capture every quality signal. Per-field confidence is the production-time signal we recommend you branch on.
- We re-run benchmarks on every model update so you can see the rolling history. The changelog is the canonical record.
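
Branching on per-field confidence might look like the sketch below. The response shape (a `value` and a `confidence` per field), the `route_fields` helper, and the 0.90 threshold are all hypothetical, chosen only to illustrate the pattern; check the API reference for the actual field names and tune the cutoff on your own documents.

```python
REVIEW_THRESHOLD = 0.90  # hypothetical cutoff, not a recommended value


def route_fields(
    fields: dict[str, dict], threshold: float = REVIEW_THRESHOLD
) -> tuple[dict[str, str], dict[str, str]]:
    """Split extracted fields into auto-accepted and needs-human-review."""
    accepted: dict[str, str] = {}
    needs_review: dict[str, str] = {}
    for name, field in fields.items():
        # Assumed shape: {"value": <str>, "confidence": <float in [0, 1]>}
        target = accepted if field["confidence"] >= threshold else needs_review
        target[name] = field["value"]
    return accepted, needs_review


# Hypothetical extraction result for a single invoice.
fields = {
    "total": {"value": "1200.00", "confidence": 0.98},
    "due_date": {"value": "2024-04-01", "confidence": 0.61},
}
accepted, needs_review = route_fields(fields)
print(accepted)      # {'total': '1200.00'}
print(needs_review)  # {'due_date': '2024-04-01'}
```

Routing per field rather than per document lets high-confidence values flow through automatically while only the uncertain ones go to a reviewer.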