Published Research · March 2026

Hallucination Benchmark for Financial Document AI

We tested 120 real due diligence questionnaire (DDQ) questions across three AI systems, using the same documents, the same infrastructure, and the same LLM. Only the retrieval pipeline differs.

Accuracy: 96.7% on answerable DDQ questions (vs 30% for standard approaches)
Hallucination Rate: 0.8% across all 120 questions (vs 25% for standard RAG)
Refusal Integrity: 98.3% declines when evidence is absent (vs 55% for standard approaches)

Key Finding

Standard Retrieval Made Hallucinations Worse

Standard RAG retrieved superficially relevant passages that encouraged the model to blend real information with fabricated details.

Plain LLM: 11.7% fabrication rate
Standard RAG: 25.0% fabrication rate (+114% increase vs plain LLM)
BackPro: 0.8% fabrication rate (97% reduction vs standard RAG)

Same 120 questions. Same documents. Same LLM. Same infrastructure. Only the retrieval pipeline differs.
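The two relative figures follow directly from the three published fabrication rates. A minimal sketch of that arithmetic, where the function name is illustrative and only the three rates come from the benchmark:

```python
def relative_change_pct(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (positive means an increase)."""
    return (after - before) / before * 100.0


# Fabrication rates (percent) as published above.
plain_llm = 11.7
standard_rag = 25.0
backpro = 0.8

# Plain LLM -> Standard RAG is an increase; Standard RAG -> BackPro is a reduction.
increase = relative_change_pct(plain_llm, standard_rag)
reduction = -relative_change_pct(standard_rag, backpro)

print(f"Plain LLM -> Standard RAG: +{increase:.0f}%")
print(f"Standard RAG -> BackPro: {reduction:.0f}% reduction")
```

Both figures are relative changes, not percentage-point differences: the absolute gap between standard RAG and BackPro is 24.2 points.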

Benchmark Methodology

120 DDQ questions tested
3 systems compared
60 / 60 answerable / unanswerable
1 variable changed (the retrieval pipeline)
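With a 60 / 60 split, the three headline metrics can be computed from per-question outcomes. The sketch below shows one way to do that. The outcome labels ("correct", "fabricated", "declined") and the `Result` type are assumptions made for illustration, since the page does not describe how answers were graded, and the sample data is a toy set rather than benchmark output.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Result:
    answerable: bool  # True if the documents contain the evidence needed
    outcome: str      # hypothetical labels: "correct", "fabricated", "declined"


def score(results: List[Result]) -> Dict[str, float]:
    """Compute the three headline metrics from graded results."""
    answerable = [r for r in results if r.answerable]
    unanswerable = [r for r in results if not r.answerable]
    return {
        # Correct answers among questions that can be answered.
        "accuracy": sum(r.outcome == "correct" for r in answerable) / len(answerable),
        # Fabricated answers across every question, answerable or not.
        "hallucination_rate": sum(r.outcome == "fabricated" for r in results) / len(results),
        # Declines among questions whose evidence is absent.
        "refusal_integrity": sum(r.outcome == "declined" for r in unanswerable) / len(unanswerable),
    }


# Toy data for illustration only; these are NOT benchmark results.
toy = [
    Result(True, "correct"), Result(True, "correct"),
    Result(True, "declined"), Result(True, "fabricated"),
    Result(False, "declined"), Result(False, "declined"),
    Result(False, "declined"), Result(False, "fabricated"),
]
metrics = score(toy)
print(metrics)
```

Note the different denominators: accuracy and refusal integrity each use one half of the question set, while the hallucination rate uses all of it.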

Detailed Results

Benchmark Data

Interactive charts comparing all three systems across four dimensions.

Factual Accuracy on Answerable Questions

Percentage of correct, evidence-based answers (n=60)

BackPro answered 96.7% of answerable DDQ questions correctly, compared with 30% for standard retrieval. Same documents, same LLM, different pipeline.

Architecture

Why BackPro Achieves These Results

A four-stage pipeline designed to prevent fabrication at each step.

Step 01

Hybrid Retrieval

Text and visual embeddings (ColPali) locate the precise page, table, or chart, not just semantically similar text.
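The page does not say how the text and visual rankings are combined. As an illustration only, the sketch below merges two ranked lists of page IDs with reciprocal rank fusion, a common general-purpose technique. The page IDs and the constant `k=60` are assumptions, and no ColPali call is made here; the two input rankings stand in for whatever each retriever returns.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists; items ranked highly in more lists score higher."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by page ID for determinism.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


# Hypothetical retriever outputs, best match first.
text_ranking = ["p12", "p03", "p27"]    # from text embeddings
visual_ranking = ["p27", "p12", "p41"]  # from visual (page-image) embeddings

fused = reciprocal_rank_fusion([text_ranking, visual_ranking])
print(fused)
```

Pages that both retrievers surface (here `p12` and `p27`) rise above pages found by only one, which is the property a hybrid approach relies on.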

Step 02

Pre-Verified Matching

Questions are matched against a curated answer bank before extraction begins, eliminating redundant processing.
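A minimal sketch of matching a question against an answer bank is shown below. It assumes a simple string-similarity match with a fixed threshold, using Python's standard `difflib`. The bank contents, the threshold, and the function names are hypothetical, and a production system might well use embedding similarity instead.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def match_answer_bank(question: str, bank: Dict[str, str],
                      threshold: float = 0.85) -> Optional[str]:
    """Return the banked answer for the closest banked question, or None."""
    best_answer, best_score = None, 0.0
    for banked_question, answer in bank.items():
        score = SequenceMatcher(None, normalize(question),
                                normalize(banked_question)).ratio()
        if score > best_score:
            best_answer, best_score = answer, score
    # Below the threshold, fall through to extraction instead of guessing.
    return best_answer if best_score >= threshold else None


# Hypothetical bank with a placeholder answer.
bank = {"What is the fund's inception date?": "ANSWER-001 (placeholder)"}

hit = match_answer_bank("What is the  Fund's inception date?", bank)
miss = match_answer_bank("Who is the custodian?", bank)
print(hit, miss)
```

Returning `None` on a weak match matters as much as the match itself: a near miss that reuses the wrong banked answer would reintroduce the errors the pipeline is meant to prevent.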

Step 03

Structure-Guided Extraction

Answers are extracted using the document's own structure (headers, tables, footnotes), preserving context.

Step 04

Second-Pass Verification

A separate LLM pass cross-checks the extracted answer against the source. Low confidence triggers a decline.
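The decline rule can be sketched as a simple gate around whatever produces the confidence score. In the example below the verifier is a stand-in substring check, not an LLM call. The threshold, the decline message, and all names are assumptions made for illustration.

```python
from typing import Callable

DECLINE_MESSAGE = "Insufficient evidence in the source documents."  # hypothetical wording


def verify_or_decline(answer: str, source_passage: str,
                      verifier: Callable[[str, str], float],
                      min_confidence: float = 0.9) -> str:
    """Return the answer only if the verifier is confident it is supported."""
    confidence = verifier(answer, source_passage)
    if confidence < min_confidence:
        return DECLINE_MESSAGE
    return answer


def toy_verifier(answer: str, source_passage: str) -> float:
    """Stand-in for the second LLM pass: 1.0 if the answer text appears verbatim."""
    return 1.0 if answer.lower() in source_passage.lower() else 0.0


source = "The fund is domiciled in Australia and audited annually."
supported = verify_or_decline("domiciled in Australia", source, toy_verifier)
unsupported = verify_or_decline("domiciled in Singapore", source, toy_verifier)
print(supported)
print(unsupported)
```

Passing the verifier in as a callable keeps the gate independent of the checking method, so the same decline logic applies whether the score comes from an LLM or from a simpler check.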

From Benchmark to Production

What This Means in Practice

DDQ Processing

Manual processing time: 2–3 days
With BackPro: under 1 hour
Accuracy maintained: 96.7%

Verified with lighthouse clients including Selector Funds Management.

Risk Reduction

Standard RAG fabrication: 25.0%
BackPro fabrication: 0.8%
Per 100 DDQ questions processed: ~24 fewer fabricated answers

Each fabricated DDQ response carries regulatory exposure under ASIC RG 271 and APRA CPS 220.
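The "~24 fewer" figure is the difference in expected fabricated answers per 100 questions at the two published rates. A short sketch of that calculation, with an illustrative function name and the assumption that the benchmark rates carry over to new questions:

```python
def expected_fabrications(rate_pct: float, n_questions: int) -> float:
    """Expected number of fabricated answers at a given fabrication rate."""
    return rate_pct / 100.0 * n_questions


n = 100  # questions processed
avoided = expected_fabrications(25.0, n) - expected_fabrications(0.8, n)
print(f"Fabricated answers avoided per {n} questions: {avoided:.1f}")
```

The result scales linearly with volume, so a questionnaire with several hundred questions would see a proportionally larger difference under the same assumption.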

Publication

The Hallucination Benchmark

Full methodology, controlled test design, and detailed results across all 120 questions.

120 DDQ Questions · Controlled Benchmark · Reproducible · Financial Services