Hallucination Benchmark for Financial Document AI
We tested 120 real DDQ questions across three AI systems on the same documents, the same infrastructure, and the same LLM. Only the retrieval pipeline differed.
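A sketch of that controlled design in code, to make the single-variable setup concrete: everything below except the retrieval function is shared across systems. The names here (run_benchmark, answer_question, the retrievers dict) are illustrative stand-ins, not the benchmark's actual harness.

```python
from typing import Callable

def answer_question(question: str, passages: list[str]) -> str:
    """Stand-in for the shared answering step (same LLM, same prompt for every system)."""
    return f"answer drawn from {len(passages)} passages"

def run_benchmark(retrievers: dict[str, Callable[[str], list[str]]],
                  questions: list[str]) -> dict[str, list[str]]:
    """Run every question through every system, varying only retrieval."""
    answers: dict[str, list[str]] = {name: [] for name in retrievers}
    for question in questions:
        for name, retrieve in retrievers.items():
            passages = retrieve(question)  # the only step that differs per system
            answers[name].append(answer_question(question, passages))
    return answers
```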
Standard Retrieval Made Hallucinations Worse
Standard RAG retrieved superficially relevant passages that encouraged the model to blend real information with fabricated details.
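For reference, this is roughly what that baseline looks like: plain top-k cosine similarity over chunk embeddings, with nothing that checks whether a chunk actually answers the question. A minimal sketch with illustrative names:

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunks: list[dict], k: int = 5) -> list[dict]:
    """Return the k chunks whose embeddings are most cosine-similar to the query.

    Similarity rewards surface overlap with the question, not evidential
    support for an answer, which is how superficially relevant passages
    slip through.
    """
    def score(chunk: dict) -> float:
        v = chunk["vec"]
        return float(query_vec @ v / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
    return sorted(chunks, key=score, reverse=True)[:k]
```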
Benchmark Methodology
Detailed Results
Benchmark Data
Interactive charts comparing all three systems across four dimensions.
Factual Accuracy on Answerable Questions
Percentage of correct, evidence-based answers (n=60)
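The metric itself is simple percentage agreement over the answerable subset. A minimal sketch, with dummy graded results standing in for the real data and illustrative field names:

```python
# Dummy grades for the 60 answerable questions; the real benchmark grades
# each answer as correct and evidence-based before counting.
results = [{"answerable": True, "correct": i % 3 != 0} for i in range(60)]

answerable = [r for r in results if r["answerable"]]
accuracy = 100 * sum(r["correct"] for r in answerable) / len(answerable)
print(f"Factual accuracy on answerable questions: {accuracy:.1f}% (n={len(answerable)})")
```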
Architecture
Why BackPro Achieves These Results
A four-stage pipeline designed to prevent fabrication at each step.
Hybrid Retrieval
Text and visual embeddings (ColPali) locate the precise page, table, or chart, not just semantically similar text.
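A simplified sketch of the scoring idea, assuming precomputed page-level vectors. Real ColPali scoring uses late interaction over patch embeddings rather than a single cosine, and the 50/50 weights and field names here are illustrative assumptions:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_rank(q_text: np.ndarray, q_vis: np.ndarray, pages: list[dict],
                w_text: float = 0.5, w_vis: float = 0.5) -> list[tuple[float, int]]:
    """Rank pages by a weighted blend of text and visual similarity."""
    scored = [(w_text * cosine(q_text, p["text_vec"]) +
               w_vis * cosine(q_vis, p["vis_vec"]), p["page_no"])
              for p in pages]
    return sorted(scored, reverse=True)  # best-scoring page first
```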
Pre-Verified Matching
Questions matched against a curated answer bank before extraction begins, eliminating redundant processing.
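A minimal sketch of the matching step, assuming the bank entries are already embedded; the 0.92 threshold and entry schema are hypothetical:

```python
import numpy as np

def match_answer_bank(question_vec: np.ndarray, bank: list[dict],
                      threshold: float = 0.92) -> str | None:
    """Return a pre-verified answer when a bank question is close enough, else None."""
    best_score, best_answer = -1.0, None
    for entry in bank:  # entry: {"question_vec": np.ndarray, "answer": str}
        v = entry["question_vec"]
        score = float(question_vec @ v /
                      (np.linalg.norm(question_vec) * np.linalg.norm(v)))
        if score > best_score:
            best_score, best_answer = score, entry["answer"]
    # Below the threshold, fall through to full extraction instead.
    return best_answer if best_score >= threshold else None
```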
Structure-Guided Extraction
Answers extracted using the document's own structure (headers, tables, footnotes), preserving context.
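A sketch of what structure-guided scoping can look like, assuming a parsed document tree; the node schema here is an illustrative assumption, not BackPro's actual format:

```python
def section_for_page(doc_tree: list[dict], page_no: int) -> dict:
    """Find the parsed section whose page range contains the target page."""
    for section in doc_tree:
        if section["page_start"] <= page_no <= section["page_end"]:
            return section
    raise LookupError(f"no section covers page {page_no}")

def extraction_context(section: dict) -> str:
    """Serialize header, body, tables, and footnotes as one coherent unit,
    so the model sees the answer in its original structural context."""
    parts = [f"# {section['header']}", section["body"]]
    parts += [f"[TABLE]\n{table}" for table in section.get("tables", [])]
    parts += [f"[FOOTNOTE] {note}" for note in section.get("footnotes", [])]
    return "\n\n".join(parts)
```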
Second-Pass Verification
A separate LLM pass cross-checks the extracted answer against the source. Low confidence triggers a decline.
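A minimal sketch of the decline logic, with a hypothetical verify_llm callable and a 0.8 confidence floor chosen purely for illustration:

```python
DECLINE = "Not found in the provided documents."

def verified_answer(question: str, draft: str, source: str, verify_llm) -> str:
    """Keep the draft only if a second model pass says the source supports it."""
    prompt = (
        "Rate from 0 to 1 how fully the SOURCE supports the ANSWER "
        "to the QUESTION. Reply with the number only.\n"
        f"QUESTION: {question}\nANSWER: {draft}\nSOURCE: {source}"
    )
    confidence = float(verify_llm(prompt))
    return draft if confidence >= 0.8 else DECLINE  # decline beats a guess
```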
From Benchmark to Production
What This Means in Practice
DDQ Processing
Verified with lighthouse clients, including Selector Funds Management.
Risk Reduction
Each fabricated DDQ response carries regulatory exposure under ASIC RG 271 and APRA CPS 220.
Publication
The Hallucination Benchmark
Full methodology, controlled test design, and detailed results across all 120 questions.