Hallucination Benchmark for Financial Document AI
We tested 120 real DDQ questions across three AI systems on the same documents, the same infrastructure, and the same LLM. Only the retrieval pipeline differed.
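A sketch of that controlled design in code, to make the single-variable setup concrete: everything below except the retrieval function is shared across systems. The names here (run_benchmark, answer_question, the retrievers dict) are illustrative stand-ins, not the benchmark's actual harness.

```python
from typing import Callable

def answer_question(question: str, passages: list[str]) -> str:
    """Stand-in for the shared answering step (same LLM, same prompt for every system)."""
    return f"answer drawn from {len(passages)} passages"

def run_benchmark(retrievers: dict[str, Callable[[str], list[str]]],
                  questions: list[str]) -> dict[str, list[str]]:
    """Run every question through every system, varying only retrieval."""
    answers: dict[str, list[str]] = {name: [] for name in retrievers}
    for question in questions:
        for name, retrieve in retrievers.items():
            passages = retrieve(question)  # the only step that differs per system
            answers[name].append(answer_question(question, passages))
    return answers
```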
Standard Retrieval Made Hallucinations Worse
Standard RAG retrieved superficially relevant passages that encouraged the model to blend real information with fabricated details.
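For reference, this is roughly what that baseline looks like: plain top-k cosine similarity over chunk embeddings, with nothing that checks whether a chunk actually answers the question. A minimal sketch with illustrative names:

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunks: list[dict], k: int = 5) -> list[dict]:
    """Return the k chunks whose embeddings are most cosine-similar to the query.

    Similarity rewards surface overlap with the question, not evidential
    support for an answer, which is how superficially relevant passages
    slip through.
    """
    def score(chunk: dict) -> float:
        v = chunk["vec"]
        return float(query_vec @ v / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
    return sorted(chunks, key=score, reverse=True)[:k]
```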
Benchmark Methodology
Detailed Results
Benchmark Data
Interactive charts comparing all three systems across four dimensions.
Factual Accuracy on Answerable Questions
Percentage of correct, evidence-based answers (n=60)
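The metric itself is simple percentage agreement over the answerable subset. A minimal sketch, with dummy graded results standing in for the real data and illustrative field names:

```python
# Dummy grades for the 60 answerable questions; the real benchmark grades
# each answer as correct and evidence-based before counting.
results = [{"answerable": True, "correct": i % 3 != 0} for i in range(60)]

answerable = [r for r in results if r["answerable"]]
accuracy = 100 * sum(r["correct"] for r in answerable) / len(answerable)
print(f"Factual accuracy on answerable questions: {accuracy:.1f}% (n={len(answerable)})")
```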
Architecture
Why BackPro Achieves These Results
A four-stage pipeline designed to prevent fabrication at each step.
Hybrid Retrieval
Text and visual embeddings (ColPali) locate the precise page, table, or chart, not just semantically similar text.
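A simplified sketch of the scoring idea, assuming precomputed page-level vectors. Real ColPali scoring uses late interaction over patch embeddings rather than a single cosine, and the 50/50 weights and field names here are illustrative assumptions:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_rank(q_text: np.ndarray, q_vis: np.ndarray, pages: list[dict],
                w_text: float = 0.5, w_vis: float = 0.5) -> list[tuple[float, int]]:
    """Rank pages by a weighted blend of text and visual similarity."""
    scored = [(w_text * cosine(q_text, p["text_vec"]) +
               w_vis * cosine(q_vis, p["vis_vec"]), p["page_no"])
              for p in pages]
    return sorted(scored, reverse=True)  # best-scoring page first
```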
Pre-Verified Matching
Questions matched against a curated answer bank before extraction begins, eliminating redundant processing.
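A minimal sketch of the matching step, assuming the bank entries are already embedded; the 0.92 threshold and entry schema are hypothetical:

```python
import numpy as np

def match_answer_bank(question_vec: np.ndarray, bank: list[dict],
                      threshold: float = 0.92) -> str | None:
    """Return a pre-verified answer when a bank question is close enough, else None."""
    best_score, best_answer = -1.0, None
    for entry in bank:  # entry: {"question_vec": np.ndarray, "answer": str}
        v = entry["question_vec"]
        score = float(question_vec @ v /
                      (np.linalg.norm(question_vec) * np.linalg.norm(v)))
        if score > best_score:
            best_score, best_answer = score, entry["answer"]
    # Below the threshold, fall through to full extraction instead.
    return best_answer if best_score >= threshold else None
```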
Structure-Guided Extraction
Answers extracted using the document's own structure (headers, tables, footnotes), preserving context.
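A sketch of what structure-guided scoping can look like, assuming a parsed document tree; the node schema here is an illustrative assumption, not BackPro's actual format:

```python
def section_for_page(doc_tree: list[dict], page_no: int) -> dict:
    """Find the parsed section whose page range contains the target page."""
    for section in doc_tree:
        if section["page_start"] <= page_no <= section["page_end"]:
            return section
    raise LookupError(f"no section covers page {page_no}")

def extraction_context(section: dict) -> str:
    """Serialize header, body, tables, and footnotes as one coherent unit,
    so the model sees the answer in its original structural context."""
    parts = [f"# {section['header']}", section["body"]]
    parts += [f"[TABLE]\n{table}" for table in section.get("tables", [])]
    parts += [f"[FOOTNOTE] {note}" for note in section.get("footnotes", [])]
    return "\n\n".join(parts)
```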
Second-Pass Verification
A separate LLM pass cross-checks the extracted answer against the source. Low confidence triggers a decline.
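A minimal sketch of the decline logic, with a hypothetical verify_llm callable and a 0.8 confidence floor chosen purely for illustration:

```python
DECLINE = "Not found in the provided documents."

def verified_answer(question: str, draft: str, source: str, verify_llm) -> str:
    """Keep the draft only if a second model pass says the source supports it."""
    prompt = (
        "Rate from 0 to 1 how fully the SOURCE supports the ANSWER "
        "to the QUESTION. Reply with the number only.\n"
        f"QUESTION: {question}\nANSWER: {draft}\nSOURCE: {source}"
    )
    confidence = float(verify_llm(prompt))
    return draft if confidence >= 0.8 else DECLINE  # decline beats a guess
```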
From Benchmark to Production
What This Means in Practice
DDQ Processing
Verified with lighthouse clients, including Selector Funds Management.
Risk Reduction
Each fabricated DDQ response carries regulatory exposure under ASIC RG 271 and APRA CPS 220.
Publication
The Hallucination Benchmark
Full methodology, controlled test design, and detailed results across all 120 questions.