96.7% Accuracy.
0.8% Hallucination Rate.
We ran a controlled benchmark with 120 questions across real financial services documents, comparing BackPro against Standard RAG and a Plain LLM. The results speak for themselves.
Looking for a non-technical summary? Read the version for firms →
Why Hallucinations Are a Material Risk
In consumer apps, a hallucination is an inconvenience. In financial services, it's a regulatory breach waiting to happen.
Fabricated Compliance Data
"APRA CPS 234 cybersecurity penetration testing is conducted quarterly by an independent assessor."
No such testing exists. Misleading conduct under the Corporations Act.
Invented Financial Metrics
"The fund's Sharpe ratio for Q3 2024 was 1.42, reflecting strong risk-adjusted returns."
No Sharpe ratio data exists in any uploaded document. Entirely fabricated.
Phantom Policies
"Board-approved ESG framework adopted in Q2 2024 governs all investment screening."
This policy was never created. Audit trail leads nowhere.
Standard RAG Actually Makes Hallucinations Worse
Our benchmark found that Standard RAG increased hallucination rates on answerable questions from 11.7% (Plain LLM) to 25.0%. Retrieved context that is superficially relevant but not precisely on-topic encourages the model to fabricate details that blend real and imagined information.
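For readers unfamiliar with the pattern, this is roughly what a "standard RAG" pipeline does: embed the question, pull the top-k chunks by similarity, and pass whatever comes back straight into the prompt. The sketch below is illustrative only; the `embed()` and `generate()` helpers and the function names are assumptions, not BackPro internals or the benchmark code.

```python
# Minimal sketch of the standard-RAG pattern: top-k chunks by cosine similarity,
# no relevance threshold, no post-answer verification. Helpers are illustrative.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def naive_rag_answer(question: str, chunks: list[str], embed, generate, k: int = 4) -> str:
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    context = "\n\n".join(ranked[:k])  # top-k by similarity, on-topic or not
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)  # nothing stops the model from blending context with invention
```

Because the retriever always returns its k nearest chunks, the model is handed plausible-looking context even when no relevant passage exists, which is exactly the condition that produced the 25.0% hallucination rate above.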
Benchmark Results
120 questions. 3 systems. Same LLM, same documents, same infrastructure. Only the retrieval pipeline differs.
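In code, the controlled setup described above looks roughly like the harness below: the LLM, the document corpus, and the question set are held fixed, and only the retrieval pipeline is swapped between runs. The `Pipeline` type, the `run_benchmark` signature, and the prompt format are assumptions for illustration, not the benchmark's actual implementation.

```python
# Sketch of a controlled benchmark harness: everything fixed except retrieval.
from typing import Callable, Optional

# A pipeline takes (question, documents) and returns retrieved context; None = no retrieval.
Pipeline = Optional[Callable[[str, list[str]], str]]

def run_benchmark(questions: list[dict], documents: list[str],
                  llm: Callable[[str], str], pipelines: dict[str, Pipeline]) -> dict:
    results: dict = {name: [] for name in pipelines}
    for name, retrieve in pipelines.items():
        for q in questions:  # each q: {"text": ..., "answerable": bool, "gold": ...}
            context = retrieve(q["text"], documents) if retrieve else ""
            answer = llm(f"Context:\n{context}\n\nQuestion: {q['text']}")
            results[name].append({"question": q, "answer": answer})
    return results

# pipelines = {"Plain LLM": None, "Standard RAG": standard_rag, "BackPro": backpro}
```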
Answerable Questions: Factual Accuracy
| System | Accuracy | Hallucination Rate |
|---|---|---|
| Plain LLM (No Retrieval) | 28.3% | 11.7% |
| Standard RAG | 30.0% | 25.0% |
| BackPro | 96.7% | 1.7% |
Unanswerable Questions: Refusal Integrity
The most critical test for enterprise deployment: when a system cannot answer from available documents, the only correct behaviour is to say so.
| System | Proper Refusal Rate | Hallucination Rate |
|---|---|---|
| Plain LLM | 70.0% | 20.0% |
| Standard RAG | 55.0% | 31.7% |
| BackPro | 98.3% | 0.0% |
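Purely for illustration, scoring refusal integrity on unanswerable questions can be sketched as below: an answer counts as a proper refusal only if it declines rather than asserting facts. The marker list and two-step grading are assumptions; the whitepaper describes the grading methodology actually used.

```python
# Illustrative refusal scoring for unanswerable questions (assumed approach).
REFUSAL_MARKERS = (
    "unable to find", "not in the available documents",
    "does not appear", "no information", "cannot answer",
)

def score_unanswerable(answer: str) -> str:
    text = answer.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "proper_refusal"
    # Anything else is graded further (hallucination vs. hedged or partial answer),
    # which in practice requires a human or LLM judge rather than string matching.
    return "needs_review"
```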
Overall Hallucination Rate: 0.8%
One borderline case across all 120 questions: a 97% reduction compared with Standard RAG's 28.3%.
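The 0.8% figure follows directly from the two tables above. As a quick check, assuming an even 60/60 split between answerable and unanswerable questions (which the per-category percentages imply):

```python
# Reproducing the headline numbers from the per-category rates (assumed 60/60 split).
backpro_hallucinations  = round(0.017 * 60) + round(0.000 * 60)   # 1 + 0  = 1
standard_hallucinations = round(0.250 * 60) + round(0.317 * 60)   # 15 + 19 = 34

backpro_rate  = backpro_hallucinations / 120    # 0.0083 -> 0.8%
standard_rate = standard_hallucinations / 120   # 0.2833 -> 28.3%
reduction = 1 - backpro_rate / standard_rate    # ~0.97  -> ~97% fewer hallucinations
```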
See the Difference
Real questions from our benchmark. See how each system responds and where the others fabricate.
“How is performance monitored and by whom?”
Plain LLM: Hallucinated
“Performance is typically monitored by the portfolio manager and risk team using Bloomberg PORT and FactSet...” Fabricated tools not used by the firm.
Standard RAG: Incorrect
“The compliance team oversees performance measurement...” Retrieved wrong chunk about compliance, not performance.
BackPro: Correct
“Performance is monitored daily by the investment team using internal attribution tools. The CIO conducts weekly reviews against the benchmark, and the Board receives quarterly reports from an independent third-party provider.”
“What was the fund's Sharpe ratio for Q3 2024?”
No Sharpe ratio data for any period exists in the document corpus.
Plain LLM: Hallucinated
“The fund's Sharpe ratio for Q3 2024 was approximately 1.42, reflecting strong risk-adjusted returns...” Entirely fabricated.
Standard RAG: Hallucinated
“Based on the available performance data, the Sharpe ratio was 0.87 for the September quarter...” Fabricated from an unrelated performance chunk.
BackPro: Proper Refusal
“I was unable to find information about the fund's Sharpe ratio for Q3 2024 in the available documents. This specific metric does not appear in any of the uploaded DDQs or compliance documents.”
Why BackPro Performs Better
The performance differential is architectural, not incremental. BackPro's pipeline addresses each failure mode that standard approaches miss.
Standard RAG
No verification. No confidence scoring. No document-aware retrieval.
BackPro Pipeline
Every answer traceable to a specific document, page, and extraction method.
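As an illustration of what "traceable" means in practice, an answer object might carry provenance and a confidence score like the sketch below, so an auditor can walk the chain back to the source. The field names and threshold are assumptions based on the description above, not BackPro's actual schema.

```python
# Illustrative shape of a traceable answer: every claim carries its source
# document, page, and extraction method, plus a confidence score checked
# before the answer is released.
from dataclasses import dataclass

@dataclass
class Citation:
    document: str           # e.g. "DDQ_2024_Q3.pdf" (hypothetical filename)
    page: int
    extraction_method: str  # e.g. "table parse" or "OCR"

@dataclass
class TracedAnswer:
    text: str
    citations: list[Citation]
    confidence: float

    def is_releasable(self, threshold: float = 0.8) -> bool:
        # No citations or low confidence -> refuse rather than guess.
        return bool(self.citations) and self.confidence >= threshold
```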
The Cost of Hallucination
Annual risk exposure for a firm processing 100 DDQs per year, at 50 questions each.
Based on benchmark hallucination rates applied to 5,000 annual questions. The regulatory incident estimate assumes 10% of fabricated answers carry compliance exposure.
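The back-of-envelope version of that exposure model, using the benchmark hallucination rates, looks like this. The per-incident cost is deliberately left out; the whitepaper's dollar figures are not reproduced here.

```python
# Annual exposure estimate from the benchmark rates and the 10% compliance assumption.
questions_per_year = 100 * 50          # 100 DDQs x 50 questions = 5,000

standard_rag_hallucinations = questions_per_year * 0.283   # ~1,415 fabricated answers
backpro_hallucinations      = questions_per_year * 0.008   # ~40 fabricated answers

# 10% of fabricated answers assumed to carry compliance exposure (per the note above)
standard_rag_incidents = standard_rag_hallucinations * 0.10  # ~140 potential incidents
backpro_incidents      = backpro_hallucinations * 0.10       # ~4 potential incidents
```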
Get the Full Benchmark
14 pages of methodology, detailed results, illustrative examples, and architectural analysis. Download the complete whitepaper.