We ran a controlled benchmark — 120 questions across real financial services documents — comparing BackPro against Standard RAG and a Plain LLM. The results speak for themselves.
In consumer apps, a hallucination is an inconvenience. In financial services, it's a regulatory breach waiting to happen.
- A DDQ response claims "APRA CPS 234 cybersecurity penetration testing" exists when it doesn't. That's misleading conduct under the Corporations Act.
- An AI fabricates a Sharpe ratio or AUM figure that ends up in investor-facing materials. No document ever contained that number.
- A compliance response references a "Board-approved ESG framework adopted in Q2 2024" that was never created. The audit trail leads nowhere.
Our benchmark found that Standard RAG increased hallucination rates on answerable questions from 11.7% (Plain LLM) to 25.0%. Retrieved context that is superficially relevant but not precisely on-topic encourages the model to fabricate details that blend real and imagined information.
120 questions. 3 systems. Same LLM, same documents, same infrastructure. Only the retrieval pipeline differs.
| System | Accuracy (Answerable Questions) | Hallucination Rate |
|---|---|---|
| Plain LLM (No Retrieval) | 28.3% | 11.7% |
| Standard RAG | 30.0% | 25.0% |
| BackPro | 96.7% | 1.7% |
The most critical test for enterprise deployment: when a system cannot answer from available documents, the only correct behaviour is to say so.
| System | Proper Refusal Rate (Unanswerable Questions) | Hallucination Rate |
|---|---|---|
| Plain LLM | 70.0% | 20.0% |
| Standard RAG | 55.0% | 31.7% |
| BackPro | 98.3% | 0.0% |
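Refusal behaviour of this kind can be enforced with a simple gate: if retrieval yields no sufficiently relevant evidence, the system returns an explicit "not found" response instead of passing weak context to the LLM. The scoring function, threshold, and wording below are illustrative assumptions, not BackPro's actual implementation.

```python
# Minimal sketch of confidence-gated refusal (illustrative assumption,
# not BackPro's actual pipeline). Each retrieved chunk carries a
# relevance score; if no chunk clears the threshold, the system refuses
# rather than generate from weak context.

REFUSAL_TEXT = ("I was unable to find information about this question "
                "in the available documents.")

def answer_or_refuse(question, retrieved_chunks, threshold=0.75):
    """retrieved_chunks: list of (text, relevance_score) pairs."""
    evidence = [(text, score) for text, score in retrieved_chunks
                if score >= threshold]
    if not evidence:
        return REFUSAL_TEXT  # proper refusal: no grounded evidence
    # In a real system, the surviving evidence would be passed to the LLM.
    best_text, _ = max(evidence, key=lambda pair: pair[1])
    return f"Answer grounded in: {best_text}"

# An unanswerable question with only weakly relevant chunks triggers a
# refusal instead of a fabricated figure.
print(answer_or_refuse("Sharpe ratio for Q3 2024?",
                       [("compliance policy excerpt", 0.41),
                        ("AUM disclosure", 0.38)]))
```

The failure mode in the Standard RAG row above corresponds to skipping this gate entirely: superficially relevant chunks go straight to the model, which then blends them into a fabricated answer.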
BackPro's overall hallucination rate across all 120 questions: 0.8%. That is a single borderline case, a 97% reduction from Standard RAG's overall rate of 28.3%.
Real questions from our benchmark. See how each system responds — and where the others fabricate.
“How is performance monitored and by whom?”
Plain LLM — Hallucinated
“Performance is typically monitored by the portfolio manager and risk team using Bloomberg PORT and FactSet...” — Fabricated tools not used by the firm
Standard RAG — Incorrect
“The compliance team oversees performance measurement...” — Retrieved wrong chunk about compliance, not performance
BackPro — Correct
“Performance is monitored daily by the investment team using internal attribution tools. The CIO conducts weekly reviews against the benchmark, and the Board receives quarterly reports from an independent third-party provider.”
“What was the fund's Sharpe ratio for Q3 2024?”
No Sharpe ratio data for any period exists in the document corpus.
Plain LLM — Hallucinated
“The fund's Sharpe ratio for Q3 2024 was approximately 1.42, reflecting strong risk-adjusted returns...” — Entirely fabricated
Standard RAG — Hallucinated
“Based on the available performance data, the Sharpe ratio was 0.87 for the September quarter...” — Fabricated from an unrelated performance chunk
BackPro — Proper Refusal
“I was unable to find information about the fund's Sharpe ratio for Q3 2024 in the available documents. This specific metric does not appear in any of the uploaded DDQs or compliance documents.”
The performance differential is architectural, not incremental. Standard RAG ships with no verification, no confidence scoring, and no document-aware retrieval. BackPro's pipeline addresses each of those failure modes, making every answer traceable to a specific document, page, and extraction method.
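Traceability of that kind amounts to attaching a structured citation to every answer. The record below is a sketch of what such an audit trail could look like; the field names and values are illustrative assumptions, not BackPro's actual schema.

```python
from dataclasses import dataclass, asdict

# Illustrative citation record (field names are assumptions, not
# BackPro's schema): every answer carries the document, page, and
# extraction method it was derived from, so reviewers can audit it.

@dataclass(frozen=True)
class Citation:
    document: str           # source file the answer was drawn from
    page: int               # page number within that document
    extraction_method: str  # e.g. "text layer", "OCR", "table parser"
    confidence: float       # 0.0-1.0 score attached to the response

@dataclass(frozen=True)
class Answer:
    text: str
    citation: Citation

ans = Answer(
    text="Performance is monitored daily by the investment team.",
    citation=Citation(document="DDQ_2024.pdf", page=12,
                      extraction_method="text layer", confidence=0.93),
)
print(asdict(ans)["citation"]["document"])  # → DDQ_2024.pdf
```

Freezing the dataclasses keeps the audit record immutable once emitted, which is the property a compliance reviewer actually cares about.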
Estimated annual risk exposure for a firm processing 100 DDQs per year, 50 questions each.
| Metric | Standard RAG | BackPro | Reduction |
|---|---|---|---|
| Hallucinated answers / year | 1,350 | 40 | 97.0% |
| Potential regulatory incidents | 135 | 4 | 97.0% |
| Review hours saved | — | 2,800 hrs | — |
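The exposure figures above follow from straightforward arithmetic on question volume and hallucination rate. The sketch below reproduces the calculation; the 27% Standard RAG rate and the 10% hallucination-to-incident conversion are inferred from the table's own figures, not stated elsewhere in the benchmark.

```python
# Reproduces the risk-exposure arithmetic above. The 27% Standard RAG
# rate and the 10% hallucination-to-incident conversion are inferred
# from the table's figures, not independently stated.

DDQS_PER_YEAR = 100
QUESTIONS_PER_DDQ = 50
QUESTIONS = DDQS_PER_YEAR * QUESTIONS_PER_DDQ   # 5,000 questions/year

def exposure(hallucination_rate, incident_fraction=0.10):
    hallucinations = round(QUESTIONS * hallucination_rate)
    incidents = round(hallucinations * incident_fraction)
    return hallucinations, incidents

std_h, std_i = exposure(0.27)    # Standard RAG
bp_h, bp_i = exposure(0.008)     # BackPro (0.8% overall rate)
reduction = 1 - bp_h / std_h

print(std_h, std_i)              # → 1350 135
print(bp_h, bp_i)                # → 40 4
print(f"{reduction:.1%}")        # → 97.0%
```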
**Risk Management.** A 0.8% hallucination rate is within acceptable tolerance for a human-reviewed workflow.

**Dispute Resolution.** Accurate, auditable responses reduce complaints arising from incorrect information.

**Information Security.** Confidence scoring provides a quantifiable risk metric for each response.
14 pages of methodology, detailed results, illustrative examples, and architectural analysis. Download the complete whitepaper.