CONTROLLED BENCHMARK

96.7% Accuracy.
0.8% Hallucination Rate.

We ran a controlled benchmark — 120 questions across real financial services documents — comparing BackPro against Standard RAG and a Plain LLM. The results speak for themselves.

97% fewer hallucinations vs RAG
3.2× higher accuracy
98.3% proper refusal rate
97.9 composite reliability score
Book a Demo

Why Hallucinations Are a Material Risk

In consumer apps, a hallucination is an inconvenience. In financial services, it's a regulatory breach waiting to happen.

Fabricated Compliance Data

A DDQ response claims "APRA CPS 234 cybersecurity penetration testing" exists when it doesn't. That's misleading conduct under the Corporations Act.

Invented Financial Metrics

An AI fabricates a Sharpe ratio or AUM figure that ends up in investor-facing materials. No document ever contained that number.

Phantom Policies

A compliance response references a "Board-approved ESG framework adopted in Q2 2024" that was never created. Audit trail leads nowhere.

Standard RAG Actually Makes Hallucinations Worse

Our benchmark found that Standard RAG increased hallucination rates on answerable questions from 11.7% (Plain LLM) to 25.0%. Retrieved context that is superficially relevant but not precisely on-topic encourages the model to fabricate details that blend real and imagined information.

Benchmark Results

120 questions. 3 systems. Same LLM, same documents, same infrastructure. Only the retrieval pipeline differs.

Answerable Questions — Factual Accuracy

System | Accuracy | Hallucination Rate
Plain LLM (No Retrieval) | 28.3% | 11.7%
Standard RAG | 30.0% | 25.0%
BackPro | 96.7% | 1.7%

Unanswerable Questions — Refusal Integrity

The most critical test for enterprise deployment: when a system cannot answer from available documents, the only correct behaviour is to say so.

System | Proper Refusal Rate | Hallucination Rate
Plain LLM | 70.0% | 20.0%
Standard RAG | 55.0% | 31.7%
BackPro | 98.3% | 0.0%

Overall Hallucination Rate: 0.8%

One borderline case across all 120 questions: a 97% reduction from Standard RAG's overall hallucination rate of 28.3%.
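The headline figures follow directly from the tables above. A quick check (assuming an even 60/60 split between answerable and unanswerable questions, which is consistent with the reported 28.3%):

```python
# Worked arithmetic behind the headline numbers, using only figures
# reported in the tables above.
total_questions = 120
borderline_cases = 1
backpro_rate = borderline_cases / total_questions        # ~0.008 -> 0.8%

# Standard RAG overall: 25.0% on 60 answerable + 31.7% on 60 unanswerable.
rag_rate = (0.250 * 60 + 0.317 * 60) / total_questions   # ~0.283 -> 28.3%

reduction = 1 - backpro_rate / rag_rate                  # ~0.97 -> 97%
print(f"{backpro_rate:.1%} vs {rag_rate:.1%} -> {reduction:.0%} reduction")
```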

See the Difference

Real questions from our benchmark. See how each system responds — and where the others fabricate.

ANSWERABLE QUESTION

“How is performance monitored and by whom?”

Plain LLM — Hallucinated

“Performance is typically monitored by the portfolio manager and risk team using Bloomberg PORT and FactSet...” — Fabricated tools not used by the firm

Standard RAG — Incorrect

“The compliance team oversees performance measurement...” — Retrieved wrong chunk about compliance, not performance

BackPro — Correct

“Performance is monitored daily by the investment team using internal attribution tools. The CIO conducts weekly reviews against the benchmark, and the Board receives quarterly reports from an independent third-party provider.”

UNANSWERABLE QUESTION

“What was the fund's Sharpe ratio for Q3 2024?”

No Sharpe ratio data for any period exists in the document corpus.

Plain LLM — Hallucinated

“The fund's Sharpe ratio for Q3 2024 was approximately 1.42, reflecting strong risk-adjusted returns...” — Entirely fabricated

Standard RAG — Hallucinated

“Based on the available performance data, the Sharpe ratio was 0.87 for the September quarter...” — Fabricated from an unrelated performance chunk

BackPro — Proper Refusal

“I was unable to find information about the fund's Sharpe ratio for Q3 2024 in the available documents. This specific metric does not appear in any of the uploaded DDQs or compliance documents.”

Why BackPro Performs Better

The performance differential is architectural, not incremental. BackPro's pipeline addresses each failure mode that standard approaches miss.

Standard RAG

Embed Query
Vector Search (top-5)
Inject Chunks
LLM Generate
Return Answer

No verification. No confidence scoring. No document-aware retrieval.
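For concreteness, here is roughly what that baseline looks like in code. This is an illustrative sketch, not any production system: embed() and llm() are toy stand-ins for a real embedding model and chat-completion call.

```python
# Minimal sketch of the Standard RAG baseline: embed the query, retrieve
# the top-5 chunks by cosine similarity, inject them, and generate.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding: character-frequency vector (stand-in for a real model)."""
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def llm(prompt: str) -> str:
    """Stand-in for a chat-completion call."""
    return f"<answer generated from prompt of {len(prompt)} chars>"

def standard_rag(question: str, chunks: list[str], k: int = 5) -> str:
    q = embed(question)
    scores = [float(q @ embed(c)) for c in chunks]
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    context = "\n\n".join(chunks[i] for i in top)
    # Note what is missing: no relevance verification, no confidence score,
    # and no option to refuse. The model must answer whatever comes back,
    # even when the retrieved chunks are only superficially relevant.
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```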

BackPro Pipeline

Hybrid Retrieval: Text + visual embeddings (ColPali)
DDQ QA Matching: Pre-verified question–answer pairs
Smart Extraction: Document-structure-guided extraction
Relevance Verification: A second LLM verifies answer relevance
Confidence Scoring: Threshold-based answer or refusal

Every answer traceable to a specific document, page, and extraction method.
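To make the final two stages concrete, here is a minimal sketch of threshold-based answer-or-refusal with provenance attached. The Candidate fields, the verify() stub, and the 0.75 threshold are illustrative assumptions, not BackPro's actual implementation.

```python
# Sketch of relevance verification plus threshold-based refusal.
# All names and the 0.75 threshold are hypothetical.
from dataclasses import dataclass

@dataclass
class Candidate:
    answer: str
    source: str        # document id
    page: int          # page the answer was extracted from
    method: str        # e.g. "ddq_match" or "smart_extraction"
    confidence: float  # 0.0-1.0, produced upstream

REFUSAL = ("I was unable to find information about this "
           "in the available documents.")

def verify(question: str, candidate: Candidate) -> float:
    """Stand-in for the second-LLM relevance check; returns a score."""
    return candidate.confidence  # a real system would re-score here

def answer_or_refuse(question: str, candidate: Candidate,
                     threshold: float = 0.75) -> str:
    score = verify(question, candidate)
    if score < threshold:
        return REFUSAL  # below threshold, refusing is the correct output
    # Every answer carries its provenance, per the traceability claim above.
    return (f"{candidate.answer}\n[source: {candidate.source}, "
            f"p.{candidate.page}, via {candidate.method}]")
```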

The Cost of Hallucination

Estimated annual risk exposure for a firm processing 100 DDQs per year, 50 questions each.

Metric | Standard RAG | BackPro | Reduction
Hallucinated answers / year | 1,350 | 40 | 97.0%
Potential regulatory incidents | 135 | 4 | 97.0%
Review hours saved | n/a | 2,800 hrs | n/a
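These figures reproduce from the stated inputs. A quick check, assuming the benchmark's 0.8% overall hallucination rate carries over to production volume:

```python
# Reproducing the risk-exposure table from its stated inputs:
# 100 DDQs/year x 50 questions each = 5,000 answers per year.
ddqs_per_year, questions_per_ddq = 100, 50
answers = ddqs_per_year * questions_per_ddq      # 5,000

rag_hallucinations = 1_350                       # per the table (~27% rate,
                                                 # close to the measured 28.3%)
backpro_hallucinations = round(answers * 0.008)  # 0.8% overall rate -> 40

reduction = 1 - backpro_hallucinations / rag_hallucinations
print(f"{reduction:.1%}")                        # 97.0%
```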

Built for Regulatory Compliance

APRA CPS 220

Risk Management

A 0.8% hallucination rate is within acceptable tolerance for a human-reviewed workflow.

ASIC RG 271

Dispute Resolution

Accurate, auditable responses reduce complaints from incorrect information.

APRA CPS 234

Information Security

Confidence scoring provides a quantifiable risk metric for each response.

Get the Full Benchmark

14 pages of methodology, detailed results, illustrative examples, and architectural analysis. Download the complete whitepaper.

Book a Demo