CONTROLLED BENCHMARK

96.7% Accuracy.
0.8% Hallucination Rate.

We ran a controlled benchmark with 120 questions across real financial services documents, comparing BackPro against Standard RAG and a Plain LLM. The results speak for themselves.

97% fewer hallucinations vs RAG
3.2× higher accuracy
98.3% proper refusal rate
97.9 composite reliability score

Looking for a non-technical summary? Read the version for firms →


Why Hallucinations Are a Material Risk

In consumer apps, a hallucination is an inconvenience. In financial services, it's a regulatory breach waiting to happen.

Regulatory Risk

Fabricated Compliance Data

Fabricated AI Output

"APRA CPS 234 cybersecurity penetration testing is conducted quarterly by an independent assessor."

Fabricated

No such testing regime exists. Presenting it as fact risks misleading conduct under the Corporations Act.

Data Integrity

Invented Financial Metrics

Fabricated AI Output

"The fund's Sharpe ratio for Q3 2024 was 1.42, reflecting strong risk-adjusted returns."

Fabricated

No Sharpe ratio data exists in any uploaded document. Entirely fabricated.

Audit Failure

Phantom Policies

Fabricated AI Output

"Board-approved ESG framework adopted in Q2 2024 governs all investment screening."

Fabricated

This policy was never created. Audit trail leads nowhere.

Key Finding

Standard RAG Actually Makes Hallucinations Worse

Our benchmark found that Standard RAG increased hallucination rates on answerable questions from 11.7% (Plain LLM) to 25.0%. Retrieved context that is superficially relevant but not precisely on-topic encourages the model to fabricate details that blend real and imagined information.

Hallucination Rate

Plain LLM: 11.7%
Standard RAG: 25.0%
BackPro: 0.8%

Benchmark Results

120 questions. 3 systems. Same LLM, same documents, same infrastructure. Only the retrieval pipeline differs.

Answerable Questions: Factual Accuracy

| System | Accuracy | Hallucination Rate |
| --- | --- | --- |
| Plain LLM (No Retrieval) | 28.3% | 11.7% |
| Standard RAG | 30.0% | 25.0% |
| BackPro | 96.7% | 1.7% |

Unanswerable Questions: Refusal Integrity

The most critical test for enterprise deployment: when a system cannot answer from available documents, the only correct behaviour is to say so.

| System | Proper Refusal Rate | Hallucination Rate |
| --- | --- | --- |
| Plain LLM | 70.0% | 20.0% |
| Standard RAG | 55.0% | 31.7% |
| BackPro | 98.3% | 0.0% |

Overall Hallucination Rate: 0.8%

One borderline case across all 120 questions: a 97% reduction compared to Standard RAG's overall rate of 28.3%.

See the Difference

Real questions from our benchmark. See how each system responds and where the others fabricate.

ANSWERABLE QUESTION

“How is performance monitored and by whom?”

Plain LLM: Hallucinated

“Performance is typically monitored by the portfolio manager and risk team using Bloomberg PORT and FactSet...” Fabricated tools not used by the firm.

Standard RAG: Incorrect

“The compliance team oversees performance measurement...” Retrieved wrong chunk about compliance, not performance.

BackPro: Correct

“Performance is monitored daily by the investment team using internal attribution tools. The CIO conducts weekly reviews against the benchmark, and the Board receives quarterly reports from an independent third-party provider.”

UNANSWERABLE QUESTION

“What was the fund's Sharpe ratio for Q3 2024?”

No Sharpe ratio data for any period exists in the document corpus.

Plain LLM: Hallucinated

“The fund's Sharpe ratio for Q3 2024 was approximately 1.42, reflecting strong risk-adjusted returns...” Entirely fabricated.

Standard RAG: Hallucinated

“Based on the available performance data, the Sharpe ratio was 0.87 for the September quarter...” Fabricated from an unrelated performance chunk.

BackPro: Proper Refusal

“I was unable to find information about the fund's Sharpe ratio for Q3 2024 in the available documents. This specific metric does not appear in any of the uploaded DDQs or compliance documents.”

Why BackPro Performs Better

The performance differential is architectural, not incremental. BackPro's pipeline addresses each failure mode that standard approaches miss.

Standard RAG

Embed Query
Vector Search (top-5)
Inject Chunks
LLM Generate
Return Answer

No verification. No confidence scoring. No document-aware retrieval.
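The five steps above can be sketched as a minimal loop. Everything here is an illustrative stub (a toy bag-of-words embedding and a pass-through LLM), not BackPro's code or any specific vendor's API; the point is that the answer is returned with no verification step at the end.

```python
def embed(text):
    # Toy embedding: bag-of-words token counts (illustrative only).
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def similarity(a, b):
    # Dot product over shared tokens.
    return sum(a[t] * b[t] for t in a if t in b)

def vector_search(query_vec, chunks, k=5):
    # Rank pre-embedded chunks by similarity; keep the top-k.
    ranked = sorted(chunks, key=lambda c: similarity(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

def standard_rag(question, chunks, llm):
    # 1. Embed query  2. Vector search (top-5)  3. Inject chunks
    # 4. LLM generate  5. Return answer -- no verification, no refusal path.
    top = vector_search(embed(question), chunks)
    context = "\n".join(c["text"] for c in top)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```

Whatever the retriever surfaces, relevant or not, is injected and answered from, which is exactly the failure mode described above.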

BackPro Pipeline

Hybrid Retrieval: Text + visual embeddings (ColPali)
DDQ QA Matching: Pre-verified question–answer pairs
Smart Extraction: Document-structure-guided extraction
Relevance Verification: Second LLM verifies answer relevance
Confidence Scoring: Threshold-based answer or refusal

Every answer is traceable to a specific document, page, and extraction method.
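The last two stages, relevance verification and threshold-based refusal, are what the other systems lack. A minimal sketch of that gate follows; the verifier interface, threshold value, and refusal wording are invented for illustration and are not BackPro's actual parameters.

```python
REFUSAL = ("I was unable to find information about this "
           "in the available documents.")

def answer_or_refuse(question, candidate, verifier, threshold=0.75):
    """Return the candidate answer only if a second model scores it
    relevant and grounded above the threshold; otherwise refuse.

    `verifier(question, candidate)` -> float in [0, 1] stands in for
    the second-LLM relevance check (illustrative stub).
    """
    confidence = verifier(question, candidate)
    if confidence >= threshold:
        return {"answer": candidate, "confidence": confidence}
    return {"answer": REFUSAL, "confidence": confidence}
```

The design point is that refusal is a first-class outcome: a low-confidence candidate is discarded rather than returned, which is how a system reaches a high proper-refusal rate on unanswerable questions.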

Annual Risk Exposure

The Cost of Hallucination

For a firm processing 100 DDQs per year, 50 questions each.

Hallucinated Answers (per year)
Standard RAG: 1,350
BackPro: 40 (97% reduction)

Regulatory Incidents (potential, per year)
Standard RAG: 135
BackPro: 4 (97% reduction)

Review Hours Saved
2,800 hours of manual review eliminated annually with BackPro

Based on benchmark hallucination rates applied to 5,000 annual questions. Regulatory incident estimate assumes 10% of fabricated answers carry compliance exposure.
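The projections above are straightforward arithmetic over the stated volume and the implied annual rates (1,350 / 5,000 = 27% for Standard RAG; 40 / 5,000 = 0.8% for BackPro), plus the 10% compliance-exposure assumption. A sketch of that calculation:

```python
QUESTIONS_PER_YEAR = 100 * 50  # 100 DDQs per year x 50 questions each

def annual_exposure(hallucination_rate, compliance_share=0.10):
    # Project a hallucination rate onto a year's question volume,
    # then assume a fixed share of fabricated answers carries
    # compliance exposure (the 10% assumption stated above).
    hallucinations = round(QUESTIONS_PER_YEAR * hallucination_rate)
    incidents = round(hallucinations * compliance_share)
    return hallucinations, incidents
```

With `annual_exposure(0.27)` this reproduces the Standard RAG figures (1,350 answers, 135 incidents), and `annual_exposure(0.008)` the BackPro figures (40 and 4).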

Get the Full Benchmark

14 pages of methodology, detailed results, illustrative examples, and architectural analysis. Download the complete whitepaper.