CONTROLLED BENCHMARK

96.7% Accuracy.
0.8% Hallucination Rate.

We ran a controlled benchmark with 120 questions across real financial services documents, comparing BackPro against Standard RAG and a Plain LLM. The results speak for themselves.

97% fewer hallucinations vs RAG
3.2× higher accuracy
98.3% proper refusal rate
97.9 composite reliability score

Looking for a non-technical summary? Read the version for firms →


Why Hallucinations Are a Material Risk

In consumer apps, a hallucination is an inconvenience. In financial services, it's a regulatory breach waiting to happen.

Regulatory Risk

Fabricated Compliance Data

Fabricated AI Output

"APRA CPS 234 cybersecurity penetration testing is conducted quarterly by an independent assessor."

Fabricated

No such testing regime exists. Presenting it as fact risks misleading conduct under the Corporations Act.

Data Integrity

Invented Financial Metrics

Fabricated AI Output

"The fund's Sharpe ratio for Q3 2024 was 1.42, reflecting strong risk-adjusted returns."

Fabricated

No Sharpe ratio data exists in any uploaded document. Entirely fabricated.

Audit Failure

Phantom Policies

Fabricated AI Output

"Board-approved ESG framework adopted in Q2 2024 governs all investment screening."

Fabricated

This policy was never created. Audit trail leads nowhere.

Key Finding

Standard RAG Actually Makes Hallucinations Worse

Our benchmark found that Standard RAG increased hallucination rates on answerable questions from 11.7% (Plain LLM) to 25.0%. Retrieved context that is superficially relevant but not precisely on-topic encourages the model to fabricate details that blend real and imagined information.

Hallucination Rate

Plain LLM: 11.7%
Standard RAG: 25.0%
BackPro: 0.8%

Benchmark Results

120 questions. 3 systems. Same LLM, same documents, same infrastructure. Only the retrieval pipeline differs.

Answerable Questions: Factual Accuracy

| System | Accuracy | Hallucination Rate |
| --- | --- | --- |
| Plain LLM (No Retrieval) | 28.3% | 11.7% |
| Standard RAG | 30.0% | 25.0% |
| BackPro | 96.7% | 1.7% |

Unanswerable Questions: Refusal Integrity

The most critical test for enterprise deployment: when a system cannot answer from available documents, the only correct behaviour is to say so.

| System | Proper Refusal Rate | Hallucination Rate |
| --- | --- | --- |
| Plain LLM | 70.0% | 20.0% |
| Standard RAG | 55.0% | 31.7% |
| BackPro | 98.3% | 0.0% |

Overall Hallucination Rate: 0.8%

One borderline case across all 120 questions: a 97% reduction compared to Standard RAG's overall rate of 28.3%.

See the Difference

Real questions from our benchmark. See how each system responds and where the others fabricate.

ANSWERABLE QUESTION

“How is performance monitored and by whom?”

Plain LLM: Hallucinated

“Performance is typically monitored by the portfolio manager and risk team using Bloomberg PORT and FactSet...” Fabricated tools not used by the firm.

Standard RAG: Incorrect

“The compliance team oversees performance measurement...” Retrieved wrong chunk about compliance, not performance.

BackPro: Correct

“Performance is monitored daily by the investment team using internal attribution tools. The CIO conducts weekly reviews against the benchmark, and the Board receives quarterly reports from an independent third-party provider.”

UNANSWERABLE QUESTION

“What was the fund's Sharpe ratio for Q3 2024?”

No Sharpe ratio data for any period exists in the document corpus.

Plain LLM: Hallucinated

“The fund's Sharpe ratio for Q3 2024 was approximately 1.42, reflecting strong risk-adjusted returns...” Entirely fabricated.

Standard RAG: Hallucinated

“Based on the available performance data, the Sharpe ratio was 0.87 for the September quarter...” Fabricated from an unrelated performance chunk.

BackPro: Proper Refusal

“I was unable to find information about the fund's Sharpe ratio for Q3 2024 in the available documents. This specific metric does not appear in any of the uploaded DDQs or compliance documents.”

Why BackPro Performs Better

The performance differential is architectural, not incremental. BackPro's pipeline addresses each failure mode that standard approaches miss.

Standard RAG

Embed Query
Vector Search (top-5)
Inject Chunks
LLM Generate
Return Answer

No verification. No confidence scoring. No document-aware retrieval.
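The five steps above can be sketched as a minimal loop. Everything here is an illustrative stub (a toy bag-of-words embedding and a pass-through LLM), not BackPro's code or any specific vendor's API; the point is that the answer is returned with no verification step at the end.

```python
def embed(text):
    # Toy embedding: bag-of-words token counts (illustrative only).
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def similarity(a, b):
    # Dot product over shared tokens.
    return sum(a[t] * b[t] for t in a if t in b)

def vector_search(query_vec, chunks, k=5):
    # Rank pre-embedded chunks by similarity; keep the top-k.
    ranked = sorted(chunks, key=lambda c: similarity(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

def standard_rag(question, chunks, llm):
    # 1. Embed query  2. Vector search (top-5)  3. Inject chunks
    # 4. LLM generate  5. Return answer -- no verification, no refusal path.
    top = vector_search(embed(question), chunks)
    context = "\n".join(c["text"] for c in top)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```

Whatever the retriever surfaces, relevant or not, is injected and answered from, which is exactly the failure mode described above.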

BackPro Pipeline

Hybrid Retrieval: Text + visual embeddings (ColPali)
DDQ QA Matching: Pre-verified question–answer pairs
Smart Extraction: Document-structure-guided extraction
Relevance Verification: Second LLM verifies answer relevance
Confidence Scoring: Threshold-based answer or refusal

Every answer is traceable to a specific document, page, and extraction method.
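The last two stages, relevance verification and threshold-based refusal, are what the other systems lack. A minimal sketch of that gate follows; the verifier interface, threshold value, and refusal wording are invented for illustration and are not BackPro's actual parameters.

```python
REFUSAL = ("I was unable to find information about this "
           "in the available documents.")

def answer_or_refuse(question, candidate, verifier, threshold=0.75):
    """Return the candidate answer only if a second model scores it
    relevant and grounded above the threshold; otherwise refuse.

    `verifier(question, candidate)` -> float in [0, 1] stands in for
    the second-LLM relevance check (illustrative stub).
    """
    confidence = verifier(question, candidate)
    if confidence >= threshold:
        return {"answer": candidate, "confidence": confidence}
    return {"answer": REFUSAL, "confidence": confidence}
```

The design point is that refusal is a first-class outcome: a low-confidence candidate is discarded rather than returned, which is how a system reaches a high proper-refusal rate on unanswerable questions.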

Annual Risk Exposure

The Cost of Hallucination

For a firm processing 100 DDQs per year, 50 questions each.

Hallucinated Answers (per year)
Standard RAG: 1,350
BackPro: 40 (97% reduction)

Regulatory Incidents (potential, per year)
Standard RAG: 135
BackPro: 4 (97% reduction)

Review Hours Saved
2,800 hours of manual review eliminated annually with BackPro

Based on benchmark hallucination rates applied to 5,000 annual questions. Regulatory incident estimate assumes 10% of fabricated answers carry compliance exposure.
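The projections above are straightforward arithmetic over the stated volume and the implied annual rates (1,350 / 5,000 = 27% for Standard RAG; 40 / 5,000 = 0.8% for BackPro), plus the 10% compliance-exposure assumption. A sketch of that calculation:

```python
QUESTIONS_PER_YEAR = 100 * 50  # 100 DDQs per year x 50 questions each

def annual_exposure(hallucination_rate, compliance_share=0.10):
    # Project a hallucination rate onto a year's question volume,
    # then assume a fixed share of fabricated answers carries
    # compliance exposure (the 10% assumption stated above).
    hallucinations = round(QUESTIONS_PER_YEAR * hallucination_rate)
    incidents = round(hallucinations * compliance_share)
    return hallucinations, incidents
```

With `annual_exposure(0.27)` this reproduces the Standard RAG figures (1,350 answers, 135 incidents), and `annual_exposure(0.008)` the BackPro figures (40 and 4).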

Get the Full Benchmark

14 pages of methodology, detailed results, illustrative examples, and architectural analysis. Download the complete whitepaper.