Why Your Financial RAG Is Lying to You (And How to Fix It)

"Missing a single number from search can cost millions."
That was the exact framing Zerodha used in their L2 AI Engineer interview. They weren't being dramatic.

When you build RAG for general Q&A, a missed chunk is an annoyance. When you build it for financial documents — 10-Ks, earnings reports, balance sheets, fund factsheets — a missed chunk is a liability. The difference between "revenue grew 12%" and "revenue grew 1.2%" is not a semantic edge case. It is a wrong decision.

This post breaks down the production-grade search architecture that actually works for financial PDFs, layer by layer.

The Wrong Answer (And Why Everyone Starts There)

The most common mistake is treating financial RAG like document Q&A:

Parse PDF with PyPDF2
Chunk text at 512 tokens
Embed with text-embedding-ada-002
Cosine similarity search
Pass top-3 to LLM

This fails in financial contexts for four compounding reasons:

Tables become garbage. PyPDF2 linearizes table rows into undifferentiated text. A row like "Revenue $4.2B Q3 2024" ends up spread across three chunks, each missing the context that gives the number meaning.
Embeddings blur time. A query for "Q3 2024 revenue" semantically matches Q2 and Q1 chunks because embeddings don't distinguish fiscal periods with precision.
Exact numbers need exact matching. Semantic search will not reliably surface "$4.2B" when the embedding space treats numeric strings as weak signals.
No re-ranking means the wrong chunk wins. Cosine similarity returns the most related chunk — not the most correct one.
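
To make the first failure concrete, here is a minimal sketch of fixed-size chunking applied to one linearized table row. The row text and the three-token chunk size are illustrative assumptions, chosen small so the split is visible. Real pipelines use hundreds of tokens, but the boundary problem is the same.

```python
def fixed_size_chunks(tokens: list[str], size: int) -> list[str]:
    """Split a token list into consecutive chunks of `size` tokens, ignoring structure."""
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

# Hypothetical linearized table row, as a naive PDF parser might emit it.
row = "Q3 2024 Revenue $4.2B up 12% year over year"
chunks = fixed_size_chunks(row.split(), size=3)

for chunk in chunks:
    print(repr(chunk))
```

The figure lands in a chunk that carries neither the period nor the metric name, so no retriever can tie "$4.2B" back to "Q3 2024 Revenue".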

The fix is not a better embedding model. It is a layered retrieval architecture.

The Architecture: Five Layers That Each Catch What the Others Miss

PDF → [Chunking Strategy] → [Dual Index] → [Metadata-Filtered Hybrid Search] → [RRF Fusion] → [Re-ranking] → LLM

Every layer has a specific job. Removing any one of them creates a failure mode you will hit in production.

Layer 1: Chunking Strategy

The most underestimated part of the entire pipeline.

Most engineers treat chunking as a preprocessing detail. It is actually the most consequential architectural decision, because no retrieval method can find a number that was split across chunks at ingestion time.

Tables: Extract, Don't Parse

Never chunk tables as raw text. Use pdfplumber or camelot to extract them as structured data. Each table row becomes an atomic chunk with its column headers preserved.

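import pdfplumber

def extract_table_chunks(pdf_path: str) -> list[dict]:
    """Extract each table row as its own chunk, pairing every cell with its column header."""
    chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                if len(table) < 2:
                    continue  # need a header row plus at least one data row
                # Cells can come back as None; normalize them to strings
                headers = [str(h or "").strip() for h in table[0]]
                for row in table[1:]:
                    values = [str(v or "").strip() for v in row]
                    chunks.append({
                        # Produces e.g. "Period: Q3 2024 | Metric: Revenue | Value: $4.2B"
                        "text": " | ".join(f"{h}: {v}" for h, v in zip(headers, values)),
                        "type": "table_row",
                        "page": page_num,
                        "structured": dict(zip(headers, values)),
                    })
    return chunks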

A chunk that says "Period: Q3 2024 | Metric: Revenue | Value: $4.2B" is retrievable. A chunk that says "2024 4.2 3.8 12% growth operating" is not.

Prose: Sentence-Window Chunking with Heavy Overlap

For narrative text — MD&A sections, footnotes, risk factors — use sentence-aware chunking with 50% overlap. Financial context frequently spans two or three sentences. A hard boundary at 512 tokens will regularly cut a number from its qualifier.

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=384,
    chunk_overlap=192,   # 50% overlap — not optional
)

Parent-Child Indexing

Index small chunks (128 tokens) for precision retrieval, but return their parent chunk (512 tokens) to the LLM for full context. A match on a precise sentence should surface the surrounding paragraph.
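
A minimal in-memory sketch of that parent-child mapping follows. The `ChildChunk` class, the helper names, and the use of whitespace tokens in place of model tokens are all assumptions made for illustration; a real deployment would store the `parent_id` as metadata in the vector database.

```python
from dataclasses import dataclass

@dataclass
class ChildChunk:
    child_id: str
    parent_id: str
    text: str

def build_children(parents: dict[str, str], child_size: int) -> list[ChildChunk]:
    """Split each parent into small children that remember which parent they came from."""
    children = []
    for parent_id, text in parents.items():
        tokens = text.split()  # whitespace tokens stand in for model tokens
        for n, start in enumerate(range(0, len(tokens), child_size)):
            children.append(ChildChunk(
                child_id=f"{parent_id}#{n}",
                parent_id=parent_id,
                text=" ".join(tokens[start:start + child_size]),
            ))
    return children

def expand_to_parents(hits: list[ChildChunk], parents: dict[str, str]) -> list[str]:
    """Map matched children back to parent text, de-duplicated, preserving hit order."""
    seen: set[str] = set()
    expanded = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            expanded.append(parents[hit.parent_id])
    return expanded

# Hypothetical parent paragraphs keyed by ID.
parents = {
    "mdna_1": "Revenue for Q3 2024 was $4.2B. This figure excludes discontinued operations.",
    "mdna_2": "Operating margin was flat.",
}
children = build_children(parents, child_size=4)

# Two children of the same parent match; the LLM should see that parent once.
context = expand_to_parents([children[1], children[0]], parents)
print(context)
```

De-duplicating on `parent_id` matters here: several small matches inside one paragraph should not crowd other paragraphs out of the context window.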

Layer 2: Dual Index — Dense + Sparse

This is the core of hybrid search. You maintain two separate indexes and query both in parallel.

Dense Index: Vector Search for Semantic Understanding

Vector search understands that "net income" and "profit after tax" are the same thing. It handles paraphrased queries and conceptual questions well.

from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text
    )
    return response.data[0].embedding

Model recommendation: text-embedding-3-large for general use. For a dedicated financial deployment, embeddings fine-tuned on financial corpora (FinBERT-based models, for example) can outperform general models on domain-specific queries, so benchmark both on your own query set before committing.

Vector DB recommendation: Weaviate (built-in hybrid search), Pinecone, or pgvector if you are already on Postgres.
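
For intuition about what the vector database does with those embeddings, here is a brute-force cosine search sketched in plain Python. The function names and the toy three-dimensional vectors are assumptions for illustration only; real embeddings have thousands of dimensions and a real index uses approximate nearest-neighbor search rather than a full scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cosine_top_k(query_vec: list[float], vectors: dict[str, list[float]], top_k: int = 20) -> list[str]:
    """Return doc IDs ordered by similarity to the query vector, best first."""
    ranked = sorted(vectors, key=lambda doc_id: cosine(query_vec, vectors[doc_id]), reverse=True)
    return ranked[:top_k]

# Toy vectors standing in for real embeddings (hypothetical doc IDs).
vectors = {
    "net_income_row": [0.9, 0.1, 0.0],
    "risk_factors":   [0.0, 0.2, 0.9],
    "pat_footnote":   [0.8, 0.3, 0.1],
}
top = cosine_top_k([1.0, 0.2, 0.0], vectors, top_k=2)
print(top)
```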

Sparse Index: BM25 for Exact Number Matching

BM25 is term-frequency-based keyword search. It does one thing that embeddings cannot: it matches exact strings reliably. "$4.2B", "EBITDA", "FY2024", "AAPL" — these are all better served by BM25 than by cosine similarity.

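from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    # Lowercase and strip surrounding punctuation so "$4.2B," and "$4.2b" match;
    # "$", "%" and inner "." are kept so figures survive as single tokens
    tokens = (t.strip(".,;:()[]\"'") for t in text.lower().split())
    return [t for t in tokens if t]

def build_bm25_index(chunks: list[str]) -> BM25Okapi:
    return BM25Okapi([tokenize(chunk) for chunk in chunks])

def bm25_search(index: BM25Okapi, query: str, top_k: int = 20) -> list[int]:
    # Returns positions in the original `chunks` list, best match first
    scores = index.get_scores(tokenize(query))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]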

You need both. Dense catches paraphrasing. Sparse catches exact numbers. Neither alone is sufficient.
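
To show why BM25 rewards an exact figure, here is a from-scratch sketch of the scoring formula. It uses the common Okapi form with the "+1" inside the IDF logarithm, which keeps scores non-negative; this is an assumption and differs slightly from how `rank_bm25` handles very common terms. The three sample chunks are invented, and the function assumes a non-empty corpus.

```python
import math
from collections import Counter

def bm25_scores(query_tokens: list[str], docs_tokens: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms (low df) get a high IDF; terms in every doc get almost none.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / length_norm
        scores.append(score)
    return scores

docs = [
    "Period: Q3 2024 | Metric: Revenue | Value: $4.2B",
    "Period: Q3 2023 | Metric: Revenue | Value: $3.8B",
    "Revenue growth was driven by services",
]
scores = bm25_scores("Revenue $4.2B".split(), [d.split() for d in docs])
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])
```

"Revenue" appears in every chunk, so it contributes little. "$4.2B" appears in exactly one, so its IDF dominates and that chunk wins.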

Layer 3: Metadata Filtering

Filter before you search, not after.

Every chunk must carry structured metadata. You should never rely on semantic similarity to distinguish Q3 2024 from Q3 2023 — that distinction lives in metadata, not in embeddings.

metadata = {
    "doc_id":      "AAPL_10K_2024",
    "company":     "AAPL",
    "doc_type":    "10-K",
    "fiscal_year": 2024,
    "quarter":     "Q3",
    "section":     "income_statement",
    "chunk_type":  "table_row",   # or: "prose", "footnote", "header"
    "page":        12,
    "currency":    "USD",
}

Apply filters as a pre-search constraint, not a post-search filter:

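# Apply metadata filter BEFORE vector search
# (`vector_db` is a placeholder client; the filter uses Pinecone-style operators,
# so adapt the syntax to your database)
results = vector_db.query(
    vector=embed(query),
    filter={
        "fiscal_year": {"$eq": 2024},
        "section": {"$in": ["income_statement", "balance_sheet"]},
    },
    top_k=20
)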

This narrows the search space to relevant documents before any similarity computation happens. It is faster and more accurate than letting the retriever sort out document scope on its own.
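
The sparse side needs the same scoping, and an in-memory BM25 index has no filter argument of its own. One option, sketched below under the assumption that each chunk is a dict with a `metadata` field, is to apply the same filter in Python and build or query the BM25 index over only the chunks that pass. The `matches` and `prefilter` helpers are hypothetical and support just the two operators used above.

```python
def matches(metadata: dict, flt: dict) -> bool:
    """Return True if `metadata` satisfies every condition in `flt` ($eq and $in only)."""
    for field, condition in flt.items():
        value = metadata.get(field)
        for op, operand in condition.items():
            if op == "$eq":
                ok = value == operand
            elif op == "$in":
                ok = value in operand
            else:
                raise ValueError(f"Unsupported operator: {op}")
            if not ok:
                return False
    return True

def prefilter(chunks: list[dict], flt: dict) -> list[dict]:
    """Keep only chunks whose metadata passes the filter, before any scoring happens."""
    return [c for c in chunks if matches(c["metadata"], flt)]

# Invented chunks for illustration.
chunks = [
    {"text": "Revenue | Q3 2024 | $4.2B", "metadata": {"fiscal_year": 2024, "section": "income_statement"}},
    {"text": "Revenue | Q3 2023 | $3.8B", "metadata": {"fiscal_year": 2023, "section": "income_statement"}},
    {"text": "Risk factors overview",     "metadata": {"fiscal_year": 2024, "section": "risk_factors"}},
]
flt = {
    "fiscal_year": {"$eq": 2024},
    "section": {"$in": ["income_statement", "balance_sheet"]},
}
in_scope = prefilter(chunks, flt)
print([c["text"] for c in in_scope])
```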

Layer 4: RRF Fusion

Reciprocal Rank Fusion merges the ranked result lists from your dense and sparse indexes into a single unified ranking.

Why Not Weighted Score Averaging?

Because BM25 scores and cosine similarity scores live on incompatible scales. Normalizing them requires tuning that is brittle across different query types and document distributions. RRF is rank-based — it doesn't care about score magnitudes, only about positions in each ranking.

def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60
) -> list[str]:
    """
    rankings: list of ranked doc ID lists (one per retrieval method)
    k: smoothing constant — 60 is the standard default
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.keys(), key=lambda d: scores[d], reverse=True)


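# Usage
# `chunk_ids` (list position -> doc ID) and `chunk_store` (doc ID -> chunk dict)
# are assumed lookup tables built at ingestion time.
dense_results  = dense_search(query, top_k=20)   # list of doc IDs
sparse_results = [chunk_ids[i] for i in bm25_search(index, query, top_k=20)]  # BM25 returns list positions

fused = reciprocal_rank_fusion([dense_results, sparse_results])
top_20_candidates = [chunk_store[doc_id] for doc_id in fused[:20]]  # chunk dicts for the re-ranker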

A document that ranks 3rd in dense search and 5th in BM25 will score higher than one that ranks 1st in dense but doesn't appear in BM25 at all. This is exactly the right behavior for financial queries where you want agreement across retrieval methods.
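
The arithmetic behind that claim is short enough to check directly. This sketch uses 1-based positions, so a document at position p contributes 1 / (k + p), which matches the 1 / (k + rank + 1) term in the function above where rank is 0-based.

```python
K = 60  # same smoothing constant as the fusion function

def rrf_contribution(position: int, k: int = K) -> float:
    """Score contributed by a document at 1-based `position` in a single ranking."""
    return 1.0 / (k + position)

both_lists = rrf_contribution(3) + rrf_contribution(5)  # 3rd in dense, 5th in BM25
dense_only = rrf_contribution(1)                        # 1st in dense, absent from BM25

print(f"{both_lists:.4f} vs {dense_only:.4f}")  # 0.0313 vs 0.0164
```

Because each list can contribute at most 1/61, a document that shows up in only one list can never beat one that ranks reasonably well in both.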

Layer 5: Cross-Encoder Re-Ranking

This is where accuracy is made or lost.

Bi-encoders (the models used in vector search) embed your query and each document independently. They produce good general rankings but miss fine-grained numerical relevance — they have no mechanism to compare the query and document together at inference time.

A cross-encoder takes the query and document as a concatenated pair and scores them with full attention over both. It is more expensive, but you only run it on 20 candidates, not your entire corpus.

from sentence_transformers import CrossEncoder

# BAAI/bge-reranker-large is a strong open-source option; benchmark alternatives on your own data
reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(
    query: str,
    candidates: list[dict],
    top_k: int = 5
) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [doc for doc, _ in ranked[:top_k]]


final_context = rerank(query, top_20_candidates, top_k=5)

Pass only these 5 chunks to the LLM. You retrieved 20 with hybrid search to give the re-ranker enough signal. You return 5 to keep the context window clean and focused.
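
One way to wire the layers together is a single function that takes each stage as a callable, so any stage can be swapped or stubbed in tests. This is a sketch under assumptions: the `retrieve` name, its parameters, and the `chunk_store` lookup (doc ID to chunk dict) are all hypothetical, and the stubs below stand in for the real dense, sparse, and re-ranking components.

```python
from typing import Callable

def retrieve(
    query: str,
    dense_search: Callable[[str, int], list[str]],
    sparse_search: Callable[[str, int], list[str]],
    fuse: Callable[[list[list[str]]], list[str]],
    rerank: Callable[[str, list[dict], int], list[dict]],
    chunk_store: dict[str, dict],
    candidates_k: int = 20,
    final_k: int = 5,
) -> list[dict]:
    """Run both retrievers, fuse their rankings, then re-rank the fused candidates."""
    dense_ids = dense_search(query, candidates_k)
    sparse_ids = sparse_search(query, candidates_k)
    fused_ids = fuse([dense_ids, sparse_ids])[:candidates_k]
    candidates = [chunk_store[doc_id] for doc_id in fused_ids]
    return rerank(query, candidates, final_k)

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Compact Reciprocal Rank Fusion, used here only to exercise `retrieve`."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Stub components and data for illustration.
chunk_store = {cid: {"id": cid, "text": f"chunk {cid}"} for cid in ("a", "b", "c")}
result = retrieve(
    "Q3 2024 revenue",
    dense_search=lambda q, k: ["a", "b"],
    sparse_search=lambda q, k: ["b", "c"],
    fuse=rrf,
    rerank=lambda q, cands, k: cands[:k],  # identity re-ranker stub
    chunk_store=chunk_store,
    final_k=2,
)
print([c["id"] for c in result])
```

Chunk "b" comes first because it is the only one both retrievers returned.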

The Hallucination Guard

One final layer that belongs in every production financial RAG system: verify that every number in the LLM response exists in the retrieved chunks before returning it to the user.

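import re

class GroundingError(Exception):
    pass

# A number must start with a digit (so stray commas never match); it may carry
# thousands separators, decimals, and a B/M/K magnitude suffix or a percent sign.
NUMBER_PATTERN = re.compile(r'\$?\d[\d,]*(?:\.\d+)?\s?(?:[BMKbmk]\b|%)?')

def _normalize(token: str) -> str:
    # Drop "$", commas, and whitespace so "$4,200" and "4200 " compare equal
    return re.sub(r'[$,\s]', '', token).lower()

def verify_grounding(answer: str, source_chunks: list[str]) -> bool:
    """
    Raise if any number in the answer cannot be found in the source chunks.
    """
    answer_numbers = {_normalize(n) for n in NUMBER_PATTERN.findall(answer)}
    source_text    = " ".join(source_chunks)
    source_numbers = {_normalize(n) for n in NUMBER_PATTERN.findall(source_text)}

    ungrounded = answer_numbers - source_numbers
    if ungrounded:
        raise GroundingError(
            f"Answer contains numbers not found in sources: {ungrounded}"
        )
    return True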

This is not a complete solution — a sufficiently clever hallucination can still slip through — but it catches the most common failure mode: the LLM extrapolating or interpolating a number that was never in the retrieved context.
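
One gap worth knowing about: a string comparison treats "$4.2B", "4.2 billion", and "4,200 million" as three different numbers. The sketch below is one possible extension, not part of the guard above. It converts amounts to floats before comparing. The unit table, the regex, and the helper names are assumptions, and it only understands the suffixes listed in `_MULTIPLIERS`.

```python
import math
import re

_MULTIPLIERS = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "million": 1e6,
    "b": 1e9, "billion": 1e9,
}
_AMOUNT = re.compile(
    r'\$?(\d[\d,]*(?:\.\d+)?)\s?(billion|million|thousand|[bmk])?\b',
    re.IGNORECASE,
)

def parse_amounts(text: str) -> list[float]:
    """Extract numbers from text, scaling by any recognized magnitude suffix."""
    amounts = []
    for digits, unit in _AMOUNT.findall(text):
        value = float(digits.replace(",", ""))
        amounts.append(value * _MULTIPLIERS.get(unit.lower(), 1.0))
    return amounts

def is_grounded(answer: str, source_chunks: list[str], rel_tol: float = 1e-9) -> bool:
    """True if every amount in the answer matches some amount in the sources."""
    source_amounts = parse_amounts(" ".join(source_chunks))
    return all(
        any(math.isclose(a, s, rel_tol=rel_tol) for s in source_amounts)
        for a in parse_amounts(answer)
    )

sources = ["Period: Q3 2024 | Metric: Revenue | Value: $4.2B"]
print(is_grounded("Q3 2024 revenue was 4.2 billion", sources))  # same value, different notation
print(is_grounded("Q3 2024 revenue was $4.5B", sources))        # value not in sources
```

Keeping `rel_tol` tight is deliberate: loosening it would let a rounded or slightly wrong figure pass as grounded.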

Full Stack Reference

Component	Recommendation	Notes
PDF parsing	`pdfplumber` + `camelot`	One for text, one for tables
Chunking	LlamaIndex `SentenceSplitter` + custom table chunker	Never use character-based splitting
Vector DB	Weaviate or Pinecone	Weaviate has native hybrid search
Embeddings	`text-embedding-3-large` or FinBERT	Domain-tuned models can win on financial corpora; benchmark first
Sparse search	Elasticsearch / OpenSearch or `rank_bm25`	BM25 is non-negotiable for financials
Re-ranker	`BAAI/bge-reranker-large`	Strong open-source cross-encoder; compare against alternatives
Orchestration	LlamaIndex or LangChain + custom fusion	Write your own RRF — it's 10 lines
Observability	LangSmith or Arize Phoenix	Trace every retrieval decision

Why Each Method Alone Fails

Method	What it misses
Dense only	Blurs fiscal periods; misses exact figures; "Q3 2024" ≈ "Q3 2023" in embedding space
BM25 only	No semantic generalization; "net income" won't match "profit after tax"
No table extraction	Splits column-row relationships; numbers lose their context labels
No re-ranking	Cosine similarity returns most-related, not most-correct
No grounding check	LLM interpolates numbers not in context; caught too late

The One-Line Answer

If you're in an interview and need to distill this:

"Hybrid BM25 + dense retrieval fused via RRF, with metadata pre-filtering, table-aware chunking, cross-encoder re-ranking, and a post-generation grounding verifier, because in financial RAG, missing one number isn't a UX issue, it's a liability."

That sentence contains every architectural decision that matters. Each word is load-bearing.

Closing Thought

The reason financial RAG is hard is not that language models are bad at numbers. It is that retrieval accuracy at the 99th percentile requires five independent systems working in concert, each designed to catch what the others miss.

General RAG tolerates a miss. Financial RAG does not.

Build accordingly.

Written by an engineer who has thought too hard about why cosine similarity fails on fiscal quarters.
