Improving citation accuracy and reducing hallucinations in custom Parent-Child RAG pipeline (Gemma3:4B + FAISS+BM25 + Cross-encoder reranker)

IchNarA · May 4, 2026, 5:38pm

Hi everyone,

I’m building a local study assistant for university textbooks (mainly PDFs) using a fairly sophisticated RAG stack, but I’m struggling with two persistent issues that significantly hurt user experience:

1. Wrong / inconsistent page citations: the model often cites pages that don’t actually contain the claimed information, or the right sidebar shows different pages than what the model referenced in the answer.
2. Occasional hallucinations + repetition: sometimes the model starts repeating words/phrases mid-sentence or adds plausible but ungrounded information.

My current architecture:

Document processing: MinerU (quality mode) + PyMuPDF (fast mode) → Markdown with markers
Chunking: Custom ParentChildChunker using MarkdownHeaderTextSplitter + RecursiveCharacterTextSplitter
- Parents: larger sections (~300-2400 chars)
- Children: ~500 char chunks with overlap for retrieval
Vector store: FAISS (multilingual-e5-base) + BM25 hybrid with RRF fusion
Reranking: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
Context building: Retrieve → rerank → parent expansion (using ParentStore) → limited to ~9000 chars
Generation: LangGraph pipeline (rewrite → retrieve → rerank → expand → generate) with gemma3:4b (Ollama), temp=0.0-0.1, repeat_penalty=1.15

Main problems I see:
Parent vs Child mismatch: When I expand to parents for better context, the source_docs passed to the UI still come from child chunks → citation filtering fails or shows wrong pages.

Questions:

Where is the biggest weakness in this setup — chunking strategy, parent expansion logic, citation post-processing, or the prompt itself?

Any insights, similar experiences, or suggested improvements would be greatly appreciated. I’m happy to share the full Python files that contain the logic (document processor.py, rag_graph.py, vector_store.py).

simon-langchain · May 5, 2026, 3:35pm

Hey @IchNarA, your diagnosis looks right, and the two biggest weak points are the parent expansion logic and the generation setup.

For citations, when you expand to parent chunks the child’s page metadata gets lost. The fix is to keep the child chunk’s metadata as the citation reference and only use the parent text for context in generation.

For hallucinations and repetition, 9000 chars is a lot for a 4B model. Try cutting context down to around 4000-5000 chars and add a prompt escape hatch like “if the answer isn’t in the provided text, say so”. At temp=0, hallucinations usually mean the retrieved context doesn’t actually contain the answer and the model fills the gap.
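A minimal sketch of the two suggestions above (a tighter context budget plus an escape hatch in the prompt). The names `build_context`, `build_prompt`, and `PROMPT_TEMPLATE` are hypothetical, the 4500-char default is just a midpoint of the suggested range, and the chunks are assumed to arrive already sorted best-first by the reranker:

```python
def build_context(chunks, max_chars=4500):
    """Pack whole chunks, best-ranked first, until the character budget is hit.

    The budget counts chunk text only, not the blank-line separators.
    """
    picked, used = [], 0
    for text in chunks:  # assumed sorted by reranker score, best first
        remaining = max_chars - used
        if len(text) > remaining:
            if not picked:
                # Never return an empty context: truncate the top chunk instead.
                picked.append(text[:remaining])
            break
        picked.append(text)
        used += len(text)
    return "\n\n".join(picked)


# Hypothetical prompt wording; the escape-hatch sentence is the important part.
PROMPT_TEMPLATE = (
    "Answer using only the text below.\n"
    "If the answer isn't in the provided text, say so.\n\n"
    "Text:\n{context}\n\nQuestion: {question}\n"
)


def build_prompt(chunks, question, max_chars=4500):
    """Combine the budgeted context and the question into one prompt string."""
    return PROMPT_TEMPLATE.format(
        context=build_context(chunks, max_chars), question=question
    )
```

Packing whole chunks rather than slicing the joined string avoids cutting a chunk mid-sentence, which is one more way a small model can end up with half a fact.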

Your chunking strategy and retrieval setup look solid overall!

RAGPrep · June 1, 2026, 8:10pm

The Parent-Child setup with FAISS+BM25+cross-encoder reranking is a solid architecture and your component choices are right. When citation accuracy is degrading despite a well-built retrieval stack, the failure usually isn’t in retrieval — it’s in the relationship between what’s retrieved and what the small model can actually ground against.

A few specific things worth checking with Gemma3:4B as the generator:

Parent chunks are likely too large for a 4B model to ground against reliably. The Parent-Child pattern works by retrieving small precise children and expanding to larger parents for context. But a 4B model has a much harder time maintaining citation discipline across a long parent chunk than a 70B+ model does. The grounding fidelity drops as parent length grows. If your parents are 2000+ tokens, try halving them and see if citation accuracy improves before changing anything else in the pipeline.

Cross-encoder reranking gives you precision but not faithfulness. Reranking surfaces the most query-relevant chunks. It doesn’t filter out chunks that are relevant-looking but informationally incomplete. A chunk can rank high on relevance and still lack the specific fact the model needs to ground a citation. The model then synthesises plausible-sounding text that reads grounded but isn’t.

Chunk quality at ingestion compounds the problem with small models. Large models often paper over chunk quality issues by inferring missing context. Small models can’t. A chunk that splits a key fact across a boundary, or that includes boilerplate that dilutes the actual signal, will produce a fabricated citation more reliably on a 4B model than on a 70B model. The same corpus that’s “fine” on a larger model will hallucinate visibly on Gemma3:4B.

The diagnostic I’d run before tuning further:
1. Pull 50 cases where the citation was wrong or fabricated.
2. Look at the chunks that were actually in context when the model generated the bad citation.
3. Categorise the failure: was the right chunk retrieved but not used (generation problem), was a wrong chunk retrieved (retrieval problem), or was a chunk retrieved that contained the topic but not the specific fact (chunk quality problem)?
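The categorisation step could be tallied with something like the sketch below. It assumes each failing case has been hand-annotated with the fact that should have been cited and an on-topic judgement per chunk; `categorise_failure`, `tally_failures`, and the case layout are hypothetical, and the substring match is a deliberately crude stand-in for a human check:

```python
from collections import Counter


def categorise_failure(fact, context_chunks):
    """Label one bad-citation case.

    fact: short string a reviewer marked as the text that should have been cited.
    context_chunks: list of (text, on_topic) pairs that were in the model's
        context, where on_topic is the reviewer's judgement for that chunk.
    """
    if any(fact.lower() in text.lower() for text, _ in context_chunks):
        return "generation"     # right chunk was in context but not used
    if any(on_topic for _, on_topic in context_chunks):
        return "chunk_quality"  # topic present, specific fact missing
    return "retrieval"          # nothing relevant was retrieved


def tally_failures(cases):
    """Count categories over a list of (fact, context_chunks) cases."""
    return Counter(categorise_failure(fact, chunks) for fact, chunks in cases)
```

Running `tally_failures` over the 50 cases gives a rough split that shows which stage deserves attention first.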

In my experience the third category is the largest and the least addressed. People tune retrieval and reranking endlessly when the actual fix is to either re-chunk the source documents at finer granularity or to enrich chunks with structural metadata (section headers, source document, paragraph context) so the model has anchors to ground against.

On the Parent-Child specifically: consider adding source attribution metadata to each chunk at ingestion (document ID, page/section, last-modified timestamp) and including that metadata in the context passed to the generator. Small models follow structural prompts much more reliably than they follow implicit grounding. “Generate the answer and cite only from the chunks below, using the document_id field” outperforms “ground your answer in the retrieved chunks” by a wide margin on sub-7B models.
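As a rough illustration of passing attribution metadata to the generator, the sketch below renders each chunk with a bracketed header. The dict layout, the header format, and the names `format_chunk`, `format_context`, and `INSTRUCTION` are assumptions, not something taken from the original pipeline:

```python
# Hypothetical instruction wording, modelled on the structural prompt above.
INSTRUCTION = (
    "Generate the answer and cite only from the chunks below, "
    "using the document_id and page fields."
)


def format_chunk(chunk):
    """Render one chunk with its attribution fields ahead of the text."""
    header = "[document_id={} | page={} | section={}]".format(
        chunk["document_id"], chunk["page"], chunk.get("section", "n/a")
    )
    return header + "\n" + chunk["text"]


def format_context(chunks):
    """Join the instruction and all formatted chunks into one context string."""
    body = "\n\n".join(format_chunk(c) for c in chunks)
    return INSTRUCTION + "\n\n" + body
```

Because every fact now sits directly under the fields it should be cited with, the model has something concrete to copy rather than a page number to infer.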

VLSiddarth · June 15, 2026, 1:38pm

Hey @IchNarA,

Simon nailed the core citation fix — keep child metadata as the citation anchor, use parent text only for generation context. Let me add a few specific things from having debugged this exact stack.

On the Parent-Child mismatch specifically:

The problem is happening in your expand step. When you retrieve child chunks and then fetch parents from ParentStore, you’re passing source_docs from the parent objects downstream, but the parent object has broader page metadata (or none). The fix is to build each expanded document from two sources, the parent’s text and the child’s metadata:

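```python
# Assumes the pipeline uses LangChain's Document class
from langchain_core.documents import Document


def expand_with_citation_anchor(child_docs, parent_store):
    """Use parent text for context but keep child metadata for citations."""
    results = []
    for child in child_docs:
        parent = parent_store.get(child.metadata.get("parent_id"))
        # Fall back to the child's own text if the parent is missing
        text = parent.page_content if parent is not None else child.page_content
        results.append(Document(
            page_content=text,  # parent TEXT is the generation context
            metadata={
                **child.metadata,  # child page/location stays the source of truth
                "citation_page": child.metadata.get("page"),
                "context_source": "parent_expanded" if parent is not None else "child_only",
            },
        ))
    return results
```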

Then your prompt only ever references citation_page from metadata, not whatever page the parent spans.
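One side effect worth guarding against: when several retrieved children share a parent, expanding each child separately puts the same parent text into the context more than once. The sketch below merges them into one entry per parent and keeps every child page as a citation anchor. `Doc` is a hypothetical stand-in for the pipeline's document type, and `parent_texts` is a plain dict standing in for ParentStore:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for the pipeline's document type."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def expand_deduplicated(child_docs, parent_texts):
    """Return one Doc per parent, in first-seen (rank) order.

    Each Doc carries the parent text plus the pages of all children
    that pointed at it, so citations still come from child metadata.
    """
    merged = {}  # parent_id -> Doc; dicts preserve insertion order
    for child in child_docs:
        pid = child.metadata["parent_id"]
        page = child.metadata["page"]
        if pid not in merged:
            merged[pid] = Doc(
                parent_texts[pid],
                {"parent_id": pid, "citation_pages": [page]},
            )
        elif page not in merged[pid].metadata["citation_pages"]:
            merged[pid].metadata["citation_pages"].append(page)
    return list(merged.values())
```

The same `citation_pages` list can feed both the prompt and the sidebar, which keeps the two from disagreeing.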

On the hallucinations + repetition with gemma3:4b:

Simon is right that 9000 chars is too much for a 4B model, but there’s a more specific issue: the order of your context matters as much as the length. In my experience, Gemma3 at 4B has a recency bias and an attention cliff around 3500 tokens. Anything past that gets attended to poorly.

Two things that actually move the needle:

1. Put the most relevant child chunk first, not last. Your RRF fusion and reranker scores already give you a ranking, so use it. The top-ranked chunk should be at position 0 in your context string, not buried.
2. Add a staleness signal before generation. This is the less obvious one. Hallucinations in study assistant RAG usually come from one of two sources: context mismatch (Simon’s point) OR the retrieved chunks are outdated relative to the document version being studied. If you indexed v1 of a textbook and a student has v2, the page numbers shift and the model produces “citations” that point at real content, just from the wrong edition.
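A simple local staleness signal could be a content hash stored on each chunk at index time and compared against the file currently being studied. This is only a sketch: the `source_sha256` metadata key and both function names are hypothetical, and chunks are shown as plain dicts:

```python
import hashlib


def fingerprint(pdf_bytes):
    """Return a SHA-256 hex digest identifying one exact version of a file."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def split_fresh_and_stale(chunks, current_fp):
    """Separate chunks indexed from the current file version from the rest.

    chunks: list of dicts with a 'source_sha256' key written at index time
        (hypothetical field name).
    """
    fresh = [c for c in chunks if c.get("source_sha256") == current_fp]
    stale = [c for c in chunks if c.get("source_sha256") != current_fp]
    return fresh, stale
```

If `stale` is non-empty, the pipeline can drop those chunks or re-index before generation instead of citing pages from the wrong edition.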

On your chunking:

Your 300-2400 char parent range is wide. A 2400 char parent is ~600 tokens — that’s a lot of context for one “section.” If you’re studying textbooks, chapter sections tend to be more coherent units. Try capping parents at ~800 chars and see if citation precision improves. Smaller parents mean less ambiguity about which page a fact came from.
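To try the smaller cap without touching the rest of the chunker, one option is to re-split oversized parents at paragraph boundaries. The sketch below is a hypothetical helper, not the ParentChildChunker's own logic, and it assumes paragraphs are separated by blank lines; a single paragraph longer than the cap is kept whole rather than cut mid-sentence:

```python
def cap_parent(text, max_chars=800):
    """Split a parent into pieces of at most max_chars, on paragraph breaks.

    Paragraphs are packed greedily. A single paragraph that exceeds
    max_chars on its own is kept intact as one piece.
    """
    pieces, current = [], ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if current and len(candidate) > max_chars:
            pieces.append(current)  # close the current piece
            current = para          # start a new one with this paragraph
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```

Each resulting piece would still need its own page metadata assigned from the text it actually covers, otherwise the narrower parents inherit the same broad page range as before.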

The weakness hierarchy in your specific setup:

1. Parent expansion citation propagation: biggest issue, fix this first (code above)
2. Context length for 4B model: cut to 4000 chars max, order by relevance score
3. Document versioning: if your textbooks ever get re-indexed, stale chunks silently corrupt citations

For point 3, if you want a turnkey freshness signal on your retrieved chunks, Knowledge Universe API computes a decay score per source that flags exactly this — useful if you’re indexing sources beyond your local PDFs.

Happy to look at your rag_graph.py if you share it — the LangGraph rewrite → retrieve → generate sequence has a specific failure mode when the rewrite step changes terminology and then the retrieval misses parent chunks that would have matched the original query.
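One possible guard against that rewrite failure mode is to retrieve with both the original and the rewritten query and fuse the two rankings with reciprocal rank fusion, the same technique the pipeline already uses for FAISS + BM25. In the sketch below `rrf_merge` is a hypothetical helper, the rankings are plain lists of chunk IDs, and k=60 is the commonly used default rather than a tuned value:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of chunk IDs.

    Each chunk scores 1 / (k + rank) per list it appears in (rank starts
    at 1); scores are summed, and ties are broken by chunk ID so the
    output order is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Usage: chunks found by both queries rise to the top.
original_hits = ["c1", "c2", "c3"]    # retrieved with the user's wording
rewritten_hits = ["c4", "c2", "c5"]   # retrieved with the rewritten query
merged = rrf_merge([original_hits, rewritten_hits])
```

Chunks that only the original wording matches stay in the candidate set, so the reranker still gets a chance to score them.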

Tanishq1030 · June 15, 2026, 2:20pm

Hey @IchNarA, @simon-langchain

The Parent-Child + hybrid retrieval + reranker setup is quite advanced, yet citation hallucinations on small models like Gemma3:4B are still very common.

I’ve seen similar issues in production RAG systems. A few observations from my own experiments:

Parent expansion is often the silent killer → Even if you keep the child metadata for citation, the model gets confused when the context suddenly jumps to a much larger parent chunk. The 4B model struggles to reliably map which specific sentences support the claim.
Metadata anchoring helps more than expected → Adding explicit structural metadata (section title, page number, paragraph ID, document title) in the prompt makes a surprising difference with smaller models. They follow “Cite only using the page field below” much better than vague grounding instructions.
Context length vs faithfulness trade-off → 9000 chars is definitely too much for Gemma3:4B. I’ve had better results capping it at ~4500-5500 chars and being stricter with reranking (e.g. only top 4-5 chunks after cross-encoder).
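The stricter reranking cut-off could be sketched as below. `select_top_chunks` is a hypothetical helper that takes (chunk_id, score) pairs from the cross-encoder; the optional `min_score` is an assumption too, and any threshold would need calibrating on your own data because cross-encoder score ranges vary by model:

```python
def select_top_chunks(scored_chunks, k=5, min_score=None):
    """Keep at most k chunks, best first, optionally dropping low scores.

    scored_chunks: list of (chunk_id, reranker_score) pairs in any order.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        ranked = [pair for pair in ranked if pair[1] >= min_score]
    return ranked[:k]
```

Returning the pairs best-first also means the most relevant chunk lands at the start of the context string when the result is joined in order.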

One thing I’m curious about from the community:

When you have a case where the correct chunk was retrieved and put in context, but the model still hallucinates a wrong page or fabricates details → what have people found most effective?

Better prompting / few-shot examples?
Post-generation citation verification step?
Finer chunking + better overlap strategy?
Switching to a stronger but still local model (e.g. Llama-3.1-8B or Qwen2.5-14B)?
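For reference, the post-generation citation verification option can be as small as the sketch below. It assumes the prompt asks the model to cite in a fixed `[page N]` format, which is a made-up convention here, and it only checks that each cited page belongs to a chunk that was actually in context, not that the page supports the specific claim:

```python
import re

# Assumed citation format; adjust to whatever the prompt asks the model to emit.
CITATION_RE = re.compile(r"\[page (\d+)\]")


def check_citations(answer, allowed_pages):
    """Compare pages cited in the answer with pages present in the context.

    Returns the cited pages in order, the sorted unsupported ones, and an
    'ok' flag that is True only when at least one page is cited and all
    cited pages were in context.
    """
    cited = [int(p) for p in CITATION_RE.findall(answer)]
    allowed = set(allowed_pages)
    unsupported = sorted({p for p in cited if p not in allowed})
    return {
        "cited": cited,
        "unsupported": unsupported,
        "ok": bool(cited) and not unsupported,
    }
```

A failed check can trigger a regeneration, or simply strip the unsupported citation before the answer and the sidebar are rendered from the same verified list.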

Would love to hear what’s actually moving the needle for people running local RAG assistants with small models.
