Improving citation accuracy and reducing hallucinations in custom Parent-Child RAG pipeline (Gemma3:4B + FAISS+BM25 + Cross-encoder reranker)

Hi everyone,

I’m building a local study assistant for university textbooks (mainly PDFs) using a fairly sophisticated RAG stack, but I’m struggling with two persistent issues that significantly hurt user experience:

  1. Wrong / inconsistent page citations — The model often cites pages that don’t actually contain the claimed information, or the right sidebar shows different pages than what the model referenced in the answer.

  2. Occasional hallucinations + repetition — Sometimes the model starts repeating words/phrases mid-sentence or adds plausible but ungrounded information.

My current architecture:

  • Document processing: MinerU (quality mode) + PyMuPDF (fast mode) → Markdown with markers

  • Chunking: Custom ParentChildChunker using MarkdownHeaderTextSplitter + RecursiveCharacterTextSplitter

    • Parents: larger sections (~300-2400 chars)

    • Children: ~500 char chunks with overlap for retrieval

  • Vector store: FAISS (multilingual-e5-base) + BM25 hybrid with RRF fusion

  • Reranking: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

  • Context building: Retrieve → rerank → parent expansion (using ParentStore) → limited to ~9000 chars

  • Generation: LangGraph pipeline (rewrite → retrieve → rerank → expand → generate) with gemma3:4b (Ollama), temp=0.0-0.1, repeat_penalty=1.15

Main problems I see:

  • Parent vs Child mismatch: When I expand to parents for better context, the source_docs passed to the UI still come from child chunks → citation filtering fails or shows wrong pages.

Questions:

  1. Where is the biggest weakness in this setup — chunking strategy, parent expansion logic, citation post-processing, or the prompt itself?

Any insights, similar experiences, or suggested improvements would be greatly appreciated. I’m happy to share the full Python files that contain the logic (document processor.py, rag_graph.py, vector_store.py).

Hey @IchNarA, your diagnosis looks right, and the two biggest weak points are the parent expansion logic and the generation setup.

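For citations, when you expand to parent chunks, the link between the text the model reads and the child’s page metadata gets lost, so the pages in the answer and the pages in the sidebar drift apart. The fix is to keep the child chunk’s metadata as the single citation reference and only use the parent text for context in generation.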
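Here is a rough sketch of that idea, not your actual code. I'm assuming each reranked child carries `parent_id` and `page` in its metadata and that your ParentStore can be read like a dict from parent id to text. `Chunk` and `expand_with_citations` are made-up names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved child chunk."""
    text: str
    metadata: dict = field(default_factory=dict)


def expand_with_citations(children, parent_store):
    """Build context from parents, but keep citations tied to the child hits.

    children: reranked child chunks, best first (assumed to carry
              'parent_id' and 'page' in metadata).
    parent_store: mapping of parent_id -> parent text (assumed dict-like).
    """
    # Collect the pages of every child that pointed at a given parent.
    pages_by_parent = {}
    for child in children:
        pid = child.metadata["parent_id"]
        pages_by_parent.setdefault(pid, []).append(child.metadata["page"])

    # One context block per parent, labelled with the child pages only.
    context_blocks = []
    for pid, pages in pages_by_parent.items():
        label = ", ".join(str(p) for p in sorted(set(pages)))
        context_blocks.append(f"[pages: {label}]\n{parent_store[pid]}")

    # The UI gets exactly the children that produced the labels above.
    source_docs = list(children)
    return context_blocks, source_docs
```

Because the page labels in the prompt and the `source_docs` sent to the sidebar come from the same child list, they can't disagree with each other, and any page the model cites can be checked against that one list.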

For hallucinations and repetition, 9000 chars is a lot for a 4B model. Try cutting context down to around 4000-5000 chars and add a prompt escape hatch like “if the answer isn’t in the provided text, say so”. At temp=0, hallucinations usually mean the retrieved context doesn’t actually contain the answer and the model fills the gap.
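Something along these lines could cover both the smaller context and the escape hatch. It's only a sketch: the 4500 budget is just a value inside the range above, the prompt wording is one possible phrasing, and `build_context` / `build_prompt` are hypothetical helper names rather than anything from LangGraph or Ollama.

```python
MAX_CONTEXT_CHARS = 4500  # assumed budget, inside the 4000-5000 range

# Example wording only; adjust to your own citation format.
PROMPT_TEMPLATE = (
    "Answer using only the context below and cite pages as [p. N].\n"
    "If the answer isn't in the provided text, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_context(blocks, max_chars=MAX_CONTEXT_CHARS):
    """Keep whole blocks, best first, until the character budget is used."""
    kept, used = [], 0
    for block in blocks:
        sep = 2 if kept else 0  # length of the "\n\n" separator
        remaining = max_chars - used - sep
        if len(block) > remaining:
            if not kept:
                # A single oversized block is truncated rather than dropped.
                kept.append(block[:max_chars])
            break
        kept.append(block)
        used += sep + len(block)
    return "\n\n".join(kept)


def build_prompt(question, blocks):
    """Fill the template with the trimmed context and the user question."""
    return PROMPT_TEMPLATE.format(
        context=build_context(blocks), question=question
    )
```

Dropping whole blocks from the bottom of the ranking, instead of cutting the joined text at a fixed offset, keeps each remaining block intact, which should make it less likely that the model tries to complete a sentence that was cut in half.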

Your chunking strategy and retrieval setup look solid overall!