Hi everyone,
I’m building a local study assistant for university textbooks (mainly PDFs) using a fairly sophisticated RAG stack, but I’m struggling with two persistent issues that significantly hurt user experience:
- Wrong / inconsistent page citations: The model often cites pages that don’t actually contain the claimed information, or the right sidebar shows different pages than what the model referenced in the answer.
- Occasional hallucinations + repetition: Sometimes the model starts repeating words/phrases mid-sentence or adds plausible but ungrounded information.
My current architecture:
- Document processing: MinerU (quality mode) + PyMuPDF (fast mode) → Markdown with markers
- Chunking: Custom ParentChildChunker using MarkdownHeaderTextSplitter + RecursiveCharacterTextSplitter
- Vector store: FAISS (multilingual-e5-base) + BM25 hybrid with RRF fusion
- Reranking: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
- Context building: Retrieve → rerank → parent expansion (using ParentStore) → limited to ~9000 chars
- Generation: LangGraph pipeline (rewrite → retrieve → rerank → expand → generate) with gemma3:4b (Ollama), temp=0.0-0.1, repeat_penalty=1.15
Main problems I see:
- Parent vs Child mismatch: When I expand to parents for better context, the source_docs passed to the UI still come from child chunks → citation filtering fails or shows wrong pages.
Questions:
- Where is the biggest weakness in this setup — chunking strategy, parent expansion logic, citation post-processing, or the prompt itself?
Any insights, similar experiences, or suggested improvements would be greatly appreciated. I’m happy to share the full Python files that contain the logic (document processor.py, rag_graph.py, vector_store.py).
Hey @IchNarA, your diagnosis looks right, and the two biggest weak points are the parent expansion logic and the generation setup.
For citations, when you expand to parent chunks the child’s page metadata gets lost. The fix is to keep the child chunk’s metadata as the citation reference and only use the parent text for context in generation.
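A minimal sketch of that idea, assuming each child chunk carries a page number and a parent ID. The names `Chunk`, `expand_to_parents` and `parent_texts` are hypothetical stand-ins, not your actual ParentStore API. The point is that the prompt and the UI sidebar are both built from the same list, so they cannot drift apart:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical child chunk; field names are assumptions."""
    text: str
    page: int
    parent_id: str


def expand_to_parents(children, parent_texts):
    """Group reranked children by parent.

    The parent supplies the text for generation, while the page list
    is collected only from the children that were actually retrieved.
    """
    blocks = {}  # dicts keep insertion order, so rerank order is preserved
    for child in children:
        block = blocks.setdefault(
            child.parent_id,
            {
                "parent_id": child.parent_id,
                "text": parent_texts[child.parent_id],
                "pages": [],
            },
        )
        if child.page not in block["pages"]:
            block["pages"].append(child.page)
    return list(blocks.values())
```

Use `block["text"]` when building the prompt and `block["pages"]` when filling source_docs for the sidebar, rather than deriving the two from separate lists.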
For hallucinations and repetition, 9000 chars is a lot for a 4B model. Try cutting context down to around 4000-5000 chars and add a prompt escape hatch like “if the answer isn’t in the provided text, say so”. At temp=0, hallucinations usually mean the retrieved context doesn’t actually contain the answer and the model fills the gap.
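Something along these lines could work as a starting point. It is only a sketch: `build_prompt`, the 4500-char default and the exact refusal wording are my assumptions, so tune them for your data:

```python
def build_prompt(question, passages, max_chars=4500):
    """Pack whole passages in ranked order until the budget is used up.

    The budget counts passage characters only, not the separators.
    """
    kept, used = [], 0
    for passage in passages:
        if used + len(passage) > max_chars:
            break  # stop at the first passage that does not fit
        kept.append(passage)
        used += len(passage)
    if not kept and passages:
        # The top passage alone is over budget: truncate it rather than send nothing.
        kept.append(passages[0][:max_chars])
    context = "\n\n".join(kept)
    return (
        "Answer using only the text below. "
        "If the answer is not in the text, reply exactly: "
        '"I could not find this in the provided pages."\n\n'
        f"TEXT:\n{context}\n\nQUESTION: {question}\nANSWER:"
    )
```

Dropping whole passages instead of cutting the last one mid-sentence keeps the model from trying to complete a truncated fact.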
Your chunking strategy and retrieval setup look solid overall!
The Parent-Child setup with FAISS+BM25+cross-encoder reranking is a solid architecture and your component choices are right. When citation accuracy is degrading despite a well-built retrieval stack, the failure usually isn’t in retrieval — it’s in the relationship between what’s retrieved and what the small model can actually ground against.
A few specific things worth checking with Gemma3:4B as the generator:
Parent chunks are likely too large for a 4B model to ground against reliably. The Parent-Child pattern works by retrieving small precise children and expanding to larger parents for context. But a 4B model has a much harder time maintaining citation discipline across a long parent chunk than a 70B+ model does. The grounding fidelity drops as parent length grows. If your parents are 2000+ tokens, try halving them and see if citation accuracy improves before changing anything else in the pipeline.
Cross-encoder reranking gives you precision but not faithfulness. Reranking surfaces the most query-relevant chunks. It doesn’t filter out chunks that are relevant-looking but informationally incomplete. A chunk can rank high on relevance and still lack the specific fact the model needs to ground a citation. The model then synthesises plausible-sounding text that reads grounded but isn’t.
Chunk quality at ingestion compounds the problem with small models. Large models often paper over chunk quality issues by inferring missing context. Small models can’t. A chunk that splits a key fact across a boundary, or that includes boilerplate that dilutes the actual signal, will produce a fabricated citation more reliably on a 4B model than on a 70B model. The same corpus that’s “fine” on a larger model will hallucinate visibly on Gemma3:4B.
The diagnostic I’d run before tuning further:
- Pull 50 cases where the citation was wrong or fabricated.
- Look at the chunks that were actually in context when the model generated the bad citation.
- Categorise the failure: was the right chunk retrieved but not used (generation problem), was a wrong chunk retrieved (retrieval problem), or was a chunk retrieved that contained the topic but not the specific fact (chunk quality problem)?
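If you log the context for each bad answer, the triage can be partly automated. The sketch below assumes you annotate each case by hand with the page that should have been cited and a short phrase that states the fact. The field names (`context_chunks`, `expected_page`, `expected_fact`) are hypothetical, and the substring match is a crude proxy that will miss paraphrases:

```python
from collections import Counter


def categorise(case):
    """Assign one failed citation to a rough failure category."""
    chunks = case["context_chunks"]
    fact = case["expected_fact"].lower()
    if any(fact in chunk["text"].lower() for chunk in chunks):
        return "generation"  # the fact was in context, the model did not use it
    if any(chunk["page"] == case["expected_page"] for chunk in chunks):
        return "chunk_quality"  # right page retrieved, the fact itself is missing
    return "retrieval"  # nothing from the right page was retrieved


def tally(cases):
    """Count how many failures fall into each category."""
    return Counter(categorise(case) for case in cases)
```

Whichever bucket dominates in `tally(cases)` is where the next round of tuning should go.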
In my experience the third category is the largest and the least addressed. People tune retrieval and reranking endlessly when the actual fix is to either re-chunk the source documents at finer granularity or to enrich chunks with structural metadata (section headers, source document, paragraph context) so the model has anchors to ground against.
On the Parent-Child specifically: consider adding source attribution metadata to each chunk at ingestion (document ID, page/section, last-modified timestamp) and including that metadata in the context passed to the generator. Small models follow structural prompts much more reliably than they follow implicit grounding. “Generate the answer and cite only from the chunks below, using the document_id field” outperforms “ground your answer in the retrieved chunks” by a wide margin on sub-7B models.
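As a rough illustration of passing that metadata through, here is one way the context could be laid out. The bracketed header format and the dict keys (`document_id`, `page`, `section`, `text`) are assumptions for the sketch, not a format any library requires:

```python
INSTRUCTION = (
    "Generate the answer and cite only from the chunks below, "
    "using the document_id field."
)


def format_chunks(chunks):
    """Prefix each chunk with a one-line metadata header the model can copy."""
    blocks = []
    for chunk in chunks:
        header = (
            f"[document_id={chunk['document_id']} | "
            f"page={chunk['page']} | section={chunk['section']}]"
        )
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(blocks)


def build_grounded_prompt(question, chunks):
    """Combine the structural instruction, the labelled chunks and the question."""
    return f"{INSTRUCTION}\n\n{format_chunks(chunks)}\n\nQUESTION: {question}\nANSWER:"
```

Because every citation has to name a `document_id` that appears in a header, a post-processing step can reject any cited ID that was never in the prompt.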