Hi everyone,
I’m building a fully local RAG application in Python (no cloud APIs) and running into several persistent issues. I’ll pin the full source below. Would really appreciate any advice from people who’ve dealt with similar setups.
---
### Stack overview
- **LLM:** Qwen2.5:7b via Ollama
- **Embeddings:** `intfloat/multilingual-e5-base` (HuggingFace, offline)
- **Vector store:** FAISS (child chunks) + BM25 (via LangChain)
- **Reranker:** `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1`
- **Chunking:** Parent-child strategy – MarkdownHeaderTextSplitter for parents, RecursiveCharacterTextSplitter for children
- **PDF extraction:** pymupdf4llm (fast) or MinerU (slow, for LaTeX-heavy docs)
- **Pipeline:** LangGraph with nodes: pre-retrieval → hybrid retrieve → rerank → build context → evaluate evidence → generate
- **UI:** Streamlit
Documents are primarily English-language academic PDFs (e.g. Montgomery’s Design and Analysis of Experiments, 720 pages). User queries are always in Slovak.
---
### Problem 1 – Cross-lingual retrieval failure (SK query → EN document)
This is the most painful issue. When a user asks *“čo to je replikácia?”* (“what is replication?”), the FAISS similarity search returns completely irrelevant chunks (confidence ~0.045) even though the word “replication” appears many times in the document.
My current workaround:
- Detect document language via `langdetect`
- If EN document detected, translate the SK query to EN using the LLM before retrieval
- Use the translated query in both FAISS and BM25
This partially works but is inconsistent – sometimes the LLM translates to “What is replication?”, sometimes it doesn’t, so results are non-deterministic even at temperature=0.
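A minimal sketch of one way to make the translation step repeatable: cache the first translation per normalized query so the same question always reaches FAISS and BM25 with the same wording. Everything named here is hypothetical (`CachedQueryTranslator`, `normalize_query`, and the `translate` callable standing in for the LLM call), and `doc_lang` is assumed to be the language code already obtained from `langdetect`.

```python
import unicodedata
from typing import Callable, Dict, Tuple


def normalize_query(query: str) -> str:
    """Apply Unicode NFC, lowercase, and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", query).lower().split())


class CachedQueryTranslator:
    """Translate a query once per (query, document language) and reuse the result."""

    def __init__(self, translate: Callable[[str], str], source_lang: str = "sk"):
        self._translate = translate      # hypothetical: wraps the LLM translation call
        self._source_lang = source_lang  # language the users write in
        self._cache: Dict[Tuple[str, str], str] = {}

    def __call__(self, query: str, doc_lang: str) -> str:
        # Document is in the same language as the query: nothing to translate.
        if doc_lang == self._source_lang:
            return query
        key = (normalize_query(query), doc_lang)
        if key not in self._cache:
            translated = self._translate(query).strip()
            # Fall back to the original query if the model returns nothing.
            self._cache[key] = translated or query
        return self._cache[key]
```

This only pins whatever the first call produced. If that first translation is wrong or left in Slovak, the cache repeats the mistake, so it addresses the non-determinism but not the translation quality.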
I also added a rescue BM25 search in `evaluate_evidence` as a last resort, which helps but retrieves chunks from wrong pages (e.g. page 424 instead of page 13 where the definition actually is).
**Questions:**
- Is `multilingual-e5-base` simply too weak for SK↔EN cross-lingual retrieval? Should I switch to a different model (e.g. `intfloat/multilingual-e5-large`, `BAAI/bge-m3`, or a dedicated cross-lingual model)?
- Is there a better approach than LLM-based query translation? I considered expanding the index with translated chunks but haven’t implemented it yet.
- Any experience with `mmarco-mMiniLMv2` reranker for non-English content? I suspect it’s poorly calibrated for Slovak and the confidence scores are systematically too low (~0.04 instead of expected ~0.3+).
---
### Problem 2 – Wrong page numbers in cited sources
My chunker injects page markers into the markdown before chunking, then detects which page each chunk belongs to by matching text probes against page texts. The logic works reasonably well for single-page chunks but breaks in two cases:
- **Large parents spanning multiple pages** – when `_split_large` splits them, all resulting chunks inherit the original parent’s page metadata instead of getting re-detected page numbers.
- **Dense mathematical/formula-heavy pages** – probes (min 15 chars) often don’t match because MinerU reformats LaTeX and the text doesn’t align with the original page content.
The cited pages are sometimes off by 5–15 pages, which makes source verification impossible.
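One alternative to probe matching, sketched below under stated assumptions: record the character offset at which each page starts while stripping the markers, then look up each chunk's page from its offset with a binary search. The marker format `<!-- page:N -->` is a placeholder rather than the chunker's actual marker, and `strip_markers`, `page_at`, and `split_with_pages` are hypothetical names. The fixed-size splitter is only a stand-in for the real one.

```python
import re
from bisect import bisect_right
from typing import Dict, List, Tuple

# Placeholder marker format; the real marker used by the chunker may differ.
PAGE_MARKER = re.compile(r"<!-- page:(\d+) -->")


def strip_markers(text: str) -> Tuple[str, List[int], List[int]]:
    """Remove page markers and return (clean_text, starts, pages), where
    starts[i] is the offset in clean_text at which page pages[i] begins."""
    parts: List[str] = []
    starts: List[int] = []
    pages: List[int] = []
    pos = 0        # read position in the marked text
    clean_len = 0  # length of the clean text emitted so far
    for m in PAGE_MARKER.finditer(text):
        segment = text[pos:m.start()]
        parts.append(segment)
        clean_len += len(segment)
        starts.append(clean_len)
        pages.append(int(m.group(1)))
        pos = m.end()
    parts.append(text[pos:])
    return "".join(parts), starts, pages


def page_at(offset: int, starts: List[int], pages: List[int], default: int = 1) -> int:
    """Page containing the character at `offset` in the clean text."""
    i = bisect_right(starts, offset) - 1
    return pages[i] if i >= 0 else default


def split_with_pages(clean: str, starts: List[int], pages: List[int], size: int) -> List[Dict]:
    """Fixed-size split (stand-in for the real splitter) that records the
    first and last page of every piece instead of inheriting the parent's."""
    pieces = []
    for begin in range(0, len(clean), size):
        end = min(begin + size, len(clean))
        pieces.append({
            "text": clean[begin:end],
            "page_start": page_at(begin, starts, pages),
            "page_end": page_at(end - 1, starts, pages),
        })
    return pieces
```

Because the lookup depends only on offsets within the same markdown that gets chunked, it should not be affected by MinerU reformatting LaTeX, as long as the markers themselves are placed correctly. With LangChain splitters, `add_start_index=True` should provide the per-chunk offset needed for `page_at`.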
**Questions:**
- Is there a more reliable strategy for page attribution in RAG chunking?
- Would embedding page number tokens directly into chunk text help BM25/FAISS associate chunks with correct pages?
---
### Problem 3 – Poor Slovak output quality
The LLM (Qwen2.5:7b) receives English context and is instructed via system prompt to answer in Slovak. The output Slovak is grammatically broken – literal word-by-word translations, wrong declensions, invented compound words (e.g. “olejová hniloba” for “oil quench”, “oholenie vzorku” for “quenching a specimen”).
Current system prompt instructs:
- Always answer in Slovak
- Don’t translate literally, explain in your own words
- Keep English technical terms in parentheses if unsure
This helps somewhat but the quality is still poor for technical content.
**Questions:**
- Is Qwen2.5:7b simply not good enough for EN→SK technical translation in context? Would a larger model (Qwen2.5:14b, gemma3:12b) make a significant difference?
- Has anyone tried a two-step approach: generate answer in English first, then translate to Slovak as a second LLM call?
- Any prompt engineering tricks that worked for you for multilingual RAG output?
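To make the two-step idea concrete, here is a minimal sketch: the first call answers in English from the English context, the second call only translates that answer into Slovak. `answer_two_step` is a hypothetical name, `llm` is assumed to be any callable that sends a prompt to the local model and returns text, and the prompt wording is illustrative only.

```python
from typing import Callable, Dict


def answer_two_step(question_en: str, context_en: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Generate an English answer first, then translate it to Slovak."""
    draft_prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_en}\n\n"
        f"Question: {question_en}\n"
        "Answer in English:"
    )
    english = llm(draft_prompt).strip()

    translate_prompt = (
        "Translate the following answer into natural Slovak. "
        "Keep English technical terms in parentheses if unsure.\n\n"
        f"{english}\n\n"
        "Slovak translation:"
    )
    slovak = llm(translate_prompt).strip()

    # Returning both makes it possible to check whether an error came from
    # the answer itself or from the translation step.
    return {"english": english, "slovak": slovak}
```

The trade-off is a second LLM call per answer, and the translation step can still produce the same kind of invented terms if the model's Slovak is the limiting factor.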
---
### Problem 4 – Reranker confidence threshold causes false abstentions
The cross-encoder produces confidence scores around 0.04–0.07 for relevant Slovak/English pairs. My threshold is set to 0.15 (already lowered from the original 0.32). When the confidence falls below the threshold, the system returns “not found in documents” even when the correct answer is there.
I added a keyword override (check if query words appear in context docs) but it’s unreliable for cross-lingual queries because Slovak words don’t match English document text.
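The abstention rule described above can be sketched roughly as follows, to show why the override misfires across languages. This is an illustration with hypothetical names (`keyword_overlap`, `should_abstain`) and made-up sample text, not the code from the pastebins.

```python
import re
from typing import Iterable, List


def keyword_overlap(query: str, docs: Iterable[str], min_len: int = 4) -> bool:
    """True if any query word of at least `min_len` characters occurs in the docs."""
    words = {w for w in re.findall(r"\w+", query.lower()) if len(w) >= min_len}
    text = " ".join(docs).lower()
    return any(w in text for w in words)


def should_abstain(top_score: float, query: str, docs: List[str], threshold: float = 0.15) -> bool:
    """Abstain only if the reranker score is low AND no query keyword appears."""
    if top_score >= threshold:
        return False
    return not keyword_overlap(query, docs)


# Made-up sample context for illustration.
docs = ["Replication means an independent repeat run of each factor combination."]

# The Slovak query shares no surface words with the English context,
# so the override cannot fire and the system abstains.
abstain_sk = should_abstain(0.045, "čo to je replikácia?", docs)

# The same question in English matches on "replication".
abstain_en = should_abstain(0.045, "What is replication?", docs)
```

Here `abstain_sk` is `True` and `abstain_en` is `False`, which suggests running the keyword check on the translated query rather than the original Slovak one.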
### Code
*(pinning below)*
- `document_processor.py` – PDF extraction + parent-child chunking: https://pastebin.com/m8egQ7HY
- `vector_store.py` – FAISS + BM25 + E5Embeddings wrapper: https://pastebin.com/4kkhsg8M
- `rag_graph.py` – full LangGraph pipeline: https://pastebin.com/P31pGiie
- `parent_store.py` – https://pastebin.com/xwNeAMnE