I'm building a multilingual chatbot (Italian, English, Spanish, etc.) that acts as a travel consultant for a specific city, using semantic caching with a vector database to reduce LLM API costs and latency.
## Current Architecture
Cached responses are stored with embeddings and language metadata:
```python
# English entry
{
    "embedding": [0.23, 0.45, ...],
    "metadata": {
        "question": "what are the best restaurants?",
        "answer": "The best restaurants are: Trattoria Roma, Pizzeria Napoli...",
        "language": "en"
    }
}

# Italian entry
{
    "embedding": [0.24, 0.46, ...],
    "metadata": {
        "question": "quali sono i migliori ristoranti?",
        "answer": "I migliori ristoranti sono: Trattoria Roma, Pizzeria Napoli...",
        "language": "it"
    }
}
```

## The Problem
Since embeddings are semantic, “best restaurants” (English) and “migliori ristoranti” (Italian) have very similar vectors. Without proper filtering, an Italian user asking “ristoranti” might get the cached English response.
My current approach: filter the vector search on the `language` metadata field.
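A minimal sketch of that filtered lookup, using cosine similarity over the cache entries shown above; the function names and the `0.85` threshold are illustrative choices, not anything prescribed by a particular vector database:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cache_lookup(query_embedding, language, cache, threshold=0.85):
    """Return the best same-language cached answer, or None (cache miss)."""
    best_answer, best_score = None, -1.0
    for entry in cache:
        if entry["metadata"]["language"] != language:
            continue  # strict language boundary: never cross buckets
        score = cosine(query_embedding, entry["embedding"])
        if score > best_score:
            best_answer, best_score = entry["metadata"]["answer"], score
    # Below the threshold, prefer a miss over a wrong answer
    return best_answer if best_score >= threshold else None
```

In a real vector database the language filter would be pushed down into the search query rather than applied in Python, but the contract is the same: same-language candidates only, and a miss when nothing clears the threshold.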
Hello @emylee, if you want this to be reliable, treat language as a strict cache boundary, not something you infer from embedding similarity. A practical setup:

- Store the language in metadata and always filter on it during cache lookup.
- Add a similarity-score threshold so short or ambiguous inputs return a cache miss instead of a wrong-language answer.
- For very short messages, run a lightweight language router that returns structured output like `{language, confidence}`, so you always get back the same structure.
- If confidence is low, skip the semantic cache for that turn (or only check the user's last known language bucket).
- Save a small per-user language preference (for example, keyed by WhatsApp phone ID) so future short messages are easier to route without needing full chat history.
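To make the router idea concrete, here is a toy classifier that always returns the same `{language, confidence}` structure. The word lists are deliberately crude stand-ins; in production you would swap them for a real language-ID model (fastText, langdetect) or an LLM call with structured output:

```python
import re

# Toy hint lists -- placeholders for a real language-identification model.
HINTS = {
    "it": {"quali", "sono", "migliori", "ristoranti", "dove", "ciao"},
    "en": {"what", "are", "best", "restaurants", "where", "hello"},
    "es": {"cuales", "son", "mejores", "restaurantes", "donde", "hola"},
}

def classify_language(text):
    """Always return the same shape: {'language': code, 'confidence': 0..1}."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(words & hints) for lang, hints in HINTS.items()}
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[best] / total if total else 0.0
    return {"language": best, "confidence": confidence}
```

The fixed output structure is the important part: downstream code can branch on `confidence` without caring how the classification was produced.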
Here is how you could structure the flow:

1. Classify `{language, confidence}`.
2. If confidence is high (for example >= 0.8), query the semantic cache with `filter={"language": language}` and your score threshold.
3. If confidence is low, treat it as a miss and call the LLM.
4. After generating a fresh answer, write it back to the cache with language, city, and a normalized intent key.
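The four steps above can be sketched as one function. All collaborators (`embed`, `classify`, `cache_lookup`, `cache_write`, `call_llm`) are injected hypothetical callables, and the intent normalization is a placeholder:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tune against real traffic

def handle_turn(text, city, embed, classify, cache_lookup, cache_write, call_llm):
    """One chatbot turn: classify -> filtered cache lookup -> LLM fallback -> write-back."""
    route = classify(text)  # step 1: {language, confidence}
    confident = route["confidence"] >= CONFIDENCE_THRESHOLD
    if confident:
        cached = cache_lookup(embed(text), route["language"])  # step 2: filtered lookup
        if cached is not None:
            return cached
    answer = call_llm(text)  # step 3: low confidence or cache miss
    if confident:  # step 4: write back only when the language bucket is certain
        cache_write({
            "embedding": embed(text),
            "metadata": {
                "question": text,
                "answer": answer,
                "language": route["language"],
                "city": city,
                "intent": " ".join(text.lower().split()),  # placeholder normalization
            },
        })
    return answer
```

Skipping the write-back on low confidence is a deliberate choice here: a mislabeled cache entry poisons every future lookup in that bucket, so it is cheaper to pay for one extra LLM call.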
LangChain vector stores support metadata filtering, and their retrievers support score-threshold retrieval, so both pieces are available out of the box. Let me know if you have further questions.
Language should be treated as part of the cache semantics: queries in different languages are not cache-equivalent unless you explicitly plan to translate the response. As a result, the cache key must account for both the user's intent and the response language, even when the language cannot be determined with certainty.
**Recommended architecture: language-agnostic intent cache with language-specific rendering**
This approach provides the most robust and scalable long-term solution.
Use a multilingual embedding model (e.g., text-embedding-3-large, LaBSE, E5-multilingual) so that semantically equivalent queries across languages map to the same intent.
As a result:

- "best restaurants"
- "migliori ristoranti"
- "mejores restaurantes"

all resolve to the same intent cache entry.
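A sketch of that resolution step. The vectors below are hand-made stand-ins: a real multilingual model would place the three phrasings close together in embedding space, which is all this example relies on. The intent names and the `0.98` threshold are illustrative:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Stand-in embeddings; in practice these come from the multilingual model.
QUERY_EMBEDDINGS = {
    "best restaurants": [0.90, 0.10],
    "migliori ristoranti": [0.88, 0.12],
    "mejores restaurantes": [0.91, 0.09],
}

INTENT_CACHE = [
    {"intent": "best_restaurants", "embedding": [0.90, 0.10]},
    {"intent": "opening_hours", "embedding": [0.10, 0.90]},
]

def resolve_intent(query_embedding, threshold=0.98):
    """Map a query embedding to the nearest cached intent, or None below threshold."""
    best = max(INTENT_CACHE, key=lambda e: cosine(query_embedding, e["embedding"]))
    score = cosine(query_embedding, best["embedding"])
    return best["intent"] if score >= threshold else None
```

Because retrieval happens in a shared embedding space, the cache hit rate compounds across languages: an Italian query can be served by an intent entry first populated by an English user.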
**Language-specific response rendering**
Once the intent is resolved, generate or retrieve a cached response in the requested language. This keeps retrieval language-agnostic while ensuring the final answer is always language-correct.
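One way to structure the rendering side, assuming each intent entry carries a per-language `answers` map and a hypothetical `generate` callable that produces a new rendering on a miss:

```python
# One intent entry holds renderings in several languages.
entry = {
    "intent": "best_restaurants",
    "answers": {
        "en": "The best restaurants are: Trattoria Roma, Pizzeria Napoli...",
        "it": "I migliori ristoranti sono: Trattoria Roma, Pizzeria Napoli...",
    },
}

def render_answer(entry, language, generate):
    """Return the cached rendering for `language`, generating and caching it on a miss."""
    if language not in entry["answers"]:
        # Retrieval stayed language-agnostic; only rendering calls the LLM.
        entry["answers"][language] = generate(entry["intent"], language)
    return entry["answers"][language]
```

The first Spanish user pays for one generation (or translation) call; every later Spanish query for the same intent is a pure cache hit, while the English and Italian renderings are untouched.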