Semantic caching strategy for multilingual chatbot: how to handle language-specific cache entries?


I'm building a multilingual chatbot (Italian, English, Spanish, etc.) that acts as a travel consultant for a specific city, using semantic caching with a vector database to reduce LLM API costs and latency.

## Current Architecture

Cached responses are stored with embeddings and language metadata:

```python
# English entry
{
  "embedding": [0.23, 0.45, ...],
  "metadata": {
    "question": "what are the best restaurants?",
    "answer": "The best restaurants are: Trattoria Roma, Pizzeria Napoli...",
    "language": "en"
  }
}

# Italian entry
{
  "embedding": [0.24, 0.46, ...],
  "metadata": {
    "question": "quali sono i migliori ristoranti?",
    "answer": "I migliori ristoranti sono: Trattoria Roma, Pizzeria Napoli...",
    "language": "it"
  }
}
```

## The Problem

Since embeddings are semantic, “best restaurants” (English) and “migliori ristoranti” (Italian) have very similar vectors. Without proper filtering, an Italian user asking “ristoranti” might get the cached English response.

My current approach: filter the vector search by language metadata:

```python
results = vector_db.query(
    embedding=embed(user_message),
    filter={"language": user_language},
    top_k=1
)
```
This works IF I can reliably detect the user’s language. But:

  • Messages are often very short (“museums”, “metro”, “parking”)

  • Language detection libraries (langdetect, fastText) are unreliable with < 20 characters

  • The chatbot is stateless (no conversation history for caching efficiency)

  • Platform is WhatsApp (no browser headers available)

**What’s the recommended semantic caching strategy for multilingual chatbots when the user’s language cannot be reliably detected from short messages?**

Hello @emylee, if you want this to be reliable, treat language as a strict cache boundary, not something you infer from embedding similarity. A practical setup is:

  1. Store language in metadata and always filter on it during cache lookup.
  2. Add a similarity score threshold so short/ambiguous inputs return a cache miss instead of the wrong-language answer.
  3. For very short messages, run a lightweight language router that returns structured output such as language + confidence (structured output guarantees you always get the same fields back).
  4. If confidence is low, skip semantic cache for that turn (or only check the user’s last known language bucket).
  5. Save a small per-user language preference (for example by WhatsApp phone ID) so future short messages are easier to route, without needing full chat history.
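A minimal sketch of step 3's router. Everything here is illustrative: a real router would call an LLM with structured output, while this stand-in just matches a few hardcoded keywords per language, so its confidence scores are only meaningful for vocabulary it knows.

```python
# Toy language router returning {"language": ..., "confidence": ...}.
# Stand-in for an LLM call with structured output; the keyword lists
# and the scoring rule are illustrative, not from any library.
KNOWN_WORDS = {
    "en": {"restaurants", "museums", "metro", "parking", "best"},
    "it": {"ristoranti", "musei", "metro", "parcheggio", "migliori"},
    "es": {"restaurantes", "museos", "metro", "aparcamiento", "mejores"},
}

def route_language(message: str) -> dict:
    tokens = set(message.lower().split())
    # Score each language by the fraction of tokens it recognizes.
    scores = {
        lang: len(tokens & words) / max(len(tokens), 1)
        for lang, words in KNOWN_WORDS.items()
    }
    best_lang = max(scores, key=scores.get)
    return {"language": best_lang, "confidence": scores[best_lang]}
```

Note that a word like "metro" exists in all three languages and ties the scores, which is exactly the ambiguous-short-message case where step 4's low-confidence fallback should kick in.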

Here is how you could structure a flow:

  1. Classify {language, confidence}.
  2. If confidence is high (for example >= 0.8), query semantic cache with filter={"language": language} and your score threshold.
  3. If confidence is low, treat it as a miss and call the LLM.
  4. After generating a fresh answer, write it back to cache with language, city, and a normalized intent key.
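The four steps above can be sketched as a single per-turn function. The `classify`, `cache_query`, `call_llm`, and `cache_put` callables are assumptions standing in for your actual router, vector DB client, and LLM client, and both thresholds are illustrative values to tune against your own embedding model:

```python
SCORE_THRESHOLD = 0.85       # illustrative; tune against your embedding model
CONFIDENCE_THRESHOLD = 0.8   # illustrative language-router cutoff

def handle_turn(user_message, classify, cache_query, call_llm, cache_put):
    # 1. Classify {language, confidence}.
    route = classify(user_message)
    # 2./3. Only trust the semantic cache when the language call is confident.
    if route["confidence"] >= CONFIDENCE_THRESHOLD:
        hit = cache_query(user_message, language=route["language"],
                          score_threshold=SCORE_THRESHOLD)
        if hit is not None:
            return hit
    # 3./4. Low confidence or cache miss: call the LLM, then write the
    # fresh answer back under the detected language bucket.
    answer = call_llm(user_message)
    cache_put(user_message, answer, language=route["language"])
    return answer
```

Injecting the dependencies keeps the flow testable with in-memory fakes before wiring in the real vector DB and LLM clients.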

LangChain vector stores support metadata filtering, and their retrievers support score-threshold retrieval (`search_type="similarity_score_threshold"`), so both pieces of this are available out of the box. Let me know if you have further questions.

Hi @emylee ,

Language should be treated as part of the cache semantics. Multilingual queries should not be considered cache-equivalent unless you explicitly plan to translate the response. As a result, the cache key must account for both the user’s intent and the response language, even when the language cannot be determined with certainty.

## Recommended architecture: language-agnostic intent cache with language-specific rendering

This approach provides the most robust and scalable long-term solution.

### How it works

Split the caching layer into two distinct parts:

**1. Intent cache (semantic, multilingual, language-independent)**

Store only the canonical intent and its structured output, without any language-specific phrasing.

```json
{
  "embedding": [...],
  "metadata": {
    "intent_id": "best_restaurants",
    "city": "rome"
  },
  "payload": {
    "restaurant_ids": [12, 45, 78]
  }
}
```

Use a multilingual embedding model (e.g., text-embedding-3-large, LaBSE, E5-multilingual) so that semantically equivalent queries across languages map to the same intent.

As a result:

  • “best restaurants”

  • “migliori ristoranti”

  • “mejores restaurantes”

all resolve to the same intent cache entry.
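A sketch of the language-agnostic lookup. The hand-made 3-dimensional vectors stand in for a real multilingual embedding model, and the cache entries and threshold are illustrative:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Intent cache: one entry per intent; no language in the key.
INTENT_CACHE = [
    {"embedding": [0.9, 0.1, 0.0],
     "metadata": {"intent_id": "best_restaurants", "city": "rome"},
     "payload": {"restaurant_ids": [12, 45, 78]}},
    {"embedding": [0.1, 0.9, 0.0],
     "metadata": {"intent_id": "metro_tickets", "city": "rome"},
     "payload": {"ticket_info_id": 3}},
]

def resolve_intent(query_embedding, threshold=0.8):
    """Return the nearest intent entry, or None below the threshold."""
    best = max(INTENT_CACHE,
               key=lambda e: cosine(e["embedding"], query_embedding))
    if cosine(best["embedding"], query_embedding) < threshold:
        return None  # cache miss: fall back to the LLM
    return best
```

With a multilingual model, "best restaurants" and "migliori ristoranti" embed close together, so both would land near the same `best_restaurants` vector and resolve to the same entry.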

**2. Language-specific response rendering**

Once the intent is resolved, generate or retrieve a cached response in the requested language. This keeps retrieval language-agnostic while ensuring the final answer is always language-correct.
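A sketch of the rendering step: the structured payload from the intent cache is turned into a final answer via per-language templates. The template strings and the restaurant lookup table are illustrative placeholders for whatever your data layer provides:

```python
# Illustrative data layer: maps the IDs stored in the intent cache to names.
RESTAURANT_NAMES = {12: "Trattoria Roma", 45: "Pizzeria Napoli",
                    78: "Osteria del Sole"}

# One template per supported language, keyed by the detected language code.
TEMPLATES = {
    "en": "The best restaurants are: {names}",
    "it": "I migliori ristoranti sono: {names}",
    "es": "Los mejores restaurantes son: {names}",
}

def render_response(payload: dict, language: str) -> str:
    # Fall back to English if the requested language has no template yet.
    template = TEMPLATES.get(language, TEMPLATES["en"])
    names = ", ".join(RESTAURANT_NAMES[i] for i in payload["restaurant_ids"])
    return template.format(names=names)
```

Rendered strings can themselves be cached, keyed by `(intent_id, language)`, so the intent cache stays small and language-free while repeat answers in any one language stay cheap.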