❓ How to Reduce Double Agent Calls in a ReAct Architecture (LangGraph) and Lower Latency

Hi everyone,

I’m building a LangGraph pipeline where my rag_agent sometimes ends up calling the model twice in sequence (agent → tool → agent), which doubles my LLM latency.

Here’s what my trace looks like (screenshot attached):

rag_agent
 ├── agent → call_model (2.29s)
 ├── tools → qdrant_pdf_form_retriever (0.49s)
 └── agent → call_model (2.38s)

Each time, the second agent call triggers a new LLM request, so my total latency jumps to 5+ seconds even for a simple query.

Questions:

  1. How can I structure my LangGraph / React (AI runtime) so that:

    • The agent is only called once if the first call already got a result?

    • I can “short-circuit” the graph to avoid calling the second agent unnecessarily?

    • Are there best practices for latency reduction in multi-agent setups (e.g., caching, combining prompts, conditional nodes)?

  2. Another question — should I manually call the tool on the first step (bypassing the LLM tool-selection), then feed the results back into the LLM?

    • My worry is that the user’s query can often be sloppy (typos, vague wording), so if I skip the first LLM pass, my vector store search might not be as precise.

    • Is this a recommended approach — let the LLM “clean” or rephrase the query first to generate better tool arguments, even if that adds latency?

Any example graphs, code snippets, or best practices would be super helpful. 🙏

Thanks!


Hi @Petar 👋 Could you share the definition of the graph and the definition of the RAG agent? That would help in figuring out a solution.

And regarding the theoretical questions, IMHO:

Practical latency levers for multi-agent setups

  • Cache aggressively
    • Embedding cache for documents + retrieval cache keyed by (query_norm, top_k, filters) (see the sketch after this list).
    • LLM response cache keyed by (system+prompt+docs hash, model, temperature, tools-mask). LangChain has InMemoryCache/RedisCache; at the infra level, add your own layer because tool outputs are deterministic for a given key.
  • Rerankers > second LLM pass
    • Use a lightweight cross-encoder reranker (e.g., Cohere Rerank/bge-reranker on CPU) to improve context quality without another LLM call.
  • Parallelism with cancellation
    • Kick off retrieval in parallel with the planner call (speculative). If the planner decides “no tool,” cancel the retrieval; if it wants a tool, you may already have the results.
  • Smaller models for planning/rewrite
    • Do “query rewrite” or “tool arg extraction” with a small, fast model, keep the big model only for final answer.
  • Tight prompts
    • Use explicit guardrails: “You may call at most one tool,” “If confidence < X, answer directly,” “Prefer final answer if retrieved docs do not change the top-3 facts.”
  • Hybrid retrieval
    • BM25 + dense + field filters; tolerate typos with fuzzy match; this reduces the need for an LLM rewrite just to make retrieval work.
  • Batch user messages
    • If your UI sends keystroke previews, debounce on the client and only send one server request per “submit”.
  • Stream early, resolve late
    • Start streaming the fallback answer immediately. If a tool returns high-impact info, update the answer (React/AI runtime makes this feel natural).
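
For the retrieval cache keyed by (query_norm, top_k, filters) mentioned above, a minimal sketch; the Redis key layout, the cached_retrieve wrapper, and the search_qdrant helper are illustrative assumptions, not any particular LangChain API:

import hashlib
import json

import redis  # assumption: a shared Redis instance is available

r = redis.Redis(host="localhost", port=6379)

def retrieval_cache_key(query: str, top_k: int, filters: dict) -> str:
    query_norm = " ".join(query.lower().split())  # normalize case and whitespace
    payload = json.dumps({"q": query_norm, "k": top_k, "f": filters}, sort_keys=True)
    return "retrieval:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_retrieve(query: str, top_k: int, filters: dict, ttl: int = 300):
    key = retrieval_cache_key(query, top_k, filters)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: skip the vector store entirely
    results = search_qdrant(query, top_k=top_k, filters=filters)  # assumption: your existing search helper
    r.set(key, json.dumps(results), ex=ttl)
    return results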

Should you manually call the tool first?

Short answer: often yes - but with safeguards.

When “tool-first” is great

  • Your retriever is robust (hybrid retrieval, typo-tolerant, synonyms).
  • You employ a cheap rewrite/normalize step only when needed (router pattern).
  • You rerank to ensure the LLM gets tight context.

When to let the LLM go first

  • The domain has messy, multi-intent queries where a plan is needed (e.g., “Compare vendor X v. Y, then draft an email”).
  • The tool needs non-obvious parameters that benefit from an LLM “schema fill”.

Balanced approach (recommended; see the sketch after this list):

  1. Heuristic/embedding router decides:
    • “Looks retrievable” - run retriever immediately.
    • “Messy/ambiguous” - quick small-model rewrite, then retrieve.
  2. Only if the plan explicitly says the retrieved info changes the answer - do a second model call to integrate.
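
A rough sketch of step 1, heuristics only; the looks_messy checks and thresholds are illustrative assumptions (swap in an embedding classifier or a small model if you need semantic routing):

import re

def looks_messy(query: str) -> bool:
    # Heuristic "messy/ambiguous" check: very short, or lots of non-word noise (typos, fragments)
    q = query.strip()
    if len(q) < 4:
        return True
    nonword_ratio = len(re.findall(r"[^\w\s]", q)) / max(len(q), 1)
    return nonword_ratio > 0.3

def route_query(query: str) -> str:
    # "retrieve": run the retriever immediately (happy path, no extra LLM call)
    # "rewrite_then_retrieve": quick small-model rewrite first, then retrieve
    return "rewrite_then_retrieve" if looks_messy(query) else "retrieve"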

Hi! Thanks a ton for the thoughtful pointers — super helpful.

My setup (so we’re on the same page):

  • Platform: Deployed on LangGraph Platform.

  • Agent style: ReAct via create_react_agent(…).

  • Tools:

    • qdrant_pdf_form_retriever → returns { text, score, chunk_id, page, page_label } (page labels are required in every final answer).

    • get_document_insights (one-time overview).

  • Model: typically gpt-4.1-mini (temp=0), sometimes overridden per-request via config.

  • State: user_id, form_id, embedding_dimension, etc.

  • Typical trace (latency pain):

rag_agent
 ├─ agent → call_model (~2.3s)
 ├─ tools → qdrant_pdf_form_retriever (~0.5s)
 └─ agent → call_model (~2.4s)
  • So I end up with 2x LLM calls and 5s+ even for simple queries.

What the code guarantees today (re: citations):

  • The retriever always returns page-level metadata and I format the final answer to include page_label alongside any quoted text.
# Tool (shortened)
from typing import Annotated

from langchain_core.messages import ToolMessage
from langchain_core.tools import InjectedToolCallId, tool
from langgraph.types import Command

@tool
async def qdrant_pdf_form_retriever(..., query: str, tool_call_id: Annotated[str, InjectedToolCallId]):
    # search in Qdrant (filtered by user_id/form_id/embedding_dim)
    # ...
    return Command(update={
        "retrieved_documents": [
            {
              "text": doc.payload.get('content', ''),
              "score": float(doc.score),
              "chunk_id": doc.payload.get('chunk_index'),
              "page": doc.payload.get('page_number'),
              "page_label": doc.payload.get('page_label'),
            } for doc in retrieved_docs
        ],
        "messages": [ToolMessage(content_summary, tool_call_id=tool_call_id)]
    })

Questions for you (caching & graph shape):

  1. Where to put the cache in a LangGraph Platform deployment?

    • Would you recommend:

    • (a) LLM response cache

    • (b) Retrieval cache

    • (c) Embedding cache

      I’m trying to understand the best practice specifically on LangGraph Platform: do you typically front this with Redis/Memcache, or rely on LC’s built-in cache in-process, or something platform-native?

  2. Graph definition to short-circuit the second LLM call.

    Right now the ReAct pattern yields: agent → tool → agent. Would you advise splitting into explicit nodes:

    router (small/fast model or heuristic) → retriever → synth (single LLM)

    • so that final synthesis is one call, and we avoid the second agent → call_model?

      If you’ve done this on LangGraph Platform, any minimal example or recommended node pattern?

  3. Cancellation/parallelism in practice.

    On LangGraph Platform, is there a recommended way to kick off retrieval in parallel with a quick planner call and cancel the losing task? Any gotchas you’ve seen?


What I’m aiming for

  • Always return citations with page_label.

  • At most one LLM synthesis call per query in the common case.

  • Aggressive caching (retrieval + LLM) without breaking tool determinism.

If you can share a small snippet that shows how you’d wire the router → retrieve → synth pattern (and where you’d attach caches in a LangGraph Platform deployment), that would help me a lot. Thanks again!

Hi @Petar

What about this? Let me know whether it makes sense for you.

How to reduce double agent calls (LangGraph) and lower latency

TL;DR

  • Avoid ReAct loops for RAG. Build an explicit graph: router (optional) → retrieve → synth. That yields a single LLM call in the common path.
  • Short‑circuit with control flow (conditional edges or Command.goto) so you never do a second LLM call.
  • Add node-level caching (LangGraph CachePolicy) + LLM cache (LangChain cache) + retrieval cache (TTL keyed by query + filters). Consider embedding cache on ingestion side.
  • Run cheap steps in parallel (e.g., retrieval + insights) and defer/merge before a single synth call. Stream only the synth tokens to the UI.

Why you see two LLM calls

create_react_agent(...) follows a tool-calling loop: agent → tool → agent. For RAG, that means (1) a planning/tool-selection LLM call and (2) a final synthesis call. If you already know you want retrieval, you can remove the first LLM call by controlling the graph yourself.

Relevant docs: Graph API (custom control flow, caching), prebuilt agents (ReAct), streaming, persistence.

  • Graph API: Add node caching, conditional edges, parallelism, Command-based routing
  • Prebuilt agent quickstart: ReAct adds an extra LLM step
  • Streaming: stream tokens from a single synth call
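
For contrast, this is roughly what the current setup looks like (assuming the prebuilt create_react_agent wired to your qdrant tool); any turn that uses the tool pays for two model calls:

from langchain.chat_models import init_chat_model
from langgraph.prebuilt import create_react_agent

model = init_chat_model("openai:gpt-4.1-mini", temperature=0)
# agent -> tools -> agent loop: one LLM call to pick the tool, a second to synthesize the answer
agent = create_react_agent(model, tools=[qdrant_pdf_form_retriever])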

Recommended graph shape (single LLM call)

  • router (optional): small/fast model or heuristic deciding whether to use RAG
  • retrieve: call your qdrant_pdf_form_retriever tool; update state with {retrieved_documents, messages: [ToolMessage(...)]}
  • synth: one LLM call that uses retrieved context and must output citations including page_label
# pip install -U langgraph "langchain[openai]"  # or provider you use
from typing_extensions import TypedDict, Annotated
from typing import List, Dict, Any
import operator

from langchain.chat_models import init_chat_model
from langchain_core.messages import AnyMessage, HumanMessage, AIMessage, SystemMessage
from langgraph.graph import StateGraph, START, END, MessagesState
from langgraph.graph.message import add_messages
from langgraph.types import CachePolicy, Command
from langgraph.cache.memory import InMemoryCache  # or SqliteCache

# ----- State -----
class State(TypedDict):
    messages: Annotated[List[AnyMessage], add_messages]
    query: str
    retrieved_documents: Annotated[List[Dict[str, Any]], operator.add]
    should_rag: bool

# ----- Optional router (cheap) -----
# Heuristic example: do RAG unless the query is very short/simple
# Replace with a small model if you need semantic routing.
def router(state: State):
    q = state.get("query", "").strip()
    do_rag = not (len(q) < 6 or q.lower() in {"hi", "hello", "thanks"})
    return {"should_rag": do_rag}

# ----- Retrieval -----
# If you already have an @tool for qdrant, you can call it directly here.
# This node should populate retrieved_documents and (optionally) a ToolMessage.
def retrieve(state: State):
    query = state["query"]
    # Example: YOUR existing tool invocation (pseudo)
    # results = await qdrant_pdf_form_retriever(query=query, ...)
    # Here we simulate one doc with the fields you guaranteed.
    results = [
        {
            "text": "...doc text...",
            "score": 0.82,
            "chunk_id": 12,
            "page": 4,
            "page_label": "4",
        }
    ]
    return {
        "retrieved_documents": results,
        # If you keep messages, you can add a short summary message for transparency
        # (use an AIMessage; a bare "tool" message would require a tool_call_id)
        "messages": [AIMessage(content=f"Retrieved {len(results)} passages for query: {query}")]
    }

# ----- Synthesis (single LLM call) -----
MODEL = init_chat_model("openai:gpt-4.1-mini", temperature=0)  # or your default

def synth(state: State):
    docs = state.get("retrieved_documents", [])
    content_blocks = []
    for d in docs:
        content_blocks.append(
            f"[page_label={d.get('page_label')}] {d.get('text','')}"
        )
    system = SystemMessage(
        content=(
            "You are a RAG assistant. Answer strictly from the provided context. "
            "Always include page_label next to any quote or citation."
        )
    )
    user = HumanMessage(content=state["query"]) 
    context = HumanMessage(content="\n\n".join(content_blocks) or "No context")

    answer = MODEL.invoke([system, context, user])
    return {"messages": [answer]}

# ----- Build graph -----
b = StateGraph(State)

b.add_node("router", router)
b.add_node("retrieve", retrieve, cache_policy=CachePolicy(ttl=120))
# Cache synth only if your prompts + docs are stable; otherwise prefer retrieval-only cache
b.add_node("synth", synth, cache_policy=CachePolicy(ttl=60))

b.add_edge(START, "router")

# Conditional: if router says no RAG, go straight to synth with empty context
from typing import Literal

def route_after_router(state: State) -> Literal["retrieve", "synth"]:
    return "retrieve" if state.get("should_rag", True) else "synth"

b.add_conditional_edges("router", route_after_router, ["retrieve", "synth"])
b.add_edge("retrieve", "synth")
b.add_edge("synth", END)

graph = b.compile(cache=InMemoryCache())  # swap to SqliteCache() for persistence

# ----- Usage -----
inputs = {
    "messages": [
        ("user", "Find the filing deadlines and cite the page labels."),
    ],
    "query": "Find the filing deadlines and cite the page labels.",
}
result = graph.invoke(inputs)
# The last message is your answer, produced by a single LLM call

Notes

  • The graph never loops agent→tool→agent. Retrieval is a normal node; synthesis is a single LLM call.
  • CachePolicy adds per-node caching with TTL; the compile(cache=...) enables node caching.
  • Keep messages if you stream progress to the UI; otherwise you can remove message history.

Short-circuiting and control flow

  • If retrieval is not needed, route directly to synth (as above).
  • You can also short‑circuit with Command from a node, e.g., return Command(update=..., goto="synth") to skip downstream nodes (see the sketch after this list).
  • Set a low recursion limit if you ever introduce loops; for this linear graph, recursion is unnecessary.
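
A minimal sketch of the Command-based short-circuit, written as a variant of the router node from the graph above (same illustrative length/greeting heuristic):

from typing import Literal

from langgraph.types import Command

def router(state: State) -> Command[Literal["retrieve", "synth"]]:
    q = state.get("query", "").strip()
    if len(q) < 6 or q.lower() in {"hi", "hello", "thanks"}:
        # Short-circuit: skip retrieval and jump straight to synthesis with empty context
        return Command(update={"should_rag": False}, goto="synth")
    return Command(update={"should_rag": True}, goto="retrieve")

If you route with Command like this, drop the add_conditional_edges("router", ...) call from the example; the goto replaces it.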

Parallelism and “cancel the loser”

  • You can fan out cheap steps (e.g., retrieve and get_document_insights) in parallel and then fan in before synth. Use a deferred aggregator or route to synth once required state is present.
  • LangGraph executes parallel branches per superstep. There isn’t preemptive cancellation of sibling nodes at the framework level; instead, design the aggregator to ignore late/optional results or use timeouts inside nodes. See parallel execution and defer=True for fan-in control.
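
A minimal fan-out/fan-in sketch on the builder from the example above; get_insights_node and rerank_node are assumed wrappers, and the defer flag on add_node depends on your langgraph version:

# Fan out retrieval and document insights in the same superstep, fan in before one synth call
b.add_node("retrieve", retrieve)
b.add_node("insights", get_insights_node)   # assumption: wraps get_document_insights
b.add_node("rerank", rerank_node)           # assumption: optional extra hop on the retrieval branch
b.add_node("synth", synth, defer=True)      # defer: wait until all incoming branches have finished

b.add_edge(START, "retrieve")
b.add_edge(START, "insights")
b.add_edge("retrieve", "rerank")
b.add_edge("rerank", "synth")
b.add_edge("insights", "synth")             # shorter branch; defer prevents synth from firing early
b.add_edge("synth", END)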

Caching on LangGraph Platform

Consider three layers:

  • LLM response cache (LangChain): set a global cache so identical prompts+inputs return instantly across processes.
    • Example (Redis):
      import redis
      from langchain.globals import set_llm_cache
      from langchain_community.cache import RedisCache
      set_llm_cache(RedisCache(redis_=redis.Redis(host="<redis-host>", port=6379)))
      
  • Retrieval cache (LangGraph node cache): use CachePolicy(ttl=...) on the retrieve node and compile with cache=.... Key includes node input and config; add your filters (user_id, form_id, embedding_dim) to the state so they’re part of the cache key.
  • Embedding cache: prefer caching at ingestion or use deterministic IDs to avoid recomputing; for on-the-fly embeddings, wrap your embedder with a small KV cache (e.g., Redis) keyed by text hash.

On Platform, InMemoryCache is per-process. Use SqliteCache for simple persistence or connect LangChain’s LLM cache to a shared store (e.g., Redis). Retrieval results are excellent candidates for node-level TTL caches; keep LLM caches shorter and scoped, since prompts often change.
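
For the retrieval-node cache specifically, a sketch of putting the filters into the state so they become part of the node’s cache key (field names are assumptions; extend the State from the example above):

class State(TypedDict):
    messages: Annotated[List[AnyMessage], add_messages]
    query: str
    user_id: str                 # filters live in state, so they are part of the node input
    form_id: str
    embedding_dimension: int
    retrieved_documents: Annotated[List[Dict[str, Any]], operator.add]
    should_rag: bool

def retrieve(state: State):
    # Two requests with the same query but different user_id/form_id will not share a cache
    # entry, because the node cache key is derived from the node's input.
    results = search_qdrant(                 # assumption: your existing Qdrant search helper
        query=state["query"],
        user_id=state["user_id"],
        form_id=state["form_id"],
        embedding_dimension=state["embedding_dimension"],
    )
    return {"retrieved_documents": results}

b.add_node("retrieve", retrieve, cache_policy=CachePolicy(ttl=300))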

Should you rewrite the query first?

  • If retrieval quality is sensitive to messy queries, use a conditional normalizer:
    • Default: skip rewrite and retrieve immediately.
    • If top-1 score < threshold or no results, then run a fast rewrite (small model) and retry retrieval.
  • This keeps the common path at one LLM call while preserving quality when needed.
from langchain.chat_models import init_chat_model
FAST = init_chat_model("openai:gpt-4o-mini", temperature=0)

def maybe_rewrite(state: State):
    # Only rewrite when needed (e.g., detected low recall previously)
    q = state["query"]
    rewritten = FAST.invoke([
        ("system", "Rewrite the query to maximize vector recall; keep meaning."),
        ("user", q),
    ])
    return {"query": rewritten.content}

Wire this into the graph after you detect low recall, not on the happy path.
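
One way to wire it, replacing the plain retrieve → synth edge from the graph above (the 0.35 threshold is an illustrative assumption):

from typing import Literal

def route_after_retrieve(state: State) -> Literal["maybe_rewrite", "synth"]:
    docs = state.get("retrieved_documents", [])
    top_score = max((d.get("score", 0.0) for d in docs), default=0.0)
    # Decent recall -> single synth call; low recall -> rewrite with the fast model and retry
    return "synth" if docs and top_score >= 0.35 else "maybe_rewrite"

b.add_node("maybe_rewrite", maybe_rewrite)
b.add_conditional_edges("retrieve", route_after_retrieve, ["maybe_rewrite", "synth"])
b.add_edge("maybe_rewrite", "retrieve")
# To cap this at a single retry, track a flag like state["rewritten"] and route to synth once it is set.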


Big thanks! I appreciate what you wrote! I’ll consider some parts.
