LangGraph Parallelism

I’m seeing a discrepancy between graph topology and execution reality. My LangGraph has a clear fan-out from initialize to four parallel paths (including the guardrail and preprocess subgraphs), yet my Langfuse traces show a “staircase” effect instead of overlapping spans.

Despite using precompiled graphs to eliminate overhead, the nodes aren’t executing concurrently. Specifically:

  • The Execution Gap: In the preprocess subgraph, topic_selection (9.50s) starts only after extraction (3.84s) finishes.

  • Subgraph Overhead: Is this “staircase” an inherent LangGraph mechanism where the orchestrator must checkpoint state before starting the next parallel task?

  • Non-Blocking Traces: I’ve confirmed Langfuse tracing isn’t the bottleneck, so the delay is happening within the graph’s task coordination.

Has anyone else faced this sequential execution issue with parallel nodes?

I’m looking for recommendations on how to achieve true 0-gap concurrency.

  • Is this a State-locking behavior or a Python GIL/Async limitation?

  • Any specific search terms or configurations to fix this?

  • If you have a Langfuse trace where parallel bars actually overlap, I’d love to see your setup.

Hi @theodevs,

Parallel edges in LangGraph define only the topology; whether branches actually overlap depends on the runtime (async vs. sync) and on the nodes being truly async and non-blocking. Parallel edges alone do not ensure parallel execution.

You can also search for “LangGraph parallel edges async execution”, “RunnableParallel”, and “async node execution LangGraph” in the LangChain/LangGraph async execution docs. Avoid blocking code and make sure all LLM/tool calls are fully async for true 0-gap overlap; otherwise, the staircase is typically runtime behavior rather than state-locking.
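To see why, the effect can be reproduced without LangGraph at all. Here is a stdlib-only asyncio sketch (not LangGraph code) contrasting the two cases: branches that truly await overlap, while branches whose bodies block the event loop serialize into exactly the staircase described above.

```python
import asyncio
import time

async def fake_llm_call(delay: float) -> str:
    # Non-blocking: yields the event loop while "waiting" on I/O.
    await asyncio.sleep(delay)
    return "ok"

def blocking_llm_call(delay: float) -> str:
    # Blocking: holds the event loop for the whole duration.
    time.sleep(delay)
    return "ok"

async def blocking_branch(delay: float) -> str:
    # Looks async on the outside, but the body never awaits.
    return blocking_llm_call(delay)

async def demo() -> tuple[float, float]:
    # Fan-out of two truly async branches: total ~= max(delays).
    t0 = time.perf_counter()
    await asyncio.gather(fake_llm_call(0.2), fake_llm_call(0.2))
    overlapped = time.perf_counter() - t0

    # Same fan-out with blocking bodies: total ~= sum(delays),
    # i.e. the "staircase" seen in the trace.
    t0 = time.perf_counter()
    await asyncio.gather(blocking_branch(0.2), blocking_branch(0.2))
    staircase = time.perf_counter() - t0
    return overlapped, staircase

overlapped, staircase = asyncio.run(demo())
print(f"overlapping: {overlapped:.2f}s  staircase: {staircase:.2f}s")
```

The same principle applies inside LangGraph nodes: one hidden synchronous call anywhere in a branch is enough to serialize the whole fan-out.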

Reference: Durable execution - Docs by LangChain
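If some step inside a node genuinely must stay synchronous (a sync SDK, CPU-bound parsing), one common pattern is to offload it to a worker thread so sibling branches can still overlap. A stdlib-only sketch, where `blocking_work` is a hypothetical stand-in for the sync call:

```python
import asyncio
import time

def blocking_work(delay: float) -> str:
    # Stand-in for a synchronous SDK or library call.
    time.sleep(delay)
    return "done"

async def node(delay: float) -> str:
    # Run the blocking call in a worker thread so the event
    # loop stays free and sibling branches can still overlap.
    return await asyncio.to_thread(blocking_work, delay)

async def run_fanout() -> tuple[list[str], float]:
    t0 = time.perf_counter()
    results = await asyncio.gather(node(0.2), node(0.2))
    return results, time.perf_counter() - t0

results, elapsed = asyncio.run(run_fanout())
print(results, f"{elapsed:.2f}s")
```

With the offload, two 0.2s branches finish in roughly 0.2s total instead of 0.4s.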

hi @Bitcot_Kaushal

As a reference, here is the simple code structure I’m using:

  1. Guardrail Subgraph

from typing import TypedDict

from langchain_openai import ChatOpenAI
from langfuse import observe  # langfuse v3; on v2: from langfuse.decorators import observe
from langgraph.graph import StateGraph, START, END

class GuardrailInput(TypedDict):
    question: str

class GuardrailOutput(TypedDict):
    output: str

@observe()
async def guardrail(state: GuardrailInput) -> GuardrailOutput:
    # Write to OverallState
    model = ChatOpenAI(
        model=settings.local_chat.large_model,
        api_key=settings.local_chat.large_api_key,
        base_url=settings.local_chat.large_api_base,
    )
    result = await model.ainvoke(f"Return true or false: is this user query negative? {state.get('question')}")
    output = result.content
    return {"output": output}


guardrail_graph = StateGraph(GuardrailInput, output_schema=GuardrailOutput)
guardrail_graph.add_node("guardrail", guardrail)
guardrail_graph.add_edge(START, "guardrail")
guardrail_graph.add_edge("guardrail", END)

guardrail_graph = guardrail_graph.compile()

  2. Short Answer Subgraph
class shortInput(TypedDict):
    question: str

class shortOutput(TypedDict):
    output: str


@observe()
async def short(state: shortInput) -> shortOutput:
    model = ChatOpenAI(
        model=settings.local_chat.large_model,
        api_key=settings.local_chat.large_api_key,
        base_url=settings.local_chat.large_api_base,
    )
    # Write to OverallState
    result = await model.ainvoke(f"Answer this user query in 30 sentences: {state.get('question')}")
    output = result.content
    return {"output": output}


short_graph = StateGraph(shortInput, output_schema=shortOutput)
short_graph.add_node("short", short)

short_graph.add_edge(START, "short")
short_graph.add_edge("short", END)

short_graph = short_graph.compile()

  3. Main Graph Calling the Subgraphs
class MainInput(TypedDict):
    question: str
    guardrail: str
    short: str

class MainOutput(TypedDict):
    guardrail: str
    short: str

@observe()
async def guardrail(state: MainInput) -> MainOutput:
    # Write to OverallState

    result = await guardrail_graph.ainvoke({'question': state.get('question')})

    # The subgraph returns a GuardrailOutput dict; store the string, not the dict
    return {"guardrail": result["output"]}


@observe()
async def short(state: MainInput) -> MainOutput:
    # Write to OverallState
    result = await short_graph.ainvoke({'question': state.get('question')})
    return {"short": result["output"]}

@observe()
async def orchestrator(state: MainInput) -> MainOutput:
    # "state" is not a key in MainOutput; pass the merged results through
    print(state)
    return {"guardrail": state.get("guardrail"), "short": state.get("short")}


main_graph = StateGraph(MainInput, output_schema=MainOutput)
main_graph.add_node("guardrail", guardrail)
main_graph.add_node("short", short)
main_graph.add_node("orchestrator", orchestrator)

main_graph.add_edge(START, "short")
main_graph.add_edge(START, "guardrail")
main_graph.add_edge('short', 'orchestrator')
main_graph.add_edge('guardrail', "orchestrator")
main_graph.add_edge("orchestrator", END)

main_graph = main_graph.compile()

This is how I start my code:

@observe()
async def main():
    result = await main_graph.ainvoke({"question": 'hi'})
    return result

# Top-level await works in a notebook / async REPL; in a plain script use asyncio.run(main())
result = await main()

This is the running result:

I’m seeing a significant latency gap in my short node within this LangGraph implementation. I’ve confirmed that Langfuse is not the cause, and I am already using asynchronous ainvoke throughout the process.

  1. Orchestration Overhead: Is this delay an inherent behavior of nesting StateGraph objects, or is there a bottleneck in how I’ve structured these parallel transitions to the orchestrator?

  2. Architecture Feedback: Do you see any issues with how MainInput and MainOutput are managing the merged state that could be causing a processing lag?

Hi @theodevs

What local provider and model are you using? Ollama?