Summarization Middleware

I am posting with a new query. I have a multi-agent system flow using create_agent as my supervisor, with the other sub_agents exposed as tools.

The supervisor agent is defined as follows:

supervisor = create_agent(
    model=model,
    system_prompt=supervisor_prompt,
    tools=[fetch_patient_information_tool, manage_appointment_tool, book_appointment_tool],
    checkpointer=checkpointer,
    name="supervisor_agent",
    middleware=[
        SummarizationMiddleware(
            model=model,
            trigger=("tokens", 4000),
            keep=("messages", 5),
        ),
    ],
)

The sub_agent used in fetch_patient_information_tool is defined as follows:

return create_agent(
        model=model,
        tools=tools,
        system_prompt=prompts['patient_information_retrieval_prompt'].format(LANGUAGE_ID=language_id),
        name="patient_information_retrieval_agent",
        checkpointer=checkpointer,
        middleware=[
            SummarizationMiddleware(
                model=model,
                trigger=("tokens", 4000),
                keep=("messages", 5),
            ),
        ],
    )

The other sub_agents are defined similarly.

I have configured SummarizationMiddleware to trigger when the token count goes above 4000, but in my case the token count reaches approximately 5k and the middleware still does not trigger.

I need guidance on what the issue could be, and also on the best approaches to limit token usage.

Note: I am using Groq as the model provider.

Thanks in Advance.

Hi @razaullah

I have two thoughts:

1) It triggers before the next model call, not immediately when you cross 4000

SummarizationMiddleware runs in before_model(...) (i.e., it checks the message history right before calling the LLM). If you observe “the last request used ~5k prompt tokens”, that can be the request that just happened; the summarization would only run on the next LLM call in that thread/run.

Practical debug check: add logging right before you call agent.invoke(...) to print the current token count for the messages you’re about to send, not the tokens reported after the call.
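A minimal pure-Python sketch of that check, mimicking the default character-based heuristic (the dict-based history and the 4-characters-per-token ratio are illustrative, not Groq's real tokenizer):

```python
# Rough stand-in for the middleware's default character-based counting,
# so you can log the history size right before agent.invoke(...).
def approx_token_count(messages: list[dict]) -> int:
    chars = sum(len(m.get("content", "")) for m in messages)
    return chars // 4  # ~4 characters per token is a common rough heuristic

history = [
    {"role": "user", "content": "What appointments do I have this week?"},
    {"role": "assistant", "content": "Let me check your schedule."},
]
print("approx tokens about to be sent:", approx_token_count(history))
```

If this number is well under 4000 while Groq reports ~5k prompt tokens, the gap is the overhead described in point 2 below.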

2) Your ~5k tokens is likely Groq’s full prompt tokens, while the middleware’s trigger is based on state messages only (and by default it’s approximate)

In the implementation, the tokens trigger is based on:

  • messages = state["messages"]
  • total_tokens = token_counter(messages) (default: count_tokens_approximately, i.e. character-based counting)

So the trigger does not include:

  • the agent’s system_prompt if it’s not stored in state["messages"]
  • tool schema definitions (often large)
  • any provider-side wrapper tokens / formatting overhead

This is why you can see prompt_tokens ~5000 in Groq usage, while the middleware still thinks the message history is under 4000 and does nothing.

Key doc detail: LangChain’s middleware docs explicitly note token_counter “defaults to character-based counting” and can be customized (LangChain docs: Built-in middleware - Docs by LangChain).

What to do (concrete fixes)

A) If you want something reliable across providers: trigger by message count

If your main goal is “don’t let history grow unbounded”, use:

  • trigger=("messages", N) (reliable)

  • and optionally keep=("messages", K) or keep=("tokens", K) depending on your preference

This avoids tokenizer/provider differences entirely.
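A sketch of that configuration (thresholds are illustrative; model is your existing Groq chat model, and the import path should be verified against your installed langchain version):

```python
from langchain.agents.middleware import SummarizationMiddleware

SummarizationMiddleware(
    model=model,
    trigger=("messages", 20),  # summarize once history exceeds 20 messages
    keep=("messages", 5),      # keep the 5 most recent messages verbatim
)
```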

B) If you truly want token-based triggers with Groq: pass a custom token_counter

Per the docs, SummarizationMiddleware(..., token_counter=...) is supported. For Groq Llama/Qwen-family models, the most accurate approach is to use the matching HuggingFace tokenizer and count tokens on the rendered conversation.

from transformers import AutoTokenizer
from langchain_core.messages.utils import get_buffer_string

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")  # match your Groq model family

def groq_token_counter(messages) -> int:
    text = get_buffer_string(messages)
    return len(tokenizer.encode(text))

SummarizationMiddleware(
    model=model,
    trigger=("tokens", 4000),
    keep=("messages", 5),
    token_counter=groq_token_counter,
)

This won’t perfectly match provider-side “prompt token” accounting (because tool schemas/system prompts may still be outside state["messages"]), but it’s substantially closer than the default character heuristic.

C) Reduce tokens that summarization cannot remove (system prompt + tool bloat)

In multi-agent systems, a lot of prompt tokens come from tool definitions + big system prompts, not just message history. Summarizing messages won’t help much if the “fixed overhead” is the problem.

Two best-practice middleware patterns from the official docs that directly target that:

  • Use an LLM tool selector to reduce how many tool definitions are “in play” on each turn (LangChain docs: Built-in middleware - Docs by LangChain → “LLM tool selector”).
  • Use context editing to clear older tool outputs (tool outputs can be massive) (LangChain docs: Built-in middleware - Docs by LangChain → “Context editing” / ClearToolUsesEdit).
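A wiring sketch combining both (the class and parameter names here follow the docs referenced above but are assumptions on my side; verify them against your installed version):

```python
from langchain.agents.middleware import (
    ClearToolUsesEdit,
    ContextEditingMiddleware,
    LLMToolSelectorMiddleware,
)

middleware = [
    # Clear older ToolMessage contents, keeping only the most recent tool results
    ContextEditingMiddleware(edits=[ClearToolUsesEdit(keep=3)]),
    # Ask a (cheap) model to pick a small tool subset before each turn
    LLMToolSelectorMiddleware(model=model, max_tools=3),
]
```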

D) Put hard bounds on generation

Even with perfect input trimming, output can balloon. ChatGroq supports max_tokens (see instantiation example in Groq integration docs: ChatGroq - Docs by LangChain). Set it to a sane upper bound for your use case.
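For example (the model name is illustrative; use whichever Groq model you actually run):

```python
from langchain_groq import ChatGroq

model = ChatGroq(
    model="llama-3.3-70b-versatile",  # illustrative model name
    max_tokens=1024,  # hard cap on generated tokens per call
)
```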

Quick checklist to diagnose your exact case

  1. Print the middleware’s view of history size: “what does token_counter(state["messages"]) return right before the call?”
  2. Compare to Groq’s reported usage_metadata["input_tokens"] for that call.
    • If Groq >> middleware: you’re paying for system prompt/tool schema overhead; summarize/trim tools, not just messages.
  3. If you only see the threshold exceeded after the call: expect summarization to happen on the next LLM call (because it runs in before_model).

@pawel-twardziak Thank you for your response. As you mentioned that system prompts and tool schemas are outside of state["messages"], can you clarify what is stored inside state["messages"]? Also, my tool responses lead to the context growing; what are the best ways to handle that too?

Note: I am using openai/gpt-oss-120b model.

Hi @razaullah

1) What is stored inside state["messages"]?

state["messages"] is the agent’s conversation history as a list of LangChain message objects.

In the agent state schema, it’s defined as messages: list[AnyMessage] (with LangGraph’s add_messages reducer).

What kinds of messages are in it

AnyMessage includes the normal chat message types you see in agent loops, e.g.:

  • Human messages (user inputs)
  • AI messages (model outputs; may include tool calls)
  • Tool messages (tool results; appended after tools execute)
  • System messages can exist as message objects in general, but in create_agent the system prompt is handled separately

You can see in create_agent’s docstring that the loop works by:

  • model produces an AIMessage with tool_calls
  • tools run and their outputs are added as ToolMessage objects
  • model is called again with the updated message list
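Concretely, after one such turn the history looks roughly like this (shown as plain dicts for illustration; in real state the entries are HumanMessage / AIMessage / ToolMessage objects, and the ids/args are made up):

```python
# Illustrative snapshot of state["messages"] after one tool-calling turn.
state_messages = [
    {"type": "human", "content": "What is my next appointment?"},
    {"type": "ai", "content": "", "tool_calls": [
        {"name": "fetch_patient_information_tool",
         "args": {"patient_id": "p-123"}, "id": "call_1"},
    ]},
    # Tool result is appended to the history, so it IS counted by the trigger
    {"type": "tool", "content": "Next appointment: Monday 10:00",
     "tool_call_id": "call_1"},
]
print([m["type"] for m in state_messages])  # → ['human', 'ai', 'tool']
```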

Why system prompts / tool schemas often aren’t counted in state["messages"]

Two key implementation details:

  • The middleware (including SummarizationMiddleware) reads only state["messages"]
  • The model call uses a separate system_message field in the ModelRequest, and only prepends it at call time. Also ModelRequest.messages is explicitly documented as “excluding system message”

So: state["messages"] is the conversation + tool-results history, but the system prompt may be outside it, and tool schemas are not messages at all (they’re passed via the tools bindings at model-call time).

2) Tool responses increase context - best ways to handle it

A) Use Context Editing to clear older tool outputs (best “drop-in” fix)

LangChain provides ContextEditingMiddleware with ClearToolUsesEdit, which is specifically designed to remove older tool outputs while preserving the most recent N tool results.
Docs: Built-in middleware → Context editing (section “Context editing” / ClearToolUsesEdit).
This directly targets your stated pain: large ToolMessage content bloating context.

B) Use SummarizationMiddleware, but recognize what it can/can’t shrink

SummarizationMiddleware summarizes older messages (and then keeps only a configured suffix). It triggers based on counting state["messages"] (default token counting is approximate; the docs call this out).

This helps with tool results only insofar as those tool results are inside state["messages"] (they are), but it won’t reduce tool schema overhead.

C) Reduce tool schema and selection overhead (prevents tool bloat from being injected every turn)

If you have many tools, passing all tool definitions every call can be expensive. A strong pattern is selecting only a small subset of tools per turn.
LangChain supports this via LLMToolSelectorMiddleware.

D) Make tool outputs “reference-based” (architectural best practice)

Even with middleware, the cheapest tokens are the ones you never generate:

  • Have tools store large payloads externally (DB / object storage / vector store)
  • Return a short summary + an ID/link, not the full raw text
  • For retrieval tools, return top-k small snippets, not entire documents
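A hypothetical sketch of that pattern (fetch_patient_record, payload_store, and the record contents are all made-up names for illustration):

```python
import uuid

payload_store: dict[str, str] = {}  # stand-in for a DB / object store

def fetch_patient_record(patient_id: str) -> dict:
    full_record = "... thousands of characters of raw record data ..."
    ref_id = str(uuid.uuid4())
    payload_store[ref_id] = full_record  # keep the bulk out of the context
    return {
        # Only this short summary enters state["messages"] as a ToolMessage
        "summary": f"Record for {patient_id}: 2 upcoming appointments.",
        "ref_id": ref_id,  # another tool can fetch the full payload by ID
    }

result = fetch_patient_record("p-123")
print(result["summary"])
```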

Note about your model (openai/gpt-oss-120b)

Your provider change doesn’t change what state["messages"] contains. It may, however, affect how accurately token triggers work (because SummarizationMiddleware defaults to approximate counting unless you override token_counter).

@pawel-twardziak Your suggestions worked for me. After some interactions I can see that SummarizationMiddleware does trigger once the tokens from the Human, AI, and tool messages exceed the limit. Before, I assumed the system prompt should also be included in the token count, but as you mentioned it is not part of state["messages"].

Thank you very much for your time and guidance.


Also, @pawel-twardziak, would it be advisable to use the TOON format, particularly for tool return data, as a means to minimize token usage compared to JSON? I’m interested in understanding whether this would have a significant impact on token consumption, as well as any potential drawbacks or limitations of that format.

hi @razaullah

TOON format is quite new, neither battle-tested nor as stable as JSON. It’s not yet implemented in the LangChain ecosystem. If it were, there might be a slight token reduction, but I wouldn’t expect any significant impact.