Hi @razaullah
I have two thoughts:
1) It triggers before the next model call, not immediately when you cross 4000 tokens
SummarizationMiddleware runs in before_model(...) (i.e., it checks the message history right before calling the LLM). If you observe “the last request used ~5k prompt tokens”, that can be the request that just happened; the summarization would only run on the next LLM call in that thread/run.
Practical debug check: add logging right before you call agent.invoke(...) to print the current token count for the messages you’re about to send, not the tokens reported after the call.
2) Your ~5k tokens is likely Groq’s full prompt tokens, while the middleware’s trigger is based on state messages only (and by default it’s approximate)
In the implementation, the tokens trigger is based on:
```python
messages = state["messages"]
total_tokens = token_counter(messages)  # default: count_tokens_approximately, i.e. character-based counting
```
So the trigger does not include:
- the agent’s system_prompt if it’s not stored in state["messages"]
- tool schema definitions (often large)
- any provider-side wrapper tokens / formatting overhead
This is why you can see prompt_tokens ~5000 in Groq usage, while the middleware still thinks the message history is under 4000 and does nothing.
Key doc detail: LangChain’s middleware docs explicitly note token_counter “defaults to character-based counting” and can be customized (LangChain docs: Built-in middleware - Docs by LangChain).
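To see how coarse that heuristic is, here is a rough stand-in (assumption: ~4 characters per token, mirroring the docs’ “character-based counting” description; the library’s exact constants and per-message overheads differ):

```python
import math

def approx_tokens(text: str, chars_per_token: int = 4) -> int:
    # Character-based estimate: token count ≈ ceil(len(text) / chars_per_token).
    # Real tokenizers often count more for code, JSON, or non-English text.
    return math.ceil(len(text) / chars_per_token)

history = "user: deploy status?\nassistant: all green.\n" * 100
print(approx_tokens(history))  # the kind of estimate the trigger compares to 4000
```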
What to do (concrete fixes)
A) If you want something reliable across providers: trigger by message count
If your main goal is “don’t let history grow unbounded”, use:
- trigger=("messages", N) (reliable)
- and optionally keep=("messages", K) or keep=("tokens", K), depending on your preference
This avoids tokenizer/provider differences entirely.
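A sketch of that configuration (assumptions: the tuple-style trigger/keep options, the langchain v1 import path, and N=30 / K=10 are arbitrary values — tune them to your history sizes):

```python
# Assumption: import path per langchain v1 middleware; adjust to your version.
from langchain.agents.middleware import SummarizationMiddleware

middleware = SummarizationMiddleware(
    model=model,               # your summarization model
    trigger=("messages", 30),  # summarize once history exceeds 30 messages
    keep=("messages", 10),     # keep the 10 most recent messages verbatim
)
```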
B) If you truly want token-based triggers with Groq: pass a custom token_counter
Per the docs, SummarizationMiddleware(..., token_counter=...) is supported. For Groq Llama/Qwen-family models, the most accurate approach is to use the matching HuggingFace tokenizer and count tokens on the rendered conversation.
```python
from transformers import AutoTokenizer
from langchain_core.messages.utils import get_buffer_string

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")  # match your Groq model family

def groq_token_counter(messages) -> int:
    text = get_buffer_string(messages)
    return len(tokenizer.encode(text))

SummarizationMiddleware(
    model=model,
    trigger=("tokens", 4000),
    keep=("messages", 5),
    token_counter=groq_token_counter,
)
```
This won’t perfectly match provider-side “prompt token” accounting (because tool schemas/system prompts may still be outside state["messages"]), but it’s substantially closer than the default character heuristic.
C) Reduce tokens that summarization cannot remove (system prompt + tool bloat)
In multi-agent systems, a lot of prompt tokens come from tool definitions + big system prompts, not just message history. Summarizing messages won’t help much if the “fixed overhead” is the problem.
The official middleware docs include patterns that directly target that fixed overhead — reducing what gets bound to the model per call, rather than compressing message history.
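One pattern worth trying (my sketch, not a specific doc recipe): bind only the subset of tools each specialized agent actually needs, so every call doesn’t pay for the full tool-schema catalog. `bind_tools` is the standard chat-model API; all tool names here are hypothetical:

```python
# Hypothetical tools — the point is binding a per-agent subset, not all of them.
research_tools = [search_tool, fetch_page_tool]
ops_tools = [deploy_tool, rollback_tool]

# Each specialized agent only carries the schemas it needs:
research_model = model.bind_tools(research_tools)
ops_model = model.bind_tools(ops_tools)
```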
D) Put hard bounds on generation
Even with perfect input trimming, output can balloon. ChatGroq supports max_tokens (see instantiation example in Groq integration docs: ChatGroq - Docs by LangChain). Set it to a sane upper bound for your use case.
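For instance (assumptions: the model name is a placeholder and a GROQ_API_KEY is set in your environment):

```python
from langchain_groq import ChatGroq

model = ChatGroq(
    model="llama-3.3-70b-versatile",  # placeholder: use your Groq model
    max_tokens=1024,                  # hard cap on completion length per call
    temperature=0,
)
```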
Quick checklist to diagnose your exact case
- Print the middleware’s view of history size: what does token_counter(state["messages"]) return right before the call?
- Compare that to Groq’s reported usage_metadata["input_tokens"] for the same call.
- If Groq >> middleware: you’re paying for system-prompt/tool-schema overhead; summarize/trim tools, not just messages.
- If you only see the threshold exceeded after the call: expect summarization to happen on the next LLM call (because it runs in before_model).
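Putting the first two checklist items together, a sketch (assumptions: `agent` is your compiled agent with the middleware attached; swap in `groq_token_counter` from section B if you configured one):

```python
from langchain_core.messages import HumanMessage
from langchain_core.messages.utils import count_tokens_approximately

messages = [HumanMessage("What changed since the last summary?")]

before = count_tokens_approximately(messages)  # the middleware's view
result = agent.invoke({"messages": messages})

# Provider's view of the same call (includes system prompt + tool schemas):
reported = result["messages"][-1].usage_metadata["input_tokens"]
print(f"middleware saw ~{before}, Groq billed {reported} input tokens")
```

A large gap between the two numbers is the fixed overhead that summarization can never remove.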