Deep Agents: Clarify Multimodal (Image) Context Management and Compression

Currently, Deep Agents provide mechanisms like`context.compact`and summarization middleware to manage the textual conversation history within the context window. However, with the increasing use of multimodal LLMs and the ability to send images to agents, the handling of image-based context for compression or reduction is unclear.

Images can consume a significant portion of the context window. It’s important for building efficient and scalable multimodal applications to understand if or how Deep Agents’ context management systems address this.

Questions:

  1. Does the existing context.compact system or any other built-in summarization feature in Deep Agents (or LangChain generally when used with Deep Agents) apply to or reduce the token consumption of image inputs?

  2. If not, what are the recommended strategies or best practices for managing image context to prevent exceeding LLM token limits when using Deep Agents with multimodal inputs?

  3. Are there plans to introduce specific features for image context compression or intelligent handling of multimodal content within the context window?

Clarification on this would greatly assist developers in optimizing multimodal Deep Agent applications for token efficiency and performance.

Hey @m7mdhka welcome to the langchain community.

How token counting works

By Default SummarizationMiddleware uses count_tokens_approximately which is text-based approximation and for images block it just add a flat 85-token penalty per image which results in systematically undercount if your images are anything other than small, low-res.

What actually happens when summarization fires

When the trigger does fire, the images present in the “to-summarize” partition ARE dropped and replaced by a text summary. So summarization does remove images from older turn but the path to getting there is unreliable for image-heavy sessions, and the keep portion (recent messages) always retains images fully.

The backend offloading path loses image data

When SummarizationMiddleware, offloads filtered messages to backend, image blocks (especially base64-embedded ones) are not preserved in a meaningful, recoverable way in this offload.

What does the FilesystemMiddleware do with images

The large-tool-result eviction logic in FilesystemMiddleware explicitly excludes image blocks from the size measurement. _extract_text_from_message only counts text. This means a ToolMessage containing only an image block - no matter how large - will never be evicted

Since no built-in image context management exists, you can try following:

a) Use URL references instead of base64 embedding. Pass image URLs in HumanMessage / tool results rather than embedding base64 data. This keeps the token cost down to ~tens of tokens per image (the URL string) rather than thousands. This is the single highest-impact change.

b) Tune the summarization trigger lower when using multimodal inputs. Since count_tokens_approximately underestimates image token cost, set trigger to a lower fraction (e.g., ("fraction", 0.5) rather than the default 0.85) to compensate. This is a blunt instrument but easy to configure.

c) For tool results that return images (e.g., screenshots, charts), store the image to the backend and return a URL/path as the tool result. This avoids base64 entering the message history entirely.

d) Update custom token_counter, to tune tokens_per_image for your actual image sizes. The simplest approach. count_tokens_approximately exposes tokens_per_image as a parameter. You can wrap it with a higher value that matches your workload.

e) Write a custom AgentMiddleware that strips or downsizes image blocks before the model call. You can intercept in wrap_model_call / awrap_model_call, walk the effective message list, and replace old image blocks with a text placeholder like [image evicted — {description}]. This is essentially what FilesystemMiddleware does for text, but for images.

Good Question, perhaps you can initiate this as an issue in deepagents repo. But I cannot answer that since this is a question for official deepagent maintainers.

FYI @mdrxy

I hope this helps!!!