Deep Agents: Clarify Multimodal (Image) Context Management and Compression

m7mdhka · April 20, 2026, 1:22pm

Currently, Deep Agents provide mechanisms like`context.compact`and summarization middleware to manage the textual conversation history within the context window. However, with the increasing use of multimodal LLMs and the ability to send images to agents, the handling of image-based context for compression or reduction is unclear.

Images can consume a significant portion of the context window. It’s important for building efficient and scalable multimodal applications to understand if or how Deep Agents’ context management systems address this.

Questions:

Does the existing context.compact system or any other built-in summarization feature in Deep Agents (or LangChain generally when used with Deep Agents) apply to or reduce the token consumption of image inputs?
If not, what are the recommended strategies or best practices for managing image context to prevent exceeding LLM token limits when using Deep Agents with multimodal inputs?
Are there plans to introduce specific features for image context compression or intelligent handling of multimodal content within the context window?

Clarification on this would greatly assist developers in optimizing multimodal Deep Agent applications for token efficiency and performance.

keenborder786 · April 24, 2026, 11:34pm

Hey @m7mdhka welcome to the langchain community.

How token counting works

By Default SummarizationMiddleware uses count_tokens_approximately which is text-based approximation and for images block it just add a flat 85-token penalty per image which results in systematically undercount if your images are anything other than small, low-res.

What actually happens when summarization fires

When the trigger does fire, the images present in the “to-summarize” partition ARE dropped and replaced by a text summary. So summarization does remove images from older turn but the path to getting there is unreliable for image-heavy sessions, and the keep portion (recent messages) always retains images fully.

The backend offloading path loses image data

When SummarizationMiddleware, offloads filtered messages to backend, image blocks (especially base64-embedded ones) are not preserved in a meaningful, recoverable way in this offload.

What does the `FilesystemMiddleware` do with images

The large-tool-result eviction logic in FilesystemMiddleware explicitly excludes image blocks from the size measurement. _extract_text_from_message only counts text. This means a ToolMessage containing only an image block - no matter how large - will never be evicted

Since no built-in image context management exists, you can try following:

a) Use URL references instead of base64 embedding. Pass image URLs in HumanMessage / tool results rather than embedding base64 data. This keeps the token cost down to ~tens of tokens per image (the URL string) rather than thousands. This is the single highest-impact change.

b) Tune the summarization trigger lower when using multimodal inputs. Since count_tokens_approximately underestimates image token cost, set trigger to a lower fraction (e.g., ("fraction", 0.5) rather than the default 0.85) to compensate. This is a blunt instrument but easy to configure.

c) For tool results that return images (e.g., screenshots, charts), store the image to the backend and return a URL/path as the tool result. This avoids base64 entering the message history entirely.

d) Update custom token_counter, to tune tokens_per_image for your actual image sizes. The simplest approach. count_tokens_approximately exposes tokens_per_image as a parameter. You can wrap it with a higher value that matches your workload.

e) Write a custom AgentMiddleware that strips or downsizes image blocks before the model call. You can intercept in wrap_model_call / awrap_model_call, walk the effective message list, and replace old image blocks with a text placeholder like [image evicted — {description}]. This is essentially what FilesystemMiddleware does for text, but for images.

Good Question, perhaps you can initiate this as an issue in deepagents repo. But I cannot answer that since this is a question for official deepagent maintainers.

FYI @mdrxy

I hope this helps!!!

mdrxy · May 28, 2026, 6:34pm

Thanks for the detailed writeup here — this matches my understanding.

Today, Deep Agents does not have a dedicated image-context compression mechanism. The existing text-oriented summarization / compaction paths can remove older messages that contain image blocks once those messages fall into the summarized partition, but they do not intelligently resize, summarize, or preserve images as reusable visual context. Recent messages that are kept in context will still include the images as-is.

So the practical recommendations right now are:

Prefer URLs or file/backend references over base64 image payloads in message history.
Store generated screenshots / charts / images externally and pass references back to the agent.
Use a custom token counter if your workload is image-heavy, since approximate token counting may not reflect actual provider-side image cost.
Consider custom middleware that replaces older image blocks with text summaries or placeholders before model calls.
Tune summarization thresholds more conservatively for multimodal workloads.

We can probably improve the docs here, since this distinction is not very obvious today: existing context management is mostly text/message-history oriented, not true multimodal compression. I’ll look into adding clearer guidance around image inputs, token accounting, and recommended patterns for storing/referencing images rather than keeping large image payloads directly in context.

Topic		Replies	Views
Summarization Middleware Talking Shop	6	336	January 26, 2026
Strategies for Context Management Talking Shop	2	547	October 16, 2025
About a scenario of whether to activate deep_agent Deep Agents python-help	2	128	March 13, 2026
Complete context compression through middleware LangChain intro-to-langgraph , python-help	2	162	March 19, 2026
How to make an image tool? LangChain python-help	11	3600	June 9, 2026

Deep Agents: Clarify Multimodal (Image) Context Management and Compression

How token counting works

What actually happens when summarization fires

The backend offloading path loses image data

What does the FilesystemMiddleware do with images

Related topics

What does the `FilesystemMiddleware` do with images