Cache disable in Deepagent

How can I disable caching while using deepagents? I am using LiteLLM for that.

deepagents automatically adds cache headers, but the Haiku model does not support them.

{“error_code”:“BAD_REQUEST”,“message”:“{\“message\”:\“cache_control: Extra inputs are not permitted\”}”

hi @AmitPZepto

What I can see from the source code is that there is no public flag on create_deep_agent to turn the prompt-cache middleware off. It is appended unconditionally to the middleware stack (deepagents/graph.py:462, :520, :591):

# graph.py (deepagents)
gp_middleware.append(AnthropicPromptCachingMiddleware(unsupported_model_behavior="ignore"))
# ...
subagent_middleware.append(AnthropicPromptCachingMiddleware(unsupported_model_behavior="ignore"))
# ...
deepagent_middleware.append(AnthropicPromptCachingMiddleware(unsupported_model_behavior="ignore"))

The good news: that middleware is a no-op for any model that is not a ChatAnthropic instance - including ChatLiteLLM. So the right fix depends on why you are seeing cache_control headers at all.


How the cache middleware actually decides to fire

AnthropicPromptCachingMiddleware._should_apply_caching gates on an isinstance check:

# langchain_anthropic/middleware/prompt_caching.py
def _should_apply_caching(self, request: ModelRequest) -> bool:
    if not isinstance(request.model, ChatAnthropic):
        msg = (
            "AnthropicPromptCachingMiddleware caching middleware only supports "
            f"Anthropic models, not instances of {type(request.model)}"
        )
        if self.unsupported_model_behavior == "raise":
            raise ValueError(msg)
        if self.unsupported_model_behavior == "warn":
            warn(msg, stacklevel=3)
        return False
    ...

Because deepagents constructs it with unsupported_model_behavior="ignore", a non-Anthropic model silently skips caching - no cache_control block, no warning, no error. A real langchain_litellm.ChatLiteLLM instance would not get cache headers added.

So if you are seeing cache headers reach the wire, one of the following is true:

  1. You are passing a model string (e.g. "anthropic:claude-haiku-4-5") to create_deep_agent. Internally resolve_model() calls init_chat_model() (deepagents/_models.py:45), which constructs a ChatAnthropic - not LiteLLM. The middleware then does fire
  2. You are pointing ChatAnthropic at a LiteLLM proxy via base_url=.... It is still a ChatAnthropic instance, so cache headers are injected; whether the proxy forwards them correctly is a separate problem
  3. A custom profile added its own cache middleware - unlikely unless you wrote one
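
A quick way to tell which case applies is to look at the class of the model object that ends up in the agent. The sketch below uses a hypothetical helper, looks_like_chat_anthropic (not part of deepagents or LangChain), that matches on the class name in the MRO so it runs without langchain_anthropic installed. The two classes are stand-ins for the demo only; it is a diagnostic aid, not the middleware's own check.

```python
def looks_like_chat_anthropic(model: object) -> bool:
    """Return True if model is an instance of a class named ChatAnthropic.

    Matches by class name across the MRO, so subclasses count too.
    This only approximates the middleware's isinstance check.
    """
    return any(cls.__name__ == "ChatAnthropic" for cls in type(model).__mro__)


# Stand-in classes for the demo; in real code pass your actual model object.
class ChatAnthropic:
    pass


class ChatLiteLLM:
    pass


# Cases 1 and 2 above: the caching middleware would fire.
print(looks_like_chat_anthropic(ChatAnthropic()))
# A genuine LiteLLM wrapper: the middleware would skip it.
print(looks_like_chat_anthropic(ChatLiteLLM()))
```

If this reports a ChatAnthropic for the model you thought was LiteLLM, you are in case 1 or 2.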

Fix 1 - actually use ChatLiteLLM

If your intent is “talk to Haiku through LiteLLM”, instantiate ChatLiteLLM yourself and pass the instance (not a string). The cache middleware will see it is not ChatAnthropic and silently skip (unsupported_model_behavior="ignore").

from langchain_litellm import ChatLiteLLM
from deepagents import create_deep_agent

llm = ChatLiteLLM(model="claude-3-5-haiku-20241022", temperature=0)

agent = create_deep_agent(
    model=llm,            # pass the instance, NOT a string like "anthropic:..."
    tools=[...],
    system_prompt="...",
)

Docs: ChatLiteLLM integration.

With this setup there are no cache_control blocks in the outbound payload at all - verify by enabling LiteLLM debug logging (litellm._turn_on_debug()). If you still see them, your model is not what you think it is; inspect type(agent.nodes[...].runnable.model) or just print the bound chat model.
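
If you can capture the outbound request body as JSON (for example from the debug log), a small checker makes the verification less error-prone than reading it by eye. This is a minimal sketch: find_cache_control is a hypothetical helper, and the payload is an illustrative request in the Anthropic Messages shape, not a captured one.

```python
from typing import Any


def find_cache_control(node: Any, path: str = "$") -> list[str]:
    """Return the JSON path of every cache_control key found inside node."""
    hits: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "cache_control":
                hits.append(child)
            # Keep descending so nested blocks are also reported.
            hits.extend(find_cache_control(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_cache_control(item, f"{path}[{index}]"))
    return hits


# Illustrative payload only; substitute the body you captured.
payload = {
    "model": "claude-3-5-haiku-20241022",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "hi", "cache_control": {"type": "ephemeral"}}
            ],
        }
    ],
}

print(find_cache_control(payload))
```

An empty list means nothing in that payload carries a cache_control block; any path it prints tells you where the block was injected.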

Fix 2 - if you must keep ChatAnthropic but do not want caching

No public API exists today, so your options are:

(a) Subclass / replace the middleware. You cannot remove the deepagents-appended middleware, but the obvious thing to try is stripping the setting in a middleware of your own:

from langchain.agents.middleware.types import AgentMiddleware

class StripCacheControl(AgentMiddleware):
    def wrap_model_call(self, request, handler):
        # Remove cache_control that the Anthropic middleware just injected
        ms = dict(request.model_settings or {})
        ms.pop("cache_control", None)
        request = request.override(model_settings=ms)
        # also strip from system + tools if you need to be thorough
        return handler(request)

The catch is ordering. The AnthropicPromptCachingMiddleware is appended after user middleware= in graph.py:580-591, so your middleware wraps it from the outside. In the LangChain agents middleware model, wrap_model_call composes like an onion: the last-appended middleware runs closest to the model. Your middleware therefore sees the request before the cache middleware does, and anything you strip on the way in is re-added by the Anthropic middleware further down. The practical way to kill it is a monkey-patch:
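
The patch itself did not survive in this post. One way it could look, assuming the gate is the _should_apply_caching method quoted earlier, is to override that method so it always returns False. disable_prompt_caching is a hypothetical helper name, and FakeCachingMiddleware is a stand-in so the sketch runs without langchain_anthropic installed. Against the real library you would pass AnthropicPromptCachingMiddleware instead, and since this touches a private method it may break between releases.

```python
def disable_prompt_caching(middleware_cls: type) -> type:
    """Patch middleware_cls so its caching gate always answers False."""

    def _never_cache(self, request) -> bool:
        # Same call shape as the gate quoted above: (self, request) -> bool.
        return False

    middleware_cls._should_apply_caching = _never_cache
    return middleware_cls


# Stand-in for illustration only. The assumed real target would be
# langchain_anthropic.middleware.AnthropicPromptCachingMiddleware.
class FakeCachingMiddleware:
    def _should_apply_caching(self, request) -> bool:
        return True


disable_prompt_caching(FakeCachingMiddleware)

# Every instance, including ones created later, now skips caching.
print(FakeCachingMiddleware()._should_apply_caching(request=None))
```

Because the patch is applied to the class rather than to an instance, it also covers the instances deepagents creates internally.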

Apply this once at import time, before calling create_deep_agent. Ugly, but it is the only reliable switch today.

(b) Open a feature request. A disable_prompt_cache: bool = False kwarg on create_deep_agent (or better, letting the caller fully replace the tail middleware) is a reasonable ask - track or file it at https://github.com/langchain-ai/deepagents/issues.


Worth sanity-checking the premise. Per Anthropic’s prompt-caching docs, Claude 3 Haiku, Claude 3.5 Haiku and Claude Haiku 4.5 all support prompt caching (2,048-token minimum for the Haiku family). If your LiteLLM call is failing with a cache-related error, the likely culprit is not the model - it is that:

  • your LiteLLM version does not forward cache_control blocks on the Anthropic path, or
  • LiteLLM is routing to a backend that does not (e.g. Bedrock Claude Haiku, where caching support/headers differ), or
  • the request is below the 2,048-token minimum (this should be a silent no-cache, not an error - if it errors, the proxy is rejecting the block).

If you share the actual error message from LiteLLM, the root cause is usually identifiable without disabling caching at all. But if you just want it gone: use Fix 1.

Would love to know more about your use case @AmitPZepto – what does LiteLLM caching do better? Are you using ChatLiteLLM?

We have customization in the pipeline. Knowing more will help us get it right.

but same here

error: Bad request (400): Error code: 400 - {‘error’: {‘message’: 'litellm.BadRequestError: DatabricksException - {“error_code”:“BAD_REQUEST”,“message”:"{\“message\”:\“cache_control: Extra inputs are not permitted\”}

import logging

from deepagents import create_deep_agent
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool

load_dotenv()

# Databricks in LiteLLM proxy rejects cache_control - disable caching

logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s | %(message)s",
)
log = logging.getLogger(__name__)

model = ChatAnthropic(model="claude-sonnet-4-6", temperature=0)


@tool
def spawn_agent(task_type: str, task_description: str) -> str:
    """Spawn a new specialized agent for any task dynamically."""
    log.info(
        "supervisor -> spawn_agent | task_type=%r | task_description=%r",
        task_type,
        task_description,
    )

    prompt_response = model.invoke(
        f"Write a short system prompt (2-3 lines) for an AI agent specialized in: {task_type}. "
        f"Task it needs to handle: {task_description}. "
        f"Return ONLY the system prompt, nothing else."
    )
    system_prompt = prompt_response.content

    print(f"\n[SPAWNED] {task_type} agent")
    print(f"[PROMPT]  {system_prompt}\n")

    agent = create_deep_agent(
        name=f"{task_type}_agent",
        tools=[],
        system_prompt=system_prompt,
    )

    result = agent.invoke(
        {"messages": [{"role": "user", "content": task_description}]}
    )

    return result["messages"][-1].content


supervisor = create_deep_agent(
    name="supervisor",
    tools=[spawn_agent],
    system_prompt="""You are a supervisor agent.

For EVERY task from the user:

  1. Decide the task_type in 1-2 words (e.g. analysis, research, coding, legal, etc.)
  2. Call spawn_agent(task_type, task_description)
  3. Return the result

on the basis of user input, you need to decide the task_type and task_description.
""",
)

while True:
    user_input = input("\nYou: ").strip()
    if not user_input or user_input == "exit":
        break

    result = supervisor.invoke(
        {"messages": [{"role": "user", "content": user_input}]}
    )
    print("result: ", result)
    print(f"\nSupervisor: {result['messages'][-1].content}")

error: Bad request (400): Error code: 400 - {‘error’: {‘message’: 'litellm.BadRequestError: DatabricksException - {“error_code”:“BAD_REQUEST”,“message”:"{\“message\”:\“cache_control: Extra inputs are not permitted\”} with using deepagent sdk