First Bedrock call after idle is slow on TTFT (follow-ups in the same trace are fast)

Context

We’re running a LangGraph-based agent that calls Amazon Bedrock using the Converse API via langchain-aws ChatBedrockConverse. In LangSmith, traces show that most of the latency on the first model step is inside the Bedrock child span.

We experience first-invocation / post-idle latency. After idle periods, the first Bedrock invocation in a new interaction can have much higher time-to-first-token (TTFT) than nearby invocations.

Configuration

Bedrock Runtime region: us-east-1
Model / inference profile: us.anthropic.claude-sonnet-4-6 (US cross-region inference profile)
Client stack: boto3 Bedrock Runtime, Converse-style invocation through LangChain

Controlled experiment

We ran an idle-sweep test, invoking the deployed graph (same prod-like path) once per gap. Idle gaps tested: 0s, 30s, 60s, 5m, 10m, 30m, 1h, 2h, 3h. Prompt: system prompt plus a minimal one-line probe (~16k tokens total). We do see a cold-start-like mode (~63–64s TTFT) at hour-scale idleness, but it appears intermittent/probabilistic rather than a strict deterministic threshold (e.g., 2h was fast while 1h and 3h were slow). A plot of the idle-sweep results is attached.
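
For reference, a simplified sketch of the probe harness (the real runs go through the deployed graph rather than calling Bedrock directly; converse_stream is used here so TTFT can be read off the first content delta):

import time
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
GAPS_S = [0, 30, 60, 300, 600, 1800, 3600, 7200, 10800]

def ttft_probe() -> float:
    """Return seconds from request start until the first streamed content delta."""
    start = time.monotonic()
    stream = client.converse_stream(
        modelId="us.anthropic.claude-sonnet-4-6",
        messages=[{"role": "user", "content": [{"text": "ping"}]}],
    )
    for event in stream["stream"]:
        if "contentBlockDelta" in event:
            break  # TTFT reached; the rest of the stream is abandoned for brevity
    return time.monotonic() - start

for gap in GAPS_S:
    time.sleep(gap)
    print(f"idle={gap:>5}s  ttft={ttft_probe():.2f}s")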

Questions

  1. Is this “intermittent high TTFT after long idle” expected for this model/profile and region?
  2. Are there recommended mitigations from AWS side?

Hello @formertheorist :sunny:

Q1: Is intermittent high TTFT after long idle expected for this model/profile/region?

Yes, and it is not a model-level cold start. The ~63–64 s spike is a well-documented infrastructure artifact, not Anthropic model initialization. Two compounding mechanisms explain both the magnitude and the intermittency:

Root cause 1: NAT Gateway idle-connection timeout (350 s)

AWS NAT Gateways, Interface VPC Endpoints, and NLBs all silently drop TCP connections idle for ≥ 350 seconds (AWS VPC troubleshooting docs). The remote end (Bedrock) is never notified; it still thinks the connection is alive. Your boto3 client then tries to reuse that dead connection.

Root cause 2: boto3 silent retry absorbs the dead-connection hang

When boto3 sends on a dead socket, the first attempt hangs until the read_timeout fires (default: 60 s), then an automatic retry succeeds in a few seconds. This is precisely why you see ~63–64 s = ~60 s timeout + ~3–4 s real model latency.
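
A reproduction sketch under this hypothesis (run it from the VPC path that traverses the NAT gateway; retries are capped at a single attempt so the hang surfaces as a ReadTimeoutError instead of being silently absorbed):

import time
import boto3
from botocore.config import Config

# standard mode max_attempts=1 means one attempt total: no silent retry
client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 1, "mode": "standard"}),
)
msg = [{"role": "user", "content": [{"text": "ping"}]}]

client.converse(modelId="us.anthropic.claude-sonnet-4-6", messages=msg)  # warm the pool
time.sleep(400)  # exceed the 350 s NAT idle timeout

t0 = time.monotonic()
try:
    client.converse(modelId="us.anthropic.claude-sonnet-4-6", messages=msg)
    print(f"second call succeeded in {time.monotonic() - t0:.1f}s")
except Exception as exc:  # expect ~60 s (default read_timeout) if the socket was dropped
    print(f"second call failed after {time.monotonic() - t0:.1f}s: {exc!r}")

With default retry settings left in place, boto3.set_stream_logger("botocore", logging.DEBUG) makes the otherwise-silent timeout and retry visible in the logs.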

Why intermittent and not a strict threshold?

  • The connection pool may hold multiple sockets of varying ages; sometimes a live one is picked first.
  • If any traffic traverses the connection mid-idle (another request, a retried call), the 350 s clock resets; that would explain why the 2 h run was fast while 1 h and 3 h were slow.
  • Cross-region inference routing to a different destination region can arrive over a fresh TCP path, bypassing the stale local socket entirely.

This exact pattern has been reported against ChatBedrockConverse in langchain-aws#502 and langchain-aws#819.


Q2: Recommended mitigations (layered, most impactful first)

1. Enable TCP keepalive on the boto3 client (highest impact)

This sends OS-level keepalive probes before 350 s, preventing NAT from dropping the connection. Supported natively since botocore merged PR #3140:

from botocore.config import Config
import boto3

bedrock_client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    config=Config(
        tcp_keepalive=True,       # prevents NAT idle-timeout drops
        read_timeout=300,         # must exceed your longest expected TTFT
        connect_timeout=10,
        retries={"max_attempts": 2, "mode": "standard"},
    ),
)
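
To make the graph's model actually use this client, pass it in; ChatBedrockConverse accepts a pre-built client (check your langchain-aws version's signature):

from langchain_aws import ChatBedrockConverse

llm = ChatBedrockConverse(
    client=bedrock_client,  # reuse the keepalive-configured client above
    model_id="us.anthropic.claude-sonnet-4-6",
)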

You also need the OS-level keepalive idle time (the delay before the first probe) to be under 350 s on the host/pod. On Linux:

# keepalive probes start after 60 s idle, every 10 s, 6 probes before giving up
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=6

For EKS, this can be set at the node group level or via a DaemonSet that applies sysctl.

2. Use streaming (astream / converseStream) instead of blocking invoke

While a response is streaming, bytes flow continuously, so an in-flight call never sits idle into the 350 s window (pair this with the keepalive from item 1 to cover the gaps between requests). It is the most architecturally resilient fix and also reduces TTFT by delivering first tokens as soon as they are generated:

# assuming llm is the ChatBedrockConverse instance configured above
async for chunk in llm.astream(messages):
    process(chunk)  # handle each partial AIMessageChunk as it arrives
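
If the graph is the entry point rather than the model, the same applies one level up; assuming a standard compiled LangGraph graph, stream_mode="messages" surfaces tokens as the model produces them:

# yields (message_chunk, metadata) tuples as LLM tokens arrive
async for token, metadata in graph.astream(inputs, stream_mode="messages"):
    process(token)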

3. Upgrade langchain-aws to ≥ 1.2.3

A separate but related bug, in which streaming response bodies were not properly closed, left connections in the pool in a bad state. Fixed in langchain-aws#858 and released in 1.2.3.

4. Use performanceConfig: latency=optimized

Latency-optimized inference is requested through the Converse API's top-level performanceConfig field; support is model- and region-specific, so confirm in the Bedrock docs that your model/profile offers it. It reduces baseline TTFT but does not fix the stale-connection issue directly:

# Via boto3 directly - performanceConfig is a top-level Converse field:
bedrock_client.converse(
    modelId="us.anthropic.claude-sonnet-4-6",
    messages=[{"role": "user", "content": [{"text": "ping"}]}],
    performanceConfig={"latency": "optimized"},
)

# Recent langchain-aws versions expose the same setting as performance_config
# (check your version's ChatBedrockConverse signature):
ChatBedrockConverse(
    model_id="us.anthropic.claude-sonnet-4-6",
    performance_config={"latency": "optimized"},
)

5. Monitor with CloudWatch

Use the NAT gateway IdleTimeoutCount metric (AWS/NATGateway namespace) to confirm idle drops are the trigger, and the RetryAttempts field in each boto3 response's ResponseMetadata to confirm silent retries are the amplifier.
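
A lightweight in-app check for the retry signal (RetryAttempts is part of boto3's standard ResponseMetadata):

resp = bedrock_client.converse(
    modelId="us.anthropic.claude-sonnet-4-6",
    messages=[{"role": "user", "content": [{"text": "ping"}]}],
)
# a slow call with RetryAttempts > 0 is the stale-connection signature
if resp["ResponseMetadata"].get("RetryAttempts", 0) > 0:
    print("boto3 silently retried this call (likely a stale connection)")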


Summary table

Layer           Fix                                           Effort
Network         tcp_keepalive=True + OS sysctl                Low
SDK             langchain-aws >= 1.2.3                        Trivial
Architecture    Switch to astream()                           Medium
Bedrock API     performanceConfig: optimized                  Low
Observability   CloudWatch IdleTimeoutCount + RetryAttempts   Low

The ~63–64 s spikes at hour-scale idleness are not a Bedrock or cross-region inference cold start; they are a TCP connection-management problem that the mitigations above eliminate.