First Bedrock call after idle is slow on TTFT (follow-ups in the same trace are fast)

Context

We’re running a LangGraph-based agent that calls Amazon Bedrock using the Converse API via langchain-aws ChatBedrockConverse. In LangSmith, traces show that most of the latency on the first model step is inside the Bedrock child span.

We experience first-invocation / post-idle latency. After idle periods, the first Bedrock invocation in a new interaction can have much higher time-to-first-token (TTFT) than nearby invocations.

Configuration

Bedrock Runtime region: us-east-1
Model / inference profile: us.anthropic.claude-sonnet-4-6 (US cross-region inference profile)
Client stack: boto3 Bedrock Runtime, Converse-style invocation through LangChain

Controlled experiment

We ran an idle-sweep test, invoking the deployed graph (same prod-like path) once per gap. Idle gaps tested: 0s, 30s, 60s, 5m, 10m, 30m, 1h, 2h, 3h. Prompt: system prompt plus a minimal one-line probe (~16k tokens total). We do see a cold-start-like mode (~63–64s TTFT) at hour-scale idleness, but it appears intermittent/probabilistic rather than a strict deterministic threshold (e.g., 2h was fast while 1h and 3h were slow). A plot of the idle-sweep results is attached.
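
For reference, a simplified sketch of the probe harness (the real runs go through the deployed graph rather than calling Bedrock directly; converse_stream is used here so TTFT can be read off the first content delta):

import time
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
GAPS_S = [0, 30, 60, 300, 600, 1800, 3600, 7200, 10800]

def ttft_probe() -> float:
    """Return seconds from request start until the first streamed content delta."""
    start = time.monotonic()
    stream = client.converse_stream(
        modelId="us.anthropic.claude-sonnet-4-6",
        messages=[{"role": "user", "content": [{"text": "ping"}]}],
    )
    for event in stream["stream"]:
        if "contentBlockDelta" in event:
            break  # TTFT reached; the rest of the stream is abandoned for brevity
    return time.monotonic() - start

for gap in GAPS_S:
    time.sleep(gap)
    print(f"idle={gap:>5}s  ttft={ttft_probe():.2f}s")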

Questions

  1. Is this “intermittent high TTFT after long idle” expected for this model/profile and region?
  2. Are there recommended mitigations from AWS side?

Hello @formertheorist :sunny:

Q1: Is intermittent high TTFT after long idle expected for this model/profile/region?

Yes, and it is not a model-level cold start. The ~63–64 s spike is a well-documented infrastructure artifact, not Anthropic model initialization. Two compounding mechanisms explain both the magnitude and the intermittency:

Root cause 1: NAT Gateway idle-connection timeout (350 s)

AWS NAT Gateways, Interface VPC Endpoints, and NLBs all silently drop TCP connections idle for ≥ 350 seconds (AWS VPC troubleshooting docs). The remote end (Bedrock) is never notified; it still thinks the connection is alive. Your boto3 client then tries to reuse that dead connection.

Root cause 2: boto3 silent retry absorbs the dead-connection hang

When boto3 sends on a dead socket, the first attempt hangs until the read_timeout fires (default: 60 s), then an automatic retry succeeds in a few seconds. This is precisely why you see ~63–64 s = ~60 s timeout + ~3–4 s real model latency.
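
A reproduction sketch under this hypothesis (run it from the VPC path that traverses the NAT gateway; retries are capped at a single attempt so the hang surfaces as a ReadTimeoutError instead of being silently absorbed):

import time
import boto3
from botocore.config import Config

# standard mode max_attempts=1 means one attempt total: no silent retry
client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 1, "mode": "standard"}),
)
msg = [{"role": "user", "content": [{"text": "ping"}]}]

client.converse(modelId="us.anthropic.claude-sonnet-4-6", messages=msg)  # warm the pool
time.sleep(400)  # exceed the 350 s NAT idle timeout

t0 = time.monotonic()
try:
    client.converse(modelId="us.anthropic.claude-sonnet-4-6", messages=msg)
    print(f"second call succeeded in {time.monotonic() - t0:.1f}s")
except Exception as exc:  # expect ~60 s (default read_timeout) if the socket was dropped
    print(f"second call failed after {time.monotonic() - t0:.1f}s: {exc!r}")

With default retry settings left in place, boto3.set_stream_logger("botocore", logging.DEBUG) makes the otherwise-silent timeout and retry visible in the logs.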

Why intermittent and not a strict threshold?

  • The connection pool may hold multiple sockets of varying ages; sometimes a live one is picked first.
  • If any traffic traverses the connection mid-idle (another request, a retried call), the 350 s clock resets; that would explain why the 2 h run was fast while 1 h and 3 h were slow.
  • Cross-region inference routing to a different destination region can arrive over a fresh TCP path, bypassing the stale local socket entirely.

This exact pattern has been reported against ChatBedrockConverse in langchain-aws#502 and langchain-aws#819.


Q2: Recommended mitigations (layered, most impactful first)

1. Enable TCP keepalive on the boto3 client (highest impact)

This sends OS-level keepalive probes before 350 s, preventing NAT from dropping the connection. Supported natively since botocore merged PR #3140:

from botocore.config import Config
import boto3

bedrock_client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    config=Config(
        tcp_keepalive=True,       # prevents NAT idle-timeout drops
        read_timeout=300,         # must exceed your longest expected TTFT
        connect_timeout=10,
        retries={"max_attempts": 2, "mode": "standard"},
    ),
)
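
To make the graph's model actually use this client, pass it in; ChatBedrockConverse accepts a pre-built client (check your langchain-aws version's signature):

from langchain_aws import ChatBedrockConverse

llm = ChatBedrockConverse(
    client=bedrock_client,  # reuse the keepalive-configured client above
    model_id="us.anthropic.claude-sonnet-4-6",
)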

You also need the OS-level keepalive idle time (the delay before the first probe) to be under 350 s on the host/pod. On Linux:

# keepalive probes start after 60 s idle, every 10 s, 6 probes before giving up
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=6

For EKS, this can be set at the node group level or via a DaemonSet that applies sysctl.

2. Use streaming (astream / converseStream) instead of blocking invoke

While a response is streaming, bytes flow continuously, so an in-flight call never sits idle into the 350 s window (pair this with the keepalive from item 1 to cover the gaps between requests). It is the most architecturally resilient fix and also reduces TTFT by delivering first tokens as soon as they are generated:

# assuming llm is the ChatBedrockConverse instance configured above
async for chunk in llm.astream(messages):
    process(chunk)  # handle each partial AIMessageChunk as it arrives
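
If the graph is the entry point rather than the model, the same applies one level up; assuming a standard compiled LangGraph graph, stream_mode="messages" surfaces tokens as the model produces them:

# yields (message_chunk, metadata) tuples as LLM tokens arrive
async for token, metadata in graph.astream(inputs, stream_mode="messages"):
    process(token)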

3. Upgrade langchain-aws to ≥ 1.2.3

A separate but related bug, in which streaming response bodies were not properly closed, left connections in the pool in a bad state. Fixed in langchain-aws#858 and released in 1.2.3.

4. Use performanceConfig: latency=optimized

Latency-optimized inference is requested through the Converse API's top-level performanceConfig field; support is model- and region-specific, so confirm in the Bedrock docs that your model/profile offers it. It reduces baseline TTFT but does not fix the stale-connection issue directly:

# Via boto3 directly - performanceConfig is a top-level Converse field:
bedrock_client.converse(
    modelId="us.anthropic.claude-sonnet-4-6",
    messages=[{"role": "user", "content": [{"text": "ping"}]}],
    performanceConfig={"latency": "optimized"},
)

# Recent langchain-aws versions expose the same setting as performance_config
# (check your version's ChatBedrockConverse signature):
ChatBedrockConverse(
    model_id="us.anthropic.claude-sonnet-4-6",
    performance_config={"latency": "optimized"},
)

5. Monitor with CloudWatch

Use the NAT gateway IdleTimeoutCount metric (AWS/NATGateway namespace) to confirm idle drops are the trigger, and the RetryAttempts field in each boto3 response's ResponseMetadata to confirm silent retries are the amplifier.
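
A lightweight in-app check for the retry signal (RetryAttempts is part of boto3's standard ResponseMetadata):

resp = bedrock_client.converse(
    modelId="us.anthropic.claude-sonnet-4-6",
    messages=[{"role": "user", "content": [{"text": "ping"}]}],
)
# a slow call with RetryAttempts > 0 is the stale-connection signature
if resp["ResponseMetadata"].get("RetryAttempts", 0) > 0:
    print("boto3 silently retried this call (likely a stale connection)")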


Summary table

Layer           Fix                                           Effort
Network         tcp_keepalive=True + OS sysctl                Low
SDK             langchain-aws >= 1.2.3                        Trivial
Architecture    Switch to astream()                           Medium
Bedrock API     performanceConfig: optimized                  Low
Observability   CloudWatch IdleTimeoutCount + RetryAttempts   Low

The ~63–64 s spikes at hour-scale idleness are not a Bedrock or cross-region inference cold start; they are a TCP connection-management problem that the mitigations above eliminate.