Hello @formertheorist 
Q1: Is intermittent high TTFT after long idle expected for this model/profile/region?
Yes, and it is not a model-level cold start. The ~63–64 s spike is a well-documented infrastructure artifact, not Anthropic model initialization. Two compounding mechanisms explain both the magnitude and the intermittency:
Root cause 1: NAT Gateway idle-connection timeout (350 s)
AWS NAT Gateways, Interface VPC Endpoints, and NLBs all silently drop TCP connections idle for ≥ 350 seconds (AWS VPC troubleshooting docs). The remote end (Bedrock) is never notified; it still thinks the connection is alive. Your boto3 client then tries to reuse that dead connection.
Root cause 2: boto3 silent retry absorbs the dead-connection hang
When boto3 sends on a dead socket, the first attempt hangs until the read_timeout fires (default: 60 s), then an automatic retry succeeds in a few seconds. This is precisely why you see ~63–64 s = ~60 s timeout + ~3–4 s real model latency.
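If you want to confirm this signature from the application side, a minimal sketch follows. timed_converse is a hypothetical helper (not part of boto3); ResponseMetadata.RetryAttempts is a standard botocore response field:

```python
import time

def timed_converse(client, **kwargs):
    """Hypothetical wrapper: flags the ~60 s timeout + retry signature described above."""
    start = time.monotonic()
    response = client.converse(**kwargs)
    elapsed = time.monotonic() - start
    retries = response["ResponseMetadata"].get("RetryAttempts", 0)
    if retries > 0 and elapsed > 30:
        # a successful call that needed a retry and took tens of seconds is
        # the stale-connection pattern, not slow generation
        print(f"stale-connection signature: {elapsed:.1f} s, {retries} retry(ies)")
    return response
```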
Why intermittent and not a strict threshold?
- The connection pool may hold multiple sockets of varying ages; sometimes a live one is picked first.
- If any traffic crosses the socket before 350 s of inactivity elapse, the idle clock resets; your 2 h run may simply never have had a single internal gap longer than 350 s.
- Cross-region inference routing to a different destination region can arrive over a fresh TCP path, bypassing the stale local socket entirely.
This exact pattern has been reported against ChatBedrockConverse in langchain-aws#502 and langchain-aws#819.
Q2: Recommended mitigations (layered, most impactful first)
1. Enable TCP keepalive on the boto3 client (highest impact)
This sends OS-level keepalive probes before 350 s, preventing NAT from dropping the connection. Supported natively since botocore merged PR #3140:
```python
from botocore.config import Config
import boto3

bedrock_client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    config=Config(
        tcp_keepalive=True,  # prevents NAT idle-timeout drops
        read_timeout=300,    # must exceed your longest expected full response time
        connect_timeout=10,
        retries={"max_attempts": 2, "mode": "standard"},
    ),
)
```
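If you build your LLM through langchain-aws, make sure this configured client is the one the wrapper actually uses; a minimal sketch, assuming your installed ChatBedrockConverse accepts a pre-built client object:

```python
from langchain_aws import ChatBedrockConverse

# Assumption: this version of ChatBedrockConverse accepts a pre-built
# bedrock-runtime client; otherwise it constructs its own default client
# internally and the keepalive Config above never takes effect.
llm = ChatBedrockConverse(
    model_id="us.anthropic.claude-sonnet-4-6",
    client=bedrock_client,
)
```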
Note that tcp_keepalive=True only sets SO_KEEPALIVE on the socket; the probe timing comes from the OS, so the keepalive timers must fire well before 350 s on the host/pod. On Linux:
```bash
# keepalive probes start after 60 s idle, then every 10 s, up to 6 probes before giving up
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=6
```
For EKS, this can be set at the node group level or via a DaemonSet that applies sysctl.
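To verify what a host or pod actually inherits at runtime, here is a minimal read-only check; it just reads the same /proc entries the sysctl commands above write:

```python
from pathlib import Path

# Probes must start well before the 350 s NAT window; a tcp_keepalive_time
# above ~340 s leaves the window uncovered even with tcp_keepalive=True
# set on the client.
for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
    value = Path(f"/proc/sys/net/ipv4/{name}").read_text().strip()
    print(f"{name} = {value}")
```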
2. Use streaming (astream / converseStream) instead of blocking invoke
Streaming keeps bytes flowing for the whole duration of a request, so an in-flight call never sits idle long enough to hit the 350 s window. This is the most resilient fix architecturally, and it also reduces perceived TTFT by delivering the first tokens as soon as they are generated:
```python
async for chunk in llm.astream(messages):
    process(chunk)
```
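A slightly fuller sketch, assuming the llm object built in mitigation 1 above (the print handling is illustrative):

```python
import asyncio
from langchain_core.messages import HumanMessage

async def main() -> None:
    messages = [HumanMessage(content="Summarize the NAT idle-timeout issue.")]
    # the first chunk arrives as soon as generation starts, and the continuous
    # byte flow keeps the connection from ever sitting idle mid-request
    async for chunk in llm.astream(messages):
        # chunk.content may be a string or a list of content blocks,
        # depending on your langchain-aws version
        print(chunk.content, end="", flush=True)

asyncio.run(main())
```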
3. Upgrade langchain-aws to ≥ 1.2.3
A separate but related bug: streaming response bodies were not properly closed, which left connections accumulating in a bad state in the pool. Fixed in langchain-aws#858, released in 1.2.3.
4. Use performanceConfig: latency=optimized
Available on Claude Sonnet models via the Converse API. Reduces baseline TTFT but does not fix the stale-connection issue directly:
```python
# Via langchain-aws, assuming your installed version exposes the
# performance_config constructor field (check your release notes);
# if it does not, use the boto3 call below:
llm = ChatBedrockConverse(
    model_id="us.anthropic.claude-sonnet-4-6",
    performance_config={"latency": "optimized"},
)

# Or via boto3 directly:
response = bedrock_client.converse(
    modelId="us.anthropic.claude-sonnet-4-6",
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
    performanceConfig={"latency": "optimized"},
)
```
5. Monitor with CloudWatch
Use the NAT Gateway IdleTimeoutCount metric to confirm that idle-timeout drops line up with your slow requests, and the RetryAttempts field in each response's ResponseMetadata to confirm that a boto3 retry is what turns a dropped connection into a ~60 s stall.
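A minimal sketch for pulling that metric, assuming a placeholder NAT gateway ID of nat-0123456789abcdef0:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# IdleTimeoutCount counts connections that transitioned from active to idle;
# spikes here that line up with slow requests confirm the NAT-drop theory.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/NATGateway",
    MetricName="IdleTimeoutCount",
    Dimensions=[{"Name": "NatGatewayId", "Value": "nat-0123456789abcdef0"}],  # placeholder ID
    StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```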
Summary table
| Layer | Fix | Effort |
| --- | --- | --- |
| Network | `tcp_keepalive=True` + OS sysctl | Low |
| SDK | `langchain-aws >= 1.2.3` | Trivial |
| Architecture | Switch to `astream()` | Medium |
| Bedrock API | `performanceConfig: optimized` | Low |
| Observability | CloudWatch `IdleTimeoutCount` + `RetryAttempts` | Low |
The ~63–64 s spikes after hour-scale idleness are not a Bedrock or cross-region inference cold-start issue: they are a TCP connection management problem, and the mitigations above eliminate it.