Prod Langsmith deployment raising internal redis errors

KurtzOmer · January 4, 2026, 10:30am

Hi team & community,

We are currently experiencing a critical issue with our production deployment. Our instance has been down for several hours, and we are seeking an urgent resolution.

Deployment Details:

- Deployment Name: mcp-playground-agent-scale

- Deployment ID: acb59f5b-e4b4-4c9c-be09-3e9cd8abbe1d

- Workspace ID: 5f6ecbbf-d894-480c-b184-9b677b3ed123

Latest Revision ID: 34c2f392-e53d-471e-ba20-1927d9c65c0f
Last Working Revision ID: 538ed1d8-3590-486b-814f-df722e4ca307

- URL: https://smith.langchain.com/o/5f6ecbbf-d894-480c-b184-9b677b3ed123/host/deployments/acb59f5b-e4b4-4c9c-be09-3e9cd8abbe1d?revisionPeek=7ffbe9c7-3294-44ae-9b69-4c1ec144a49c

Based on deployment server logs, the application is failing to start due to internal Redis connection timeouts. Specifically, we are seeing redis.exceptions.TimeoutError: Timeout connecting to server and asyncio.exceptions.CancelledError during the migration and lifespan startup phases.

Relevant Log Snippet:

1/4/2026, 9:32:05 AM Application startup failed. Exiting.

…

redis.exceptions.TimeoutError: Timeout connecting to server

1/4/2026, 9:31:55 AM Redis ping timed out

Could you please investigate this internal connectivity issue and advise on a resolution as soon as possible?

This deployment was well used for the past few months, without errors of this kind. Also, we didn’t make any change to the repo (didn’t push to main), or to the deployment. It just started to raise these errors.

This is the only deployment in our workspace raising these errors (we have one more production deployment and one more development deployment that work fine).

This is a crucial deployment for us, holding most of our threads, and we are seeking to fix & enable it back again.

Thanks in advance!

hari · January 4, 2026, 6:37pm

Hi @KurtzOmer ! Looks like your redis pod was stuck; on initial investigation, looks like this happened during a node pool outage we had on 12/26 at around 8pm pacific, where certain pods were unscheduled.

I just manually restarted your redis deployment, and looks like everything is back to being healthy now. Let me know if it is working - apologies for this major disruption!

KurtzOmer · January 5, 2026, 8:32am

Hi @hari , it works indeed, thanks!

Can you explain the root cause? And how come we are the ones that need to acknowledge Langchain team about the issue, while you should be able to find and fix these problematic cases?

We are pretty worried that it might just happen again, without no one taking care of it for us (or even informing us about it). Our main flow was broken for few hours, customers couldn’t use our platform, and support team (via email) didn’t respond - its all a pretty bad experience.

Topic		Replies	Views
LangGraph deployment fail because of a Redis connection issue Deployment	1	265	October 2, 2025
LangSmith Cloud Deployment Intermittent Issue Deployment cloud	8	156	April 3, 2026
Production deployment health check fails (600s timeout) — Development works with identical code LangSmith Product Help cloud	1	97	March 16, 2026
Deployment failing after LangChain/Graph JS v1 upgrade Deployment js-help	8	596	November 13, 2025
asyncio.exceptions.CancelledError on deployment when runtime reaches 5 minutes + LangSmith Product Help python-help	1	150	March 10, 2026

Prod Langsmith deployment raising internal redis errors

Related topics