LangSmith Cloud Deployment Intermittent Issue

Hey,

We are observing intermittent issues with our LangSmith Hosted Cloud Deployment.

  • LangGraph API Version: 0.7.90
  • Deployment Mode: LangSmith Cloud
  • Deployment Type: Development
  • Runtime: Node 20

Below are stack traces from the server logs, which suggest underlying infra errors bubbling up.

Is this a known issue, and is there any mitigation or resolution available?

Best


30/03/2026, 17:45:10 Closing gRPC client pool (5 clients)
30/03/2026, 17:45:10 Terminating JS graphs process
30/03/2026, 17:45:10 Shutting down remote graphs
30/03/2026, 17:45:10 Lifespan failed
Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/usr/lib/python3.12/site-packages/httpx/_transports/default.py", line 394, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
    raise exc from None
  File "/usr/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 236, in handle_async_request
    response = await connection.handle_async_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpcore/_async/connection.py", line 101, in handle_async_request
    raise exc
  File "/usr/lib/python3.12/site-packages/httpcore/_async/connection.py", line 78, in handle_async_request
    stream = await self._connect(request)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpcore/_async/connection.py", line 124, in _connect
    stream = await self._network_backend.connect_tcp(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpcore/_backends/auto.py", line 31, in connect_tcp
    return await self._backend.connect_tcp(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 113, in connect_tcp
    with map_exceptions(exc_map):
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/usr/lib/python3.12/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectError: All connection attempts failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/langgraph_runtime_postgres/lifespan.py", line 178, in lifespan
    await graph.collect_graphs_from_env(True)
  File "/api/langgraph_api/graph.py", line 483, in collect_graphs_from_env
  File "/api/langgraph_api/js/remote.py", line 828, in wait_until_js_ready
  File "/usr/lib/python3.12/site-packages/httpx/_client.py", line 1768, in get
    return await self.request(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpx/_client.py", line 1540, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpx/_client.py", line 1629, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpx/_client.py", line 1730, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/httpx/_transports/default.py", line 393, in handle_async_request
    with map_httpcore_exceptions():
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/usr/lib/python3.12/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: All connection attempts failed
30/03/2026, 17:44:32 Resolving graph my-workflow-v2
30/03/2026, 17:44:32 Resolving graph my-workflow
30/03/2026, 17:44:32 Starting graph loop
30/03/2026, 17:44:02 Successfully submitted metadata to LangSmith instance

--- (other instances of errors also observed)

30/03/2026, 15:29:04 Closing gRPC client pool (5 clients)
30/03/2026, 15:29:04 Checkpointer ingestion task cancelled. Draining queue.
30/03/2026, 15:29:04 Shutting down remote graphs
30/03/2026, 15:29:04 Terminating JS graphs process
30/03/2026, 15:29:04 Received SIGTERM. Exiting..
30/03/2026, 15:29:04 Finished server process [1]
30/03/2026, 15:29:04 Application shutdown complete.
30/03/2026, 15:29:04 Waiting for application shutdown.
30/03/2026, 15:29:04 Shutting down
30/03/2026, 15:28:58 Received SIGTERM. Exiting..
30/03/2026, 15:28:58 asyncio.exceptions.CancelledError
30/03/2026, 15:28:58     await getter
30/03/2026, 15:28:58   File "/usr/lib/python3.12/asyncio/queues.py", line 158, in get
30/03/2026, 15:28:58            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
30/03/2026, 15:28:58 Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/starlette/routing.py", line 645, in lifespan
    await receive()
  File "/usr/lib/python3.12/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
30/03/2026, 15:28:58 During handling of the above exception, another exception occurred:
30/03/2026, 15:28:58 SystemExit: 1
30/03/2026, 15:28:58   File "/api/langgraph_api/graph.py", line 512, in _handle_exception
30/03/2026, 15:28:58   File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
30/03/2026, 15:28:58   File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
30/03/2026, 15:28:58   File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
30/03/2026, 15:28:58   File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
30/03/2026, 15:28:58   File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
30/03/2026, 15:28:58   File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
30/03/2026, 15:28:58   File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
30/03/2026, 15:28:58            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
30/03/2026, 15:28:58     return self._loop.run_until_complete(task)
30/03/2026, 15:28:58   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
30/03/2026, 15:28:58            ^^^^^^^^^^^^^^^^
30/03/2026, 15:28:58     return runner.run(main)
30/03/2026, 15:28:58   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
30/03/2026, 15:28:58 ERROR:    Traceback (most recent call last):
30/03/2026, 15:28:58 Entrypoint task finished
30/03/2026, 15:28:58 Checkpointer ingestion task cancelled. Draining queue.
30/03/2026, 15:28:58 Shutting down health and metrics server
30/03/2026, 15:28:58 asyncio.exceptions.CancelledError
30/03/2026, 15:28:58     await getter
30/03/2026, 15:28:58   File "/usr/lib/python3.12/asyncio/queues.py", line 158, in get
30/03/2026, 15:28:58            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
30/03/2026, 15:28:58 Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/starlette/routing.py", line 645, in lifespan
    await receive()
  File "/usr/lib/python3.12/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
30/03/2026, 15:28:58 During handling of the above exception, another exception occurred:
30/03/2026, 15:28:58 SystemExit: 1
30/03/2026, 15:28:58   File "/api/langgraph_api/graph.py", line 512, in _handle_exception
30/03/2026, 15:28:58   File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
30/03/2026, 15:28:58   File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
30/03/2026, 15:28:58   File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
30/03/2026, 15:28:58   File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
30/03/2026, 15:28:58   File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
30/03/2026, 15:28:58   File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
30/03/2026, 15:28:58   File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
30/03/2026, 15:28:58            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
30/03/2026, 15:28:58     return self._loop.run_until_complete(task)
30/03/2026, 15:28:58   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
30/03/2026, 15:28:58            ^^^^^^^^^^^^^^^^
30/03/2026, 15:28:58     return runner.run(main)
30/03/2026, 15:28:58   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
30/03/2026, 15:28:58 Closing gRPC client pool (5 clients)
30/03/2026, 15:28:58 Terminating JS graphs process
30/03/2026, 15:28:58 Shutting down remote graphs
30/03/2026, 15:28:58 ERROR:    Traceback (most recent call last):
30/03/2026, 15:28:58 Successfully shutdown queue
30/03/2026, 15:28:58 Workers finished.
30/03/2026, 15:28:58 Queue task cancelled. Shutting down workers. Will terminate after 180s
30/03/2026, 15:28:58 Shutting down queue...
30/03/2026, 15:24:49 Successfully submitted metadata to LangSmith instance
30/03/2026, 15:24:49 HTTP Request: POST https://eu.api.smith.langchain.com/v1/metadata/submit "HTTP/1.1 204 No Content"
30/03/2026, 15:24:49 Successfully submitted metadata to LangSmith instance
30/03/2026, 15:24:49 HTTP Request: POST https://eu.api.smith.langchain.com/v1/metadata/submit "HTTP/1.1 204 No Content"
30/03/2026, 15:22:56 redis: 2026/03/30 14:22:56 pool.go:617: redis: connection pool: failed to dial after 3 attempts: dial tcp 192.168.112.124:6379: i/o timeout

Hi @mipestan

I think all three error patterns you’re seeing are expected behavior for Development deployments on LangSmith Cloud and may be a direct consequence of how development infrastructure is provisioned.

As stated in the Deploy to Cloud docs:

“Development deployments are meant for non-production use cases and are provisioned with minimal resources.”

“Production deployments can serve up to 500 requests/second and are provisioned with highly available storage with automatic backups.”

The fact that your development deployment runs on minimal resources explains why you’re hitting startup race conditions, and the SIGTERM/Redis errors suggest the underlying infrastructure is recycling your pod, which is consistent with how cloud providers manage low-priority workloads.

LangSmith Cloud runs on Google Kubernetes Engine (GKE) using services like Cloud SQL for PostgreSQL and Cloud Memorystore (Redis) - see Cloud Architecture.
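If pods can be recycled out from under you, one stopgap on the client side is to retry calls with exponential backoff so that a brief restart doesn’t surface as a hard failure. A minimal stdlib-only sketch (the retry counts and delays are illustrative, not platform recommendations):

```python
import time
import urllib.error
import urllib.request


def backoff_delays(attempts: int, base: float = 1.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(attempts - 1)]


def get_with_retry(url: str, attempts: int = 5, base: float = 1.0) -> bytes:
    """Fetch url, retrying on connection errors with exponential backoff.

    Rides out short windows where the deployment's pod is being replaced.
    """
    for delay in backoff_delays(attempts, base) + [None]:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.URLError:
            if delay is None:  # last attempt exhausted
                raise
            time.sleep(delay)
```

The same idea applies with whatever HTTP client or SDK you use to call the deployment; the point is to tolerate a short connection-refused window rather than fail on the first `ConnectError`.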

I’ll upgrade the environment and give it a try.

Thanks for the reply

Alright. Let us know whether that solves the issues.

Re-raising the issue here, now with a production deployment, after three days without any issues.

Several errors surrounding Redis:

[ERROR] Background worker scheduler failed
Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/langgraph_runtime_postgres/queue.py", line 312, in queue
    async for run, attempt, encryption_context in Runs.next(
  File "/api/langgraph_api/grpc/ops/runs.py", line 847, in next
  File "/usr/lib/python3.12/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "failed to get next run from queue"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"failed to get next run from queue", grpc_status:13}"
>
[ERROR] redis: 2026/04/02 16:52:44 pool.go:617: redis: connection pool: failed to dial after 5 attempts: dial tcp 192.168.103.108:6379: connect: connection refused
[ERROR] redis: 2026/04/02 16:52:43 pool.go:617: redis: connection pool: failed to dial after 5 attempts: dial tcp 192.168.103.108:6379: connect: connection refused
[ERROR] redis: 2026/04/02 16:52:42 pool.go:617: redis: connection pool: failed to dial after 5 attempts: dial tcp 192.168.103.108:6379: connect: connection refused

The deployment never actually reaches a stable state; it just keeps timing out on the Redis connection and rebooting.

Edit: after ~20 min the deployment reached a stable state, but this is unacceptable for our service uptime requirements. After roughly 10 consecutive runs it crashes again and the process repeats.
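To put numbers on the flapping for the support team, a small poller can log how long each outage lasts. A sketch assuming the deployment exposes LangGraph Server’s `GET /ok` health endpoint (the base URL below is a placeholder):

```python
import time
import urllib.error
import urllib.request


def poll_health(base_url: str, interval: float = 15.0,
                max_minutes: float = 30.0) -> bool:
    """Poll {base_url}/ok until it answers 200 or max_minutes elapses.

    Returns True (and prints the recovery time) once healthy, else False.
    """
    start = time.monotonic()
    deadline = start + max_minutes * 60
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/ok", timeout=5) as resp:
                if resp.status == 200:
                    print(f"healthy after {time.monotonic() - start:.0f}s")
                    return True
        except (urllib.error.URLError, OSError):
            pass  # deployment still down; keep polling
        time.sleep(interval)
    return False


# e.g. poll_health("https://<your-deployment-host>")  # hypothetical URL
```

Capturing the exact outage windows this way makes it easier to correlate them with the Redis dial failures in the server logs.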

Is there anything we can do to ensure the reliability of these upstream infra dependencies, or are our hands essentially tied when using the managed cloud deployment?

Thanks in advance

@mdrxy could you help?

Thanks, flagging to the team. @mipestan please flag again if you do not get a resolution by next Monday.

@mipestan can you provide a deployment id?

Deployment ID: ef16fe3c-14c4-4fea-9417-99a220436ec4