Production deployment health check fails (600s timeout) — Development works with identical code

Problem

The Production deployment's health check times out after 600 seconds, while a Development deployment with identical code and config succeeds.

  • Deployment: prod-agent
  • Workspace: Workspace 1
  • LangGraph API version: 0.7.72
  • Banner: “This deployment is using our new and improved architecture”

What works

  • Development deployment deploys successfully with full config (auth, custom HTTP app, middleware)
  • Server starts cleanly, warmup completes in ~8s, total startup ~28-42s
  • /ok endpoint is served and responsive
  • No crashes, no restarts
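
For anyone reproducing this, here is a minimal sketch of how the /ok endpoint can be polled from outside the deployment to confirm the server itself is responsive. The function name, the deadline and interval values, and the base URL in the usage note are assumptions for illustration. It only shows that the app answers HTTP requests, not what the platform's own revision health check looks at.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url: str, deadline_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `deadline_s` seconds elapse.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    end = time.monotonic() + deadline_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or non-2xx response: retry.
            pass
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)
```

Usage would look like wait_until_healthy("https://<your-deployment-url>/ok"), where the URL is a placeholder for the deployment's own address.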

What fails

  • Production deployment: “Timeout: New revision health check did not succeed after 600 seconds”
  • Even with ALL custom code removed (no auth, no middleware, no custom HTTP app, single graph), Production still fails
  • Redis connection errors seen in Production logs:
    redis: connection pool: failed to dial after 5 attempts: dial tcp 192.168.28.161:6379: connect: connection refused
    redis: connection pool: failed to dial after 1 attempts: dial tcp 192.168.28.161:6379: i/o timeout
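
The two log lines above show two different failure modes: "connection refused" means the host answered but nothing was listening on the port, while "i/o timeout" means no answer arrived at all. Here is a minimal sketch of a TCP probe that distinguishes the two, assuming it is run from somewhere that can route to the Redis address (for example, from inside the deployment's own code). The function name and timeout are hypothetical, and the probe only tests TCP reachability, not the Redis protocol or authentication.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the result.

    Returns "open", "refused", "timeout", or "error: <details>".
    Hypothetical helper for illustration; not part of any LangGraph API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host reachable, but no listener on the port.
        return "refused"
    except socket.timeout:
        # No response within `timeout` seconds (e.g. dropped packets).
        return "timeout"
    except OSError as exc:
        # Anything else: unreachable network, DNS failure, etc.
        return f"error: {exc}"


# Example call, using the address from the logs above:
# print(probe_tcp("192.168.28.161", 6379))
```

Since 192.168.28.161 is a private address, the probe is only meaningful from inside the same network as the deployment. From anywhere else it would be expected to time out or error regardless of Redis's state.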

What we tested

  1. Removed custom auth — still fails on Prod
  2. Removed custom HTTP app/middleware — still fails on Prod
  3. Removed all graphs except one — still fails on Prod
  4. Stripped to hello_world graph with zero deps — still fails on Prod
  5. Pinned base image to 0.7.69 — still fails on Prod
  6. Created brand new Production deployment — still fails
  7. Development deployment with full config — works

Server logs (Production)

The server starts normally and serves all health check requests, but the revision is never marked healthy:
Application started up in 25.530s
Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Then Redis fails:
redis: connection pool: failed to dial after 5 attempts: connect: connection refused

Request

The Production deployment infrastructure appears to have a Redis provisioning issue on the “new architecture.” Can you investigate the Redis connectivity for our Production deployment?

Hi @aadedewe_epoch

We had an incident, posted on our status page, that lines up with the timeout error you reported.

If the issue persists, please let us know.

Best,
Chad