# LangGraph Production Connection Pooling Inquiry

## Background

We’re building a production multi-agent system using LangGraph with PostgreSQL checkpointing and have encountered a critical issue with connection pool management that we’d like to discuss with the LangGraph team.

## Our Architecture

We have a multi-agent system with the following setup:

```python
from abc import ABC

from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver


class BaseAgent(ABC):
    """Base class for all LangGraph agents."""

    _shared_pool = None   # Class variable for the shared AsyncConnectionPool
    _pool_context = None  # Store the pool's async context manager

    def __init__(self, name: str, is_supervisor: bool = False):
        # ... initialization logic
        ...

    async def init_async_pool(self):
        """Initialize the async connection pool and checkpointer."""
        if self.use_checkpointing:
            self.checkpointer_db = AsyncPostgresSaver(BaseAgent._shared_pool)
            await self.checkpointer_db.setup()
```

**Independent agent types, each with their own sub-agents:**

- `SupervisorAgent` (analytics workflows)

- `ThreeDPCustomAgent` (3D printing workflows)

- `ProjectBodhiAgent` (project management workflows)

- `TrialAnalysisAgent` (trial analysis workflows)

- `FormulationAssistant` (formulation workflows)

All agents inherit from `BaseAgent` and share the same `BaseAgent._shared_pool` for LangGraph checkpointing.

## The Problem: Pool Recreation Race Condition

We’re experiencing a `psycopg_pool.PoolClosed: the pool 'pool-1' is already closed` error in production. Here’s what happens:

### Sequence of Events:

1. **Database connection timeout** occurs (infrastructure issue)

2. **Health check detects** the timeout in `_ensure_pool_health()`

3. **Pool recreation** is triggered via `_recreate_pool()`

4. **Old pool is closed** and new pool is created

5. **Existing agent instances** still reference the old, closed pool

6. **Subsequent requests** fail with `PoolClosed` error

### Code Flow:

```python
# Health check detects the issue
async def _ensure_pool_health(self):
    if not BaseAgent._shared_pool or BaseAgent._shared_pool.closed:
        await self._recreate_pool()  # ← triggers recreation

# Pool recreation (PROBLEMATIC)
async def _recreate_pool(self):
    # Close the existing pool
    if BaseAgent._pool_context:
        await BaseAgent._pool_context.__aexit__(None, None, None)

    # Reset pool references
    BaseAgent._shared_pool = None
    BaseAgent._pool_context = None

    # Create a new pool
    BaseAgent._pool_context = AsyncConnectionPool(...)
    BaseAgent._shared_pool = await BaseAgent._pool_context.__aenter__()

    # PROBLEM: existing agent instances still hold
    # self.checkpointer_db = AsyncPostgresSaver(old_closed_pool)
```

## Questions for LangGraph Team

### 1. **Production Connection Pooling Best Practices**

- What are the recommended patterns for connection pooling in production LangGraph applications?

- Should all independent supervisor agents (sub-agents use the parent’s DB connection) in a multi-agent system share the same connection pool, or should each agent have its own pool?

- Are there any thread-safety considerations we should be aware of when sharing pools across multiple agent instances?

### 2. **Pool Recreation Strategy**

- When a connection pool needs to be recreated due to infrastructure issues (timeouts, network problems, etc.), what’s the recommended approach?

- How can we update existing `AsyncPostgresSaver` instances to use a new pool without breaking active agent instances?

- Is there a built-in mechanism in LangGraph to handle pool recreation gracefully?

### 3. **Agent Instance Management**

- How should we handle the scenario where one agent instance triggers pool recreation while other agent instances are actively serving requests?

- Is there a way to atomically update all existing agent checkpointer references when the underlying pool changes?

- Should we be tracking all active agent instances to update them when the pool changes?

### 4. **Error Handling Patterns**

- What’s the recommended error handling strategy when database connectivity issues occur?

- Should we implement retry logic, graceful degradation, or fail-fast approaches?

- Are there any built-in LangGraph mechanisms for handling transient database issues?

## Current Workarounds We’re Considering

### Option 1: Graceful Degradation

```python
async def restore_session(self, message: Dict[str, Any]) -> Dict[str, Any]:
    try:
        state = await self.graph.aget_state(config)
        # ... success logic
    except PoolClosed:
        # Return a structured error response instead of retrying
        return error_response
```
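
For what it’s worth, here’s a runnable sketch of what Option 1’s fail-fast path might look like end to end. The `fake_aget_state` helper and the error-response shape are illustrative assumptions, not LangGraph API:

```python
import asyncio


class PoolClosed(Exception):
    """Stand-in for psycopg_pool.PoolClosed."""


async def fake_aget_state(config):
    # Simulates graph.aget_state() hitting a closed pool
    raise PoolClosed("the pool 'pool-1' is already closed")


async def restore_session(message):
    config = {"configurable": {"thread_id": message.get("thread_id")}}
    try:
        state = await fake_aget_state(config)
        return {"status": "ok", "state": state}
    except PoolClosed:
        # Structured error instead of retrying; the caller can start a fresh session
        return {
            "status": "error",
            "code": "checkpointer_unavailable",
            "detail": "connection pool was closed; session state could not be restored",
        }


result = asyncio.run(restore_session({"thread_id": "t-1"}))
```

The point is that the caller always gets a well-formed response, so one bad checkpoint read doesn’t take down the whole request path.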

### Option 2: Agent Instance Tracking

```python
class BaseAgent:
    _active_instances = set()  # Track all agent instances

    async def _recreate_pool(self):
        # ... recreate the pool as before ...

        # Then update all tracked instances to point at the new pool
        for instance in BaseAgent._active_instances:
            instance.checkpointer_db = AsyncPostgresSaver(BaseAgent._shared_pool)
```

### Option 3: Pool Reference Wrapper

```python
class PoolReference:
    """Mutable indirection: agents hold the reference, not the pool itself."""

    def __init__(self):
        self._pool = None

    def get_pool(self):
        return self._pool

    def update_pool(self, new_pool):
        self._pool = new_pool
```
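
To make the indirection concrete, here’s a self-contained sketch of how agents might consume the wrapper (the class is restated so the snippet runs on its own, and plain strings stand in for real `AsyncConnectionPool` objects):

```python
class PoolReference:
    """Mutable indirection: agents hold the reference, not the pool itself."""

    def __init__(self):
        self._pool = None

    def get_pool(self):
        return self._pool

    def update_pool(self, new_pool):
        self._pool = new_pool


ref = PoolReference()
ref.update_pool("pool-1")

# Every agent stores the *reference*, never a direct pool object
agent_a_pool = ref
agent_b_pool = ref

# One recreation updates every holder at once; resolving per call
# via get_pool() means no agent can be left with a stale pool.
ref.update_pool("pool-2")
```

The trade-off is that every checkpointer operation must resolve `get_pool()` at call time rather than caching the result.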

## Production Environment Details

- **Deployment**: Docker containers in AWS

- **Database**: PostgreSQL with connection pooling

- **Load**: ~10 concurrent users expected

- **Agents**: 5 different agent types, each handling different workflows

- **Checkpointing**: Enabled for state persistence across sessions

- **Error Frequency**: Occurs during database connectivity issues (timeouts, network problems)

## Request for Guidance

We’d greatly appreciate guidance on:

1. **Official best practices** for production LangGraph connection pooling

2. **Recommended patterns** for handling pool recreation in multi-agent systems

3. **Built-in mechanisms** (if any) for graceful pool management

4. **Error handling strategies** for production environments

5. **Any upcoming features** or improvements related to connection pool management

Hello, we’d appreciate it if someone from LangGraph product support could share their perspective.

If we implement a `ResilientAsyncPostgresSaver` (same retry/reconnect semantics as proposed in [NEW] Implement ResilientPostgresSaver for improved error handling and connection retries by samirpatil2000 · Pull Request #4008 · langchain-ai/langgraph · GitHub) and have the saver acquire a fresh connection per operation from the existing async pool, it will address your issue. You won’t need to recreate the pool, so agents won’t hold stale references, and transient drops will be retried transparently. In short: resilient logic inside `AsyncPostgresSaver` + per-call acquisition from the shared pool = no `PoolClosed` race.
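
A minimal, runnable sketch of that retry-per-operation idea, using a stand-in async operation instead of the real `AsyncPostgresSaver`/psycopg pool (the `ResilientSaver` and `flaky_put` names are hypothetical):

```python
import asyncio


class PoolClosed(Exception):
    """Stand-in for psycopg_pool.PoolClosed."""


class ResilientSaver:
    """Sketch: retry each checkpoint operation with exponential backoff.

    In the real saver, each attempt would acquire a fresh connection
    from the shared pool, so no stale reference can survive a drop.
    """

    def __init__(self, op, retries=3, base_delay=0.01):
        self._op = op                # the underlying async operation
        self._retries = retries
        self._base_delay = base_delay

    async def call(self, *args):
        delay = self._base_delay
        for attempt in range(self._retries):
            try:
                return await self._op(*args)  # acquire + execute per call
            except PoolClosed:
                if attempt == self._retries - 1:
                    raise                      # give up after the last attempt
                await asyncio.sleep(delay)     # transient drop: back off, retry
                delay *= 2


# Demo: an operation that fails once with PoolClosed, then succeeds.
calls = {"n": 0}


async def flaky_put(value):
    calls["n"] += 1
    if calls["n"] == 1:
        raise PoolClosed("the pool 'pool-1' is already closed")
    return value


saver = ResilientSaver(flaky_put)
result = asyncio.run(saver.call("checkpoint-1"))
```

Under this scheme the shared pool is opened once and never recreated; the retry loop absorbs transient failures instead of the health check swapping pools underneath live agents.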