## Background
We’re building a production multi-agent system using LangGraph with PostgreSQL checkpointing and have encountered a critical issue with connection pool management that we’d like to discuss with the LangGraph team.
## Our Architecture
We have a multi-agent system with the following setup:
```python
from abc import ABC

from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver


class BaseAgent(ABC):
    """Base class for all LangGraph agents."""

    _shared_pool = None   # Class variable for the shared pool
    _pool_context = None  # Stores the pool's context manager

    # __init__ cannot be async; async setup happens in init_async_pool()
    def __init__(self, name: str, is_supervisor: bool = False):
        # … initialization logic
        ...

    async def init_async_pool(self):
        """Initialize async connection pool and checkpointer."""
        if use_checkpointing:
            self.checkpointer_db = AsyncPostgresSaver(BaseAgent._shared_pool)
            await self.checkpointer_db.setup()
```
**Independent agent types, each with its own sub-agents:**
- `SupervisorAgent` (analytics workflows)
- `ThreeDPCustomAgent` (3D printing workflows)
- `ProjectBodhiAgent` (project management workflows)
- `TrialAnalysisAgent` (trial analysis workflows)
- `FormulationAssistant` (formulation workflows)
All agents inherit from `BaseAgent` and share the same `BaseAgent._shared_pool` for LangGraph checkpointing.
## The Problem: Pool Recreation Race Condition
We’re experiencing a `psycopg_pool.PoolClosed: the pool 'pool-1' is already closed` error in production. Here’s what happens:
### Sequence of Events:
1. **Database connection timeout** occurs (infrastructure issue)
2. **Health check detects** the timeout in `_ensure_pool_health()`
3. **Pool recreation** is triggered via `_recreate_pool()`
4. **Old pool is closed** and new pool is created
5. **Existing agent instances** still reference the old, closed pool
6. **Subsequent requests** fail with `PoolClosed` error
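The sequence above can be reproduced without a database. The following is a minimal sketch in which `FakePool`, `FakeSaver`, `Registry`, and `PoolClosedError` are hypothetical stand-ins (not psycopg_pool or LangGraph classes). It shows why step 5 leads to step 6: the saver captures the pool object once at construction, so rebinding the shared class attribute afterwards does not reach it.

```python
class PoolClosedError(Exception):
    """Stand-in for psycopg_pool.PoolClosed."""


class FakePool:
    """Hypothetical stand-in for a connection pool."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True

    def query(self):
        if self.closed:
            raise PoolClosedError(f"the pool {self.name!r} is already closed")
        return f"ok from {self.name}"


class FakeSaver:
    """Stand-in for a checkpointer that keeps the pool it was given."""

    def __init__(self, pool):
        self.pool = pool  # reference captured once, never refreshed


class Registry:
    """Plays the role of the class-level shared pool attribute."""

    shared_pool = FakePool("pool-1")


saver = FakeSaver(Registry.shared_pool)  # agent created before the outage

# Recreation: close the old pool, then rebind the shared attribute
Registry.shared_pool.close()
Registry.shared_pool = FakePool("pool-2")


def stale_saver_fails():
    """Return True if the pre-existing saver still points at the closed pool."""
    try:
        saver.pool.query()
    except PoolClosedError:
        return True
    return False
```

A saver built after the rebind works, while the earlier one keeps failing until it is replaced or re-pointed.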
### Code Flow:
```python
# Health check detects issue
async def _ensure_pool_health(self):
    # psycopg_pool exposes `closed` as an attribute, not an `is_closed()` method
    if not BaseAgent._shared_pool or BaseAgent._shared_pool.closed:
        await self._recreate_pool()  # ← Triggers recreation

# Pool recreation (PROBLEMATIC)
async def _recreate_pool(self):
    # Close existing pool
    if BaseAgent._pool_context:
        await BaseAgent._pool_context.__aexit__(None, None, None)
    # Reset pool references
    BaseAgent._shared_pool = None
    BaseAgent._pool_context = None
    # Create new pool
    BaseAgent._pool_context = AsyncConnectionPool(...)
    BaseAgent._shared_pool = await BaseAgent._pool_context.__aenter__()
    # PROBLEM: Existing agent instances still have:
    # self.checkpointer_db = AsyncPostgresSaver(old_closed_pool)
```
## Questions for LangGraph Team
### 1. **Production Connection Pooling Best Practices**
- What are the recommended patterns for connection pooling in production LangGraph applications?
- Should all independent supervisor agents in a multi-agent system share the same connection pool (their sub-agents already use the parent's DB connection), or should each agent have its own pool?
- Are there any thread-safety considerations we should be aware of when sharing pools across multiple agent instances?
### 2. **Pool Recreation Strategy**
- When a connection pool needs to be recreated due to infrastructure issues (timeouts, network problems, etc.), what’s the recommended approach?
- How can we update existing `AsyncPostgresSaver` instances to use a new pool without breaking active agent instances?
- Is there a built-in mechanism in LangGraph to handle pool recreation gracefully?
### 3. **Agent Instance Management**
- How should we handle the scenario where one agent instance triggers pool recreation while other agent instances are actively serving requests?
- Is there a way to atomically update all existing agent checkpointer references when the underlying pool changes?
- Should we be tracking all active agent instances to update them when the pool changes?
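To make the concurrency concern concrete, here is a minimal sketch of one way to keep several in-flight requests from each triggering their own recreation: an `asyncio.Lock` with a re-check inside the lock. `FakePool` and `PoolManager` are hypothetical names used only for illustration, and the `asyncio.sleep(0)` stands in for real asynchronous pool setup. This is not a LangGraph mechanism.

```python
import asyncio


class FakePool:
    """Hypothetical stand-in for an async connection pool."""

    created = 0  # counts how many pools were ever constructed

    def __init__(self):
        FakePool.created += 1
        self.closed = False


class PoolManager:
    """Serializes pool recreation so concurrent callers trigger it once."""

    def __init__(self):
        self._pool = FakePool()
        self._lock = asyncio.Lock()

    async def get_healthy_pool(self):
        if not self._pool.closed:  # fast path, no lock needed
            return self._pool
        async with self._lock:
            # Re-check inside the lock: another task may have already
            # replaced the pool while this one was waiting.
            if self._pool.closed:
                await asyncio.sleep(0)  # placeholder for real async setup
                self._pool = FakePool()
            return self._pool


async def demo():
    FakePool.created = 0
    manager = PoolManager()
    manager._pool.closed = True  # simulate an outage
    pools = await asyncio.gather(
        *(manager.get_healthy_pool() for _ in range(5))
    )
    # (pools constructed in total, distinct pools handed out)
    return FakePool.created, len({id(p) for p in pools})
```

Five concurrent callers see a closed pool, but only one of them rebuilds it; the rest pick up the replacement after waiting on the lock.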
### 4. **Error Handling Patterns**
- What’s the recommended error handling strategy when database connectivity issues occur?
- Should we implement retry logic, graceful degradation, or fail-fast approaches?
- Are there any built-in LangGraph mechanisms for handling transient database issues?
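For reference, the kind of retry logic we have in mind looks roughly like the sketch below: a bounded number of attempts with exponential backoff, re-raising once the attempts are exhausted. `TransientDBError` and `with_retry` are hypothetical names; in a real system the `except` clause would list whichever driver exceptions are considered retryable, which is an assumption we have not validated.

```python
import asyncio


class TransientDBError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. timeouts)."""


async def with_retry(operation, attempts=3, base_delay=0.01):
    """Await `operation()` and retry transient failures with backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except TransientDBError:
            if attempt == attempts:
                raise  # give up after the final attempt
            # Delay doubles each time: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def demo():
    calls = {"n": 0}

    async def flaky():
        # Fails twice, then succeeds on the third call
        calls["n"] += 1
        if calls["n"] < 3:
            raise TransientDBError("connection timeout")
        return "state restored"

    result = await with_retry(flaky)
    return result, calls["n"]
```

Errors that are not marked transient propagate immediately, which keeps the fail-fast behaviour for genuine bugs.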
## Current Workarounds We’re Considering
### Option 1: Graceful Degradation
```python
from typing import Any, Dict

from psycopg_pool import PoolClosed


async def restore_session(self, message: Dict[str, Any]) -> Dict[str, Any]:
    try:
        state = await self.graph.aget_state(config)
        # … success logic
    except PoolClosed:
        # Return structured error response instead of retrying
        return error_response
```
### Option 2: Agent Instance Tracking
```python
class BaseAgent:
    _active_instances = set()  # Track all agent instances

    async def _recreate_pool(self):
        # Recreate pool
        # Update all tracked instances
        for instance in BaseAgent._active_instances:
            instance.checkpointer_db = AsyncPostgresSaver(BaseAgent._shared_pool)
```
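One concern with a plain `set` is that it holds strong references, so tracked agents would never be garbage-collected. A possible variation, sketched below with a hypothetical `TrackedAgent` class and strings standing in for pools, uses `weakref.WeakSet` so that discarded instances drop out of the registry on their own.

```python
import gc
import weakref


class TrackedAgent:
    """Hypothetical agent that registers itself for later pool updates."""

    _active_instances = weakref.WeakSet()

    def __init__(self, pool):
        self.pool = pool
        TrackedAgent._active_instances.add(self)

    @classmethod
    def rebind_all(cls, new_pool):
        # Snapshot the set so the loop works on a stable list of live instances
        for instance in list(cls._active_instances):
            instance.pool = new_pool


a = TrackedAgent("pool-1")
b = TrackedAgent("pool-1")
del b          # this agent is no longer used anywhere
gc.collect()   # make collection explicit for the example
TrackedAgent.rebind_all("pool-2")
```

After `rebind_all`, only the surviving instance remains in the registry, and it points at the new pool.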
### Option 3: Pool Reference Wrapper
```python
class PoolReference:
    def __init__(self):
        self._pool = None

    def get_pool(self):
        return self._pool

    def update_pool(self, new_pool):
        self._pool = new_pool
```
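The intent of the wrapper is that agents hold the reference object and resolve the pool at the moment of use. The sketch below illustrates that with a hypothetical `Agent` class and strings standing in for pools; the optional constructor argument is added only for brevity. Note the assumption: this only helps if the checkpointer is built from `get_pool()` at use time, since a saver constructed earlier with the old pool would still hold it.

```python
class PoolReference:
    """Indirection layer: holders keep this object, not the pool itself."""

    def __init__(self, pool=None):
        self._pool = pool

    def get_pool(self):
        return self._pool

    def update_pool(self, new_pool):
        self._pool = new_pool


class Agent:
    """Hypothetical agent that resolves the pool on every use."""

    def __init__(self, pool_ref):
        self._pool_ref = pool_ref

    def current_pool(self):
        return self._pool_ref.get_pool()


shared_ref = PoolReference("pool-1")
agents = [Agent(shared_ref) for _ in range(3)]
shared_ref.update_pool("pool-2")  # one swap, visible to every holder
```

Compared with instance tracking, nothing needs to enumerate the agents: the swap happens in one place.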
## Production Environment Details
- **Deployment**: Docker containers in AWS
- **Database**: PostgreSQL with connection pooling
- **Load**: ~10 concurrent users expected
- **Agents**: 5 different agent types, each handling different workflows
- **Checkpointing**: Enabled for state persistence across sessions
- **Error Frequency**: Occurs during database connectivity issues (timeouts, network problems)
## Request for Guidance
We’d greatly appreciate guidance on:
1. **Official best practices** for production LangGraph connection pooling
2. **Recommended patterns** for handling pool recreation in multi-agent systems
3. **Built-in mechanisms** (if any) for graceful pool management
4. **Error handling strategies** for production environments
5. **Any upcoming features** or improvements related to connection pool management