Problem Description
I’m experiencing significant storage issues with the LangGraph MongoDB checkpointer. The checkpoints collection is growing far beyond expected size and consuming excessive storage.
Current Situation
- Thread IDs: 16,000 conversations
- Total Messages: 36,542 messages across all conversations
- Checkpoint Records: 282,758 records
- Storage Used: ~30 GB
- Average Checkpoints per Thread: ~17.7 checkpoints per conversation
- Average Checkpoints per Message: ~7.7 checkpoint records per message
This means that instead of roughly one checkpoint per message, the system is persisting many checkpoint records per conversation (and several per message), leading to rapid storage growth.
Environment
Dependencies:
"@langchain/core": "^0.3.77"
"@langchain/langgraph": "^0.4.9"
"@langchain/langgraph-checkpoint-mongodb": "^0.1.1"
"@langchain/openai": "^0.6.16"
Database: Azure Cosmos DB for MongoDB (RU-based account)
Note: Using Cosmos DB RU account means storage costs scale with data size, making this 30GB checkpoint collection a significant cost concern beyond just performance.
Questions
- Is this expected behavior? Should LangGraph be creating ~18 checkpoints per conversation on average?
- Checkpoint Retention: Is there a built-in mechanism to limit the number of checkpoints per thread, or do I need to implement manual cleanup?
- Best Practices: What’s the recommended approach for managing checkpoint growth in production environments with high conversation volume?
- Configuration Options: Are there any configuration parameters in @langchain/langgraph-checkpoint-mongodb to control:
  - Maximum checkpoints per thread
  - Automatic cleanup of old checkpoints
  - Checkpoint retention policies
- Migration Concerns: If I need to clean up old checkpoints, will this break existing conversation threads or affect the ability to resume conversations?
What I’ve Considered
- Manual cleanup scripts to delete old checkpoints while preserving the latest N checkpoints per thread
- Moving to a different checkpointing strategy
Any guidance on the recommended approach would be greatly appreciated. Has anyone else encountered similar scaling issues with the MongoDB checkpointer?
Additional Context
This is for a production AI chat-and-code application where users have iterative conversations with an AI agent to build full-stack applications. Each conversation can involve many turns, and we need to maintain conversation state but don’t necessarily need to keep every checkpoint indefinitely.
Thanks in advance for any help!
Hi @Nikfury
Is this expected?
Yes. LangGraph persists a checkpoint at every “super-step” of graph execution, not once per user message. In a typical agent graph, a single user turn can span several super-steps (e.g., input ingest, LLM/tool nodes, reducer/merge, finalization), so multiple checkpoints per message is expected. Additionally, each checkpoint can have multiple “writes” stored in a separate checkpoint_writes collection (interrupts, errors, scheduled tasks, per-channel writes), which increases “records per message.”
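To see these ratios in your own data, a quick aggregation over the `checkpoints` collection helps. A hedged sketch follows: the driver call is shown only in comments because it needs a live connection, and the collection/field names (`checkpoints`, `thread_id`) are assumptions from the saver’s schema to verify against your deployment; a pure-TypeScript equivalent of the `$group` stage illustrates what the pipeline computes.

```typescript
// Diagnostic sketch: count checkpoint records per thread to measure the
// "checkpoints per conversation" ratio yourself.
//
// With a live connection (field names assumed -- verify against your data):
//
//   await db.collection("checkpoints").aggregate([
//     { $group: { _id: "$thread_id", count: { $sum: 1 } } },
//     { $sort: { count: -1 } },
//   ]).toArray();
//
// Pure-TypeScript equivalent of the $group stage, over sample documents:
type CheckpointDoc = { thread_id: string; checkpoint_id: string };

function countPerThread(docs: CheckpointDoc[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const d of docs) {
    counts.set(d.thread_id, (counts.get(d.thread_id) ?? 0) + 1);
  }
  return counts;
}

const sample: CheckpointDoc[] = [
  { thread_id: "t1", checkpoint_id: "a" },
  { thread_id: "t1", checkpoint_id: "b" },
  { thread_id: "t2", checkpoint_id: "c" },
];
console.log(countPerThread(sample).get("t1")); // 2
```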
Is there built-in retention?
AFAIK, not in the MongoDB checkpointer. The implementation exposes put, putWrites, getTuple, list, and deleteThread, but does not include any max-per-thread, TTL, or pruning configuration. You’ll need to implement manual cleanup or wrap/extend the saver to add TTL fields.
Best practices to manage growth
- Trim/summarize conversation state to shrink each checkpoint payload.
Use reducers to delete old messages or summarize history so fewer/lighter messages are saved in channel_values each step.
- Periodic pruning job (recommended): keep the latest N checkpoints per thread and delete older ones from both checkpoints and checkpoint_writes.
Checkpoints are ordered by checkpoint_id (UUIDv6) which is time-sortable. The MongoDB saver itself sorts by checkpoint_id descending to fetch latest, so you can safely prune older IDs.
- Optional TTL (Cosmos DB/MongoDB): viable only if a top-level Date field exists.
The current saver stores checkpoint.ts and metadata inside serialized blobs, so there’s no top-level Date field you can index for TTL out of the box. If you require TTL-based expiry, wrap/extend the saver to add an expiresAt (or createdAt) top-level field on both collections and create TTL indexes. Note that RU-based Cosmos DB for MongoDB accounts have historically supported TTL indexes only on the _ts field, so verify that your account type supports TTL on a custom field before relying on this.
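The trim/summarize idea can be sketched as a plain reducer. This is illustrative only: the message shape and the `MAX_MESSAGES` cap are assumptions, and in a real graph you would wire such a reducer into your state channel (e.g., via `Annotation` reducers in @langchain/langgraph) and likely use `BaseMessage` from @langchain/core rather than a plain object.

```typescript
// Minimal sketch: a reducer that caps channel history at the last
// MAX_MESSAGES entries, so each checkpoint stores a bounded payload.
// The message shape is a plain object for illustration only.
type Msg = { role: string; content: string };

const MAX_MESSAGES = 20; // illustrative cap, tune for your app

function trimmingReducer(existing: Msg[], incoming: Msg[]): Msg[] {
  const merged = [...existing, ...incoming];
  // Keep only the newest MAX_MESSAGES; older turns could instead be
  // summarized into a single synthetic message before being dropped.
  return merged.slice(-MAX_MESSAGES);
}

// Example: 24 accumulated messages plus 1 incoming are trimmed to 20.
const history: Msg[] = Array.from({ length: 24 }, (_, i) => ({
  role: "user",
  content: `turn ${i}`,
}));
const next = trimmingReducer(history, [{ role: "assistant", content: "reply" }]);
console.log(next.length); // 20
```

The point of trimming at the reducer level is that every subsequent checkpoint stores the bounded history, so payload size stops growing with conversation length.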
Configuration options in @langchain/langgraph-checkpoint-mongodb
There are no built-in options to cap per-thread checkpoints or auto-prune. The only deletion helper is deleteThread(threadId). Use a scheduled cleanup job or extend the saver for TTL/retention.
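If you do extend the saver for TTL, the moving parts look roughly like this. Everything here is an assumption to verify: the `expiresAt` field and retention window are things you would add yourself in overridden `put()`/`putWrites()` (the stock saver writes neither), the index-creation calls are shown in comments because they need a live connection, and RU-based Cosmos DB accounts have historically limited TTL indexes to the `_ts` field, so confirm custom-field TTL support for your account type first.

```typescript
// Sketch of the TTL pieces for a wrapped/extended saver. The stock saver
// does NOT write a top-level `expiresAt`; you would stamp it yourself.

// Compute the expiry stamp to put on each new checkpoint document.
function expiryDate(now: Date, retentionDays: number): Date {
  return new Date(now.getTime() + retentionDays * 24 * 60 * 60 * 1000);
}

// With the Node mongodb driver (live connection required):
//
//   // expireAfterSeconds: 0 => documents expire exactly at `expiresAt`
//   await db.collection("checkpoints")
//     .createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });
//   await db.collection("checkpoint_writes")
//     .createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });
//
//   // ...and in the extended saver, stamp each document before insert:
//   doc.expiresAt = expiryDate(new Date(), 30);

console.log(expiryDate(new Date("2024-01-01T00:00:00Z"), 30).toISOString());
// "2024-01-31T00:00:00.000Z"
```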
Migration concerns when deleting old checkpoints
- Deleting old checkpoints does not break resuming from the latest checkpoint (that’s what getTuple returns by default). You will, however, lose the ability to “time-travel” or replay to older steps you delete.
- To be safe, keep at least the last few checkpoints per thread (e.g., N=3–10), and always delete matching records from both checkpoints and checkpoint_writes to keep data consistent. Avoid pruning a thread with an active human-in-the-loop interrupt unless you retain the parent checkpoint for that interrupt.
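Putting the pruning advice together, a minimal sketch: since checkpoint_ids are UUIDv6 and therefore lexicographically time-sortable, string-sorting them descending yields newest-first, and everything past the first N is safe to drop. The deleteMany calls are commented out because they need a live connection, and the collection/field names are assumptions from the saver’s schema to check against your data.

```typescript
// Per-thread prune: select every checkpoint_id older than the newest N.
// UUIDv6 ids sort lexicographically in time order, so a plain string
// sort (descending) puts the newest checkpoints first.
function selectIdsToPrune(checkpointIds: string[], keepLatest: number): string[] {
  const sorted = [...checkpointIds].sort().reverse(); // newest first
  return sorted.slice(keepLatest); // everything older than the newest N
}

// With a live MongoDB connection you would then delete the selected ids
// from BOTH collections (field names assumed -- verify before running):
//
//   const idsToPrune = selectIdsToPrune(idsForThread, 5);
//   await db.collection("checkpoints").deleteMany({
//     thread_id: threadId,
//     checkpoint_id: { $in: idsToPrune },
//   });
//   await db.collection("checkpoint_writes").deleteMany({
//     thread_id: threadId,
//     checkpoint_id: { $in: idsToPrune },
//   });

// Example with fake, sortable ids (keep the newest 3 of 5):
const ids = ["1ef-a", "1ef-c", "1ef-b", "1ef-e", "1ef-d"];
const toPrune = selectIdsToPrune(ids, 3);
console.log(toPrune); // ["1ef-b", "1ef-a"] -- the two oldest
```

Running this on a schedule (keeping both collections in sync, and skipping threads with pending interrupts) addresses the retention gap without touching the latest checkpoint that resumption depends on.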