Problem Description
I’m experiencing significant storage issues with the LangGraph MongoDB checkpointer. The checkpoints collection is growing far beyond expected size and consuming excessive storage.
Current Situation
- Thread IDs: 16,000 conversations
- Total Messages: 36,542 messages across all conversations
- Checkpoint Records: 282,758 records
- Storage Used: ~30 GB
- Average Checkpoints per Thread: ~17.7 checkpoints per conversation
- Average Checkpoints per Message: ~7.7 checkpoint records per message
This means that instead of roughly one checkpoint per message, the system is persisting many checkpoint records per conversation (and several per message), leading to rapid storage growth.
Environment
Dependencies:
"@langchain/core": "^0.3.77"
"@langchain/langgraph": "^0.4.9"
"@langchain/langgraph-checkpoint-mongodb": "^0.1.1"
"@langchain/openai": "^0.6.16"
Database: Azure Cosmos DB for MongoDB (RU-based account)
Note: Using Cosmos DB RU account means storage costs scale with data size, making this 30GB checkpoint collection a significant cost concern beyond just performance.
Questions
- Is this expected behavior? Should LangGraph be creating ~18 checkpoints per conversation on average?
- Checkpoint Retention: Is there a built-in mechanism to limit the number of checkpoints per thread, or do I need to implement manual cleanup?
- Best Practices: What’s the recommended approach for managing checkpoint growth in production environments with high conversation volume?
- Configuration Options: Are there any configuration parameters in @langchain/langgraph-checkpoint-mongodb to control:
  - Maximum checkpoints per thread
  - Automatic cleanup of old checkpoints
  - Checkpoint retention policies
- Migration Concerns: If I need to clean up old checkpoints, will this break existing conversation threads or affect the ability to resume conversations?
What I’ve Considered
- Manual cleanup scripts to delete old checkpoints while preserving the latest N checkpoints per thread
- Moving to a different checkpointing strategy
Any guidance on the recommended approach would be greatly appreciated. Has anyone else encountered similar scaling issues with the MongoDB checkpointer?
Additional Context
This is for a production AI chat-and-code application where users have iterative conversations with an AI agent to build full-stack applications. Each conversation can involve many turns, and we need to maintain conversation state but don’t necessarily need to keep every checkpoint indefinitely.
Thanks in advance for any help!
Hi @Nikfury
Is this expected?
Yes. LangGraph persists a checkpoint at every “super-step” of graph execution, not once per user message. In a typical agent graph, a single user turn can span several super-steps (e.g., input ingest, LLM/tool nodes, reducer/merge, finalization), so multiple checkpoints per message is expected. Additionally, each checkpoint can have multiple “writes” stored in a separate checkpoint_writes collection (interrupts, errors, scheduled tasks, per-channel writes), which increases “records per message.”
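To see these ratios in your own data, a quick aggregation over the `checkpoints` collection helps. A hedged sketch follows: the driver call is shown only in comments because it needs a live connection, and the collection/field names (`checkpoints`, `thread_id`) are assumptions from the saver’s schema to verify against your deployment; a pure-TypeScript equivalent of the `$group` stage illustrates what the pipeline computes.

```typescript
// Diagnostic sketch: count checkpoint records per thread to measure the
// "checkpoints per conversation" ratio yourself.
//
// With a live connection (field names assumed -- verify against your data):
//
//   await db.collection("checkpoints").aggregate([
//     { $group: { _id: "$thread_id", count: { $sum: 1 } } },
//     { $sort: { count: -1 } },
//   ]).toArray();
//
// Pure-TypeScript equivalent of the $group stage, over sample documents:
type CheckpointDoc = { thread_id: string; checkpoint_id: string };

function countPerThread(docs: CheckpointDoc[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const d of docs) {
    counts.set(d.thread_id, (counts.get(d.thread_id) ?? 0) + 1);
  }
  return counts;
}

const sample: CheckpointDoc[] = [
  { thread_id: "t1", checkpoint_id: "a" },
  { thread_id: "t1", checkpoint_id: "b" },
  { thread_id: "t2", checkpoint_id: "c" },
];
console.log(countPerThread(sample).get("t1")); // 2
```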
Is there built-in retention?
AFAIK, not in the MongoDB checkpointer. The implementation exposes put, putWrites, getTuple, list, and deleteThread, but does not include any max-per-thread, TTL, or pruning configuration. You’ll need to implement manual cleanup or wrap/extend the saver to add TTL fields.
Best practices to manage growth
- Trim/summarize conversation state to shrink each checkpoint payload.
Use reducers to delete old messages or summarize history so fewer/lighter messages are saved in channel_values each step.
- Periodic pruning job (recommended): keep the latest N checkpoints per thread and delete older ones from both checkpoints and checkpoint_writes.
Checkpoints are ordered by checkpoint_id (UUIDv6) which is time-sortable. The MongoDB saver itself sorts by checkpoint_id descending to fetch latest, so you can safely prune older IDs.
- Optional TTL (Cosmos DB/MongoDB): viable only if a top-level Date field exists.
The current saver stores checkpoint.ts and metadata inside serialized blobs, so there’s no top-level Date field you can index for TTL out of the box. If you require TTL-based expiry, wrap/extend the saver to add an expiresAt (or createdAt) top-level field on both collections and create TTL indexes. Note that RU-based Cosmos DB for MongoDB accounts have historically supported TTL indexes only on the _ts field, so verify that your account type supports TTL on a custom field before relying on this.
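The trim/summarize idea can be sketched as a plain reducer. This is illustrative only: the message shape and the `MAX_MESSAGES` cap are assumptions, and in a real graph you would wire such a reducer into your state channel (e.g., via `Annotation` reducers in @langchain/langgraph) and likely use `BaseMessage` from @langchain/core rather than a plain object.

```typescript
// Minimal sketch: a reducer that caps channel history at the last
// MAX_MESSAGES entries, so each checkpoint stores a bounded payload.
// The message shape is a plain object for illustration only.
type Msg = { role: string; content: string };

const MAX_MESSAGES = 20; // illustrative cap, tune for your app

function trimmingReducer(existing: Msg[], incoming: Msg[]): Msg[] {
  const merged = [...existing, ...incoming];
  // Keep only the newest MAX_MESSAGES; older turns could instead be
  // summarized into a single synthetic message before being dropped.
  return merged.slice(-MAX_MESSAGES);
}

// Example: 24 accumulated messages plus 1 incoming are trimmed to 20.
const history: Msg[] = Array.from({ length: 24 }, (_, i) => ({
  role: "user",
  content: `turn ${i}`,
}));
const next = trimmingReducer(history, [{ role: "assistant", content: "reply" }]);
console.log(next.length); // 20
```

The point of trimming at the reducer level is that every subsequent checkpoint stores the bounded history, so payload size stops growing with conversation length.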
Configuration options in @langchain/langgraph-checkpoint-mongodb
There are no built-in options to cap per-thread checkpoints or auto-prune. The only deletion helper is deleteThread(threadId). Use a scheduled cleanup job or extend the saver for TTL/retention.
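If you do extend the saver for TTL, the moving parts look roughly like this. Everything here is an assumption to verify: the `expiresAt` field and retention window are things you would add yourself in overridden `put()`/`putWrites()` (the stock saver writes neither), the index-creation calls are shown in comments because they need a live connection, and RU-based Cosmos DB accounts have historically limited TTL indexes to the `_ts` field, so confirm custom-field TTL support for your account type first.

```typescript
// Sketch of the TTL pieces for a wrapped/extended saver. The stock saver
// does NOT write a top-level `expiresAt`; you would stamp it yourself.

// Compute the expiry stamp to put on each new checkpoint document.
function expiryDate(now: Date, retentionDays: number): Date {
  return new Date(now.getTime() + retentionDays * 24 * 60 * 60 * 1000);
}

// With the Node mongodb driver (live connection required):
//
//   // expireAfterSeconds: 0 => documents expire exactly at `expiresAt`
//   await db.collection("checkpoints")
//     .createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });
//   await db.collection("checkpoint_writes")
//     .createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });
//
//   // ...and in the extended saver, stamp each document before insert:
//   doc.expiresAt = expiryDate(new Date(), 30);

console.log(expiryDate(new Date("2024-01-01T00:00:00Z"), 30).toISOString());
// "2024-01-31T00:00:00.000Z"
```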
Migration concerns when deleting old checkpoints
- Deleting old checkpoints does not break resuming from the latest checkpoint (that’s what getTuple returns by default). You will, however, lose the ability to “time-travel” or replay to older steps you delete.
- To be safe, keep at least the last few checkpoints per thread (e.g., N=3–10), and always delete matching records from both checkpoints and checkpoint_writes to keep data consistent. Avoid pruning a thread with an active human-in-the-loop interrupt unless you retain the parent checkpoint for that interrupt.
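Putting the pruning advice together, a minimal sketch: since checkpoint_ids are UUIDv6 and therefore lexicographically time-sortable, string-sorting them descending yields newest-first, and everything past the first N is safe to drop. The deleteMany calls are commented out because they need a live connection, and the collection/field names are assumptions from the saver’s schema to check against your data.

```typescript
// Per-thread prune: select every checkpoint_id older than the newest N.
// UUIDv6 ids sort lexicographically in time order, so a plain string
// sort (descending) puts the newest checkpoints first.
function selectIdsToPrune(checkpointIds: string[], keepLatest: number): string[] {
  const sorted = [...checkpointIds].sort().reverse(); // newest first
  return sorted.slice(keepLatest); // everything older than the newest N
}

// With a live MongoDB connection you would then delete the selected ids
// from BOTH collections (field names assumed -- verify before running):
//
//   const idsToPrune = selectIdsToPrune(idsForThread, 5);
//   await db.collection("checkpoints").deleteMany({
//     thread_id: threadId,
//     checkpoint_id: { $in: idsToPrune },
//   });
//   await db.collection("checkpoint_writes").deleteMany({
//     thread_id: threadId,
//     checkpoint_id: { $in: idsToPrune },
//   });

// Example with fake, sortable ids (keep the newest 3 of 5):
const ids = ["1ef-a", "1ef-c", "1ef-b", "1ef-e", "1ef-d"];
const toPrune = selectIdsToPrune(ids, 3);
console.log(toPrune); // ["1ef-b", "1ef-a"] -- the two oldest
```

Running this on a schedule (keeping both collections in sync, and skipping threads with pending interrupts) addresses the retention gap without touching the latest checkpoint that resumption depends on.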