Help Needed: MongoDB Checkpoints Collection Growing Too Large

Hi @Nikfury

Is this expected?

Yes. LangGraph persists a checkpoint at every “super-step” of graph execution, not once per user message. In a typical agent graph, a single user turn can span several super-steps (e.g., input ingest, LLM/tool nodes, reducer/merge, finalization), so multiple checkpoints per message is expected. Additionally, each checkpoint can have multiple “writes” stored in a separate checkpoint_writes collection (interrupts, errors, scheduled tasks, per-channel writes), which increases “records per message.”
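
For a quick sanity check, you can count checkpoints per thread directly in Mongo. This is a minimal sketch with the Node.js driver, assuming the saver's default `checkpoints` collection name and its top-level `thread_id` field (the function name is just illustrative):

```typescript
import { MongoClient } from "mongodb";

async function countCheckpointsPerThread(uri: string, dbName: string) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const perThread = await client
      .db(dbName)
      .collection("checkpoints")
      .aggregate([
        // One document per checkpoint; group them by conversation thread.
        { $group: { _id: "$thread_id", checkpointCount: { $sum: 1 } } },
        { $sort: { checkpointCount: -1 } },
        { $limit: 20 },
      ])
      .toArray();
    console.table(perThread);
  } finally {
    await client.close();
  }
}
```

If the per-thread counts roughly equal (messages × super-steps per turn), growth is behaving as designed and the question becomes purely one of retention.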

Is there built-in retention?

AFAIK, not in the MongoDB checkpointer. The implementation exposes put, putWrites, getTuple, list, and deleteThread, but does not include any max-per-thread, TTL, or pruning configuration. You’ll need to implement manual cleanup or wrap/extend the saver to add TTL fields.

Best practices to manage growth

  1. Trim/summarize conversation state to shrink each checkpoint payload.

Use reducers to delete old messages or summarize history so fewer/lighter messages are saved in channel_values each step.
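
As a rough sketch (assuming the prebuilt `MessagesAnnotation` state and its built-in messages reducer; `trimHistory` and `KEEP_LAST` are names I've made up), a node placed early in the graph can drop old messages so each checkpoint stores less:

```typescript
import { RemoveMessage } from "@langchain/core/messages";
import { MessagesAnnotation } from "@langchain/langgraph";

const KEEP_LAST = 10; // tune to your cost/quality needs

// Node that drops everything except the most recent messages, so the
// channel_values persisted at each checkpoint stay small.
function trimHistory(state: typeof MessagesAnnotation.State) {
  const excess = state.messages.slice(0, -KEEP_LAST);
  if (excess.length === 0) return {};
  // Returning RemoveMessage instances asks the built-in messages reducer to
  // delete those entries from the stored state rather than append to it.
  return {
    messages: excess
      .filter((m) => m.id !== undefined)
      .map((m) => new RemoveMessage({ id: m.id as string })),
  };
}
```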

  2. Periodic pruning job (recommended): keep the latest N checkpoints per thread and delete older ones from both checkpoints and checkpoint_writes.

Checkpoints are ordered by checkpoint_id (a time-sortable UUIDv6), and the MongoDB saver itself sorts by checkpoint_id descending to fetch the latest, so you can safely prune older IDs.
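
Here is a sketch of such a pruning job with the Node.js driver, assuming the default checkpoints/checkpoint_writes collection names and their top-level thread_id/checkpoint_id fields (`pruneThread` and `keepLatest` are illustrative names):

```typescript
import { Db } from "mongodb";

// Keep the newest `keepLatest` checkpoints for a thread and delete the rest
// from both collections.
async function pruneThread(db: Db, threadId: string, keepLatest = 5) {
  const checkpoints = db.collection("checkpoints");
  const writes = db.collection("checkpoint_writes");

  // checkpoint_id is a time-sortable UUIDv6, so a descending sort gives the
  // newest checkpoints first.
  const kept = await checkpoints
    .find({ thread_id: threadId })
    .sort({ checkpoint_id: -1 })
    .limit(keepLatest)
    .project({ checkpoint_id: 1 })
    .toArray();

  if (kept.length < keepLatest) return; // nothing old enough to prune

  const oldestKeptId = kept[kept.length - 1].checkpoint_id;
  const filter = { thread_id: threadId, checkpoint_id: { $lt: oldestKeptId } };

  // Delete older writes first, then checkpoints, so writes never point at a
  // checkpoint that has already been removed.
  await writes.deleteMany(filter);
  await checkpoints.deleteMany(filter);
}
```

Run this from a cron job or scheduled function over your known thread IDs.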

  3. Optional TTL (Cosmos DB/MongoDB): viable only if a top-level Date field exists.

The current saver stores checkpoint.ts and metadata inside serialized blobs, so there’s no top-level Date field you can index for TTL out of the box. If you require TTL-based expiry, wrap/extend the saver to add an expiresAt (or createdAt) top-level field on both collections and create TTL indexes in Cosmos DB.
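
Here's a rough sketch of that wrapping approach. The subclass name, TTL length, constructor shape, and the assumption that your Cosmos DB/MongoDB tier supports TTL indexes on an arbitrary Date field are all mine, so verify them against your deployment; a similar override of putWrites would be needed to stamp checkpoint_writes documents:

```typescript
import { MongoClient, Db } from "mongodb";
import { MongoDBSaver } from "@langchain/langgraph-checkpoint-mongodb";

const TTL_DAYS = 30; // illustrative retention window

class ExpiringMongoDBSaver extends MongoDBSaver {
  private readonly ttlDb: Db;

  constructor(client: MongoClient, dbName: string) {
    super({ client, dbName });
    this.ttlDb = client.db(dbName);
  }

  // After the normal save, stamp the new checkpoint document with a
  // top-level Date so a TTL index can expire it.
  async put(...args: Parameters<MongoDBSaver["put"]>) {
    const [config, checkpoint] = args;
    const saved = await super.put(...args);
    const expiresAt = new Date(Date.now() + TTL_DAYS * 24 * 60 * 60 * 1000);
    await this.ttlDb.collection("checkpoints").updateOne(
      { thread_id: config.configurable?.thread_id, checkpoint_id: checkpoint.id },
      { $set: { expiresAt } }
    );
    return saved;
  }
}

// One-time setup: expireAfterSeconds: 0 means "expire at the expiresAt time".
async function ensureTtlIndexes(db: Db) {
  await db.collection("checkpoints").createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });
  await db.collection("checkpoint_writes").createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });
}
```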

Configuration options in @langchain/langgraph-checkpoint-mongodb

There are no built-in options to cap per-thread checkpoints or auto-prune. The only deletion helper is deleteThread(threadId). Use a scheduled cleanup job or extend the saver for TTL/retention.
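
For example, a scheduled job can simply call deleteThread for threads your application considers stale. This is only a sketch; how you decide staleness is up to you, since the saver itself has no notion of thread age (e.g., use a sessions table with a last-active timestamp):

```typescript
import { MongoClient } from "mongodb";
import { MongoDBSaver } from "@langchain/langgraph-checkpoint-mongodb";

async function cleanupStaleThreads(uri: string, dbName: string, staleThreadIds: string[]) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const saver = new MongoDBSaver({ client, dbName });
    for (const threadId of staleThreadIds) {
      // Removes all persisted checkpoint data for this thread.
      await saver.deleteThread(threadId);
    }
  } finally {
    await client.close();
  }
}
```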

Migration concerns when deleting old checkpoints

  • Deleting old checkpoints does not break resuming from the latest checkpoint (that’s what getTuple returns by default). You will, however, lose the ability to “time-travel” or replay to older steps you delete.

  • To be safe, keep at least the last few checkpoints per thread (e.g., N = 3–10), and always delete from both checkpoints and the matching checkpoint_writes documents so the two collections stay consistent. Avoid pruning a thread that has an active human-in-the-loop interrupt unless you retain the parent checkpoint for that interrupt.