How do you regression-test LangGraph apps after prompt/model changes?

I keep running into this problem: I change a prompt or swap a model in my LangGraph app, and I have no easy way to tell what actually changed in the output - which steps ran differently, how token usage shifted, whether a tool call appeared or disappeared.

Checkpoints handle state persistence and time-travel within a single run, but they don’t help with **comparing two runs** side by side.
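
To make that concrete, here's a toy example (trivial one-node graph, nothing app-specific): `get_state_history` walks the checkpoints of a single thread, which is great for replay and time-travel, but it never lines up two separate runs against each other.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver


class State(TypedDict):
    text: str


def shout(state: State) -> dict:
    # trivial node so the example runs without an LLM
    return {"text": state["text"].upper()}


builder = StateGraph(State)
builder.add_node("shout", shout)
builder.add_edge(START, "shout")
builder.add_edge("shout", END)
app = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "run-1"}}
app.invoke({"text": "hello"}, config)

# Snapshots of this one thread, newest first - replay/time-travel within a run,
# but no way to say "how does run-1 differ from run-2 structurally?"
for snapshot in app.get_state_history(config):
    print(snapshot.next, snapshot.values)
```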

Right now I end up eyeballing outputs or writing throwaway diff scripts (rough shape sketched after the list below). Curious what others do:

1. Do you have a workflow for “did my graph break after this change?”

2. Do you compare runs structurally (steps, tokens, cost) or just check final output?

3. Would a lightweight tool for recording runs and diffing them be useful, or does LangSmith already cover this well enough for you?
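
For reference, here's roughly the shape of those throwaway scripts - a sketch, not a recommendation. It assumes a compiled graph `app`, and `record_run` / `diff_runs` are names I made up:

```python
import json
from difflib import unified_diff


def record_run(app, inputs: dict, config: dict) -> list[dict]:
    """Capture one {node_name: state_delta} record per step of a run."""
    return list(app.stream(inputs, config, stream_mode="updates"))


def diff_runs(before: list[dict], after: list[dict]) -> str:
    """Cheap structural diff: serialise both traces and line-diff them."""
    def dump(run: list[dict]) -> list[str]:
        return json.dumps(run, indent=2, default=str).splitlines()

    return "\n".join(
        unified_diff(dump(before), dump(after),
                     fromfile="before", tofile="after", lineterm="")
    )


# baseline  = record_run(app, {"messages": [("user", "hi")]}, {"configurable": {"thread_id": "a"}})
# candidate = record_run(app, {"messages": [("user", "hi")]}, {"configurable": {"thread_id": "b"}})  # after the change
# print(diff_runs(baseline, candidate))
```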

I’ve been experimenting with a small open-source lib ([work-ledger](https://github.com/metawake/work-ledger) - an agentic monitoring and benchmarking tool that plugs into LangChain, CrewAI, LangGraph, and PydanticAI) that wraps a compiled graph and records runs as structured artifacts you can diff - but I’m more interested in hearing how others approach this problem before building further.
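
To clarify what I mean by "wraps a compiled graph and records runs as structured artifacts": here's a minimal sketch of the idea. This is *not* work-ledger's actual API - every name below is invented for illustration, and it assumes a compiled LangGraph `app`.

```python
import json
import time
import uuid
from pathlib import Path


class RunRecorder:
    """Write one JSON artifact per graph run so runs can be diffed later."""

    def __init__(self, app, artifact_dir: str = "runs"):
        self.app = app
        self.dir = Path(artifact_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def record(self, inputs: dict, config: dict | None = None) -> Path:
        started = time.time()
        # stream_mode="updates" yields one {node_name: state_delta} dict per step
        steps = list(self.app.stream(inputs, config, stream_mode="updates"))
        artifact = {
            "run_id": str(uuid.uuid4()),
            "inputs": inputs,
            "steps": steps,
            "duration_s": round(time.time() - started, 3),
        }
        path = self.dir / f"{artifact['run_id']}.json"
        path.write_text(json.dumps(artifact, indent=2, default=str))
        return path
```

Token counts and cost would presumably come from callbacks or the usage metadata on the returned messages; I've left that out to keep the sketch short.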