How do you regression-test LangGraph apps after prompt/model changes?

I keep running into this problem: I change a prompt or swap a model in my LangGraph app, and I have no easy way to tell what actually changed in the output - which steps ran differently, how token usage shifted, whether a tool call appeared or disappeared.

Checkpoints handle state persistence and time-travel within a single run, but they don’t help with **comparing two runs** side by side.
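
To make that concrete, here's a toy example (trivial one-node graph, nothing app-specific): `get_state_history` walks the checkpoints of a single thread, which is great for replay and time-travel, but it never lines up two separate runs against each other.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver


class State(TypedDict):
    text: str


def shout(state: State) -> dict:
    # trivial node so the example runs without an LLM
    return {"text": state["text"].upper()}


builder = StateGraph(State)
builder.add_node("shout", shout)
builder.add_edge(START, "shout")
builder.add_edge("shout", END)
app = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "run-1"}}
app.invoke({"text": "hello"}, config)

# Snapshots of this one thread, newest first - replay/time-travel within a run,
# but no way to say "how does run-1 differ from run-2 structurally?"
for snapshot in app.get_state_history(config):
    print(snapshot.next, snapshot.values)
```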

Right now I end up eyeballing outputs or writing throwaway diff scripts (rough shape sketched after the list below). Curious what others do:

1. Do you have a workflow for “did my graph break after this change?”

2. Do you compare runs structurally (steps, tokens, cost) or just check final output?

3. Would a lightweight tool for recording runs and diffing them be useful, or does LangSmith already cover this well enough for you?
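
For reference, here's roughly the shape of those throwaway scripts - a sketch, not a recommendation. It assumes a compiled graph `app`, and `record_run` / `diff_runs` are names I made up:

```python
import json
from difflib import unified_diff


def record_run(app, inputs: dict, config: dict) -> list[dict]:
    """Capture one {node_name: state_delta} record per step of a run."""
    return list(app.stream(inputs, config, stream_mode="updates"))


def diff_runs(before: list[dict], after: list[dict]) -> str:
    """Cheap structural diff: serialise both traces and line-diff them."""
    def dump(run: list[dict]) -> list[str]:
        return json.dumps(run, indent=2, default=str).splitlines()

    return "\n".join(
        unified_diff(dump(before), dump(after),
                     fromfile="before", tofile="after", lineterm="")
    )


# baseline  = record_run(app, {"messages": [("user", "hi")]}, {"configurable": {"thread_id": "a"}})
# candidate = record_run(app, {"messages": [("user", "hi")]}, {"configurable": {"thread_id": "b"}})  # after the change
# print(diff_runs(baseline, candidate))
```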

I’ve been experimenting with a small open-source lib ([work-ledger](https://github.com/metawake/work-ledger) - an agentic monitoring and benchmarking tool that plugs into LangChain, CrewAI, LangGraph, and PydanticAI) that wraps a compiled graph and records runs as structured artifacts you can diff - but I’m more interested in hearing how others approach this problem before building further.
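
To clarify what I mean by "wraps a compiled graph and records runs as structured artifacts": here's a minimal sketch of the idea. This is *not* work-ledger's actual API - every name below is invented for illustration, and it assumes a compiled LangGraph `app`.

```python
import json
import time
import uuid
from pathlib import Path


class RunRecorder:
    """Write one JSON artifact per graph run so runs can be diffed later."""

    def __init__(self, app, artifact_dir: str = "runs"):
        self.app = app
        self.dir = Path(artifact_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def record(self, inputs: dict, config: dict | None = None) -> Path:
        started = time.time()
        # stream_mode="updates" yields one {node_name: state_delta} dict per step
        steps = list(self.app.stream(inputs, config, stream_mode="updates"))
        artifact = {
            "run_id": str(uuid.uuid4()),
            "inputs": inputs,
            "steps": steps,
            "duration_s": round(time.time() - started, 3),
        }
        path = self.dir / f"{artifact['run_id']}.json"
        path.write_text(json.dumps(artifact, indent=2, default=str))
        return path
```

Token counts and cost would presumably come from callbacks or the usage metadata on the returned messages; I've left that out to keep the sketch short.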