Testing multi-step chains — inconsistent results?

Hi all,

I’ve been experimenting with multi-step chains in LangChain, but my outputs are sometimes inconsistent when I chain multiple LLM calls. A colleague and I were discussing ways to simulate or debug these flows safely, and they mentioned delta as a way to visualize step-by-step execution in a controlled environment.
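For context, here’s a rough, library-agnostic sketch of what my chains look like. `mock_llm` is just a placeholder I wrote for this post (not a LangChain API), standing in for the real model call so the structure is clear: each step feeds its output into the next prompt, so any nondeterminism in one step compounds downstream.

```python
import random

def mock_llm(prompt: str, temperature: float = 0.7) -> str:
    # Placeholder for a real LLM call. Deterministic at temperature 0,
    # nondeterministic otherwise (simulated here with a random suffix).
    if temperature == 0:
        return f"summary({prompt})"
    return f"summary({prompt})#{random.randint(0, 9)}"

def chain(text: str, temperature: float = 0.7) -> str:
    # Step 1: summarize the input.
    step1 = mock_llm(f"Summarize: {text}", temperature)
    # Step 2: the second prompt is built from step 1's output,
    # so variation in step 1 propagates into step 2.
    step2 = mock_llm(f"Extract the key point from: {step1}", temperature)
    return step2

# At temperature 0 repeated runs match; at the default they can drift.
print(chain("my doc", temperature=0))
```

Pinning temperature to 0 helps in my real chains too, but I still see drift, which is why I’m asking about debugging approaches.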

Has anyone run into similar issues with chained prompts or context handling? What approaches do you use to keep outputs consistent across complex workflows?