I built something that I think is relevant for anyone working with agents, tools, and multi-step LLM workflows.
It’s called Merge or Die:
https://www.trajectly.dev/merge-or-die
The idea is to make a real problem more visible: many agent workflows look fine at the final-answer level while the actual trajectory is broken.
Examples:
- tool calls happen in the wrong order
- a forbidden tool gets used
- the agent skips a required step
- behavior regresses even though the output still sounds reasonable
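To make this concrete, here is a minimal sketch of what trajectory-level checks look like in plain Python. The tool names (`search`, `summarize`, `delete_records`, `cite_sources`) and the rules are hypothetical examples, not part of Trajectly's API; the point is that these assertions run on the sequence of tool calls, not on the final answer.

```python
# Hypothetical example rules for an agent trajectory (ordered list of tool names).
FORBIDDEN = {"delete_records"}             # tools the agent must never call
REQUIRED_ORDER = ("search", "summarize")   # "search" must come before "summarize"
REQUIRED_STEPS = {"cite_sources"}          # steps that must appear at least once

def check_trajectory(tool_calls):
    """Return a list of violations found in an ordered list of tool names."""
    violations = []
    used = set(tool_calls)
    for tool in used & FORBIDDEN:
        violations.append(f"forbidden tool used: {tool}")
    for step in REQUIRED_STEPS - used:
        violations.append(f"required step skipped: {step}")
    first, second = REQUIRED_ORDER
    if second in used and (
        first not in used or tool_calls.index(first) > tool_calls.index(second)
    ):
        violations.append(f"out of order: {second} before {first}")
    return violations
```

A final answer can read perfectly well while `check_trajectory(["summarize", "search", "cite_sources"])` still reports an ordering violation; that gap is exactly what output-only evaluation misses.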
So I built a GitHub-native challenge around Trajectly to show this in a more concrete way.
When an agent fails, it doesn’t just say “failed.” It shows:
- the exact witness step where it broke
- which behavioral contract was violated
- how to reproduce the run
- a minimized failing trace
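For readers unfamiliar with these terms, here is an illustrative sketch (not Trajectly's actual internals) of the two ideas: the witness step is the first point in the trace where a contract fails, and minimization greedily drops steps that aren't needed to reproduce the failure. A `contract` here is just a predicate that returns `True` when a trace is acceptable.

```python
def witness_step(trace, contract):
    """Index of the first step at which the contract is violated, or None."""
    for i in range(1, len(trace) + 1):
        if not contract(trace[:i]):
            return i - 1
    return None

def minimize(trace, contract):
    """Greedily drop steps while the smaller trace still fails the contract."""
    trace = list(trace)
    i = 0
    while i < len(trace):
        candidate = trace[:i] + trace[i + 1:]
        if not contract(candidate):   # still failing without this step?
            trace = candidate         # keep the smaller reproduction
        else:
            i += 1
    return trace
```

With a toy contract like `lambda t: "rm -rf" not in t`, a four-step failing trace shrinks to just the single offending step, which is far easier to file as a bug report than the full run.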
I wanted to make agent testing feel less abstract and more developer-native, especially for people building with frameworks like LangChain, where multi-step behavior matters a lot more than the final text alone.
Would genuinely love feedback from people here building agent systems:
How are you testing trajectory-level behavior today, beyond just checking the final response?