Merge or Die: a GitHub game for testing AI agents

I built something that I think is relevant for anyone working with agents, tools, and multi-step LLM workflows.

It’s called Merge or Die:
https://www.trajectly.dev/merge-or-die

The idea is to make a real problem more visible:

A lot of agent workflows can look fine at the final-answer level while the actual trajectory is broken.

Examples:

  • tool calls happen in the wrong order

  • a forbidden tool gets used

  • the agent skips a required step

  • behavior regresses even though the output still sounds reasonable
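To make the idea concrete, here's a minimal sketch (not the Trajectly API; the tool names and rules are made up) of what a trajectory-level check looks like: you assert properties over the ordered list of tool calls, independent of the final answer.

```python
# Hypothetical sketch: checking a recorded trajectory against simple
# behavioral rules. Tool names and rules are invented for illustration.

FORBIDDEN = {"delete_repo"}  # tools the agent must never call
REQUIRED_ORDER = ["fetch_issue", "run_tests", "open_pr"]  # must occur in this order

def check_trajectory(tool_calls):
    """Return (ok, reason). tool_calls is the ordered list of tool names."""
    for i, name in enumerate(tool_calls):
        if name in FORBIDDEN:
            return False, f"step {i}: forbidden tool {name!r}"
    # required steps must all appear, in order (other calls may interleave)
    pos = 0
    for name in tool_calls:
        if pos < len(REQUIRED_ORDER) and name == REQUIRED_ORDER[pos]:
            pos += 1
    if pos < len(REQUIRED_ORDER):
        return False, f"missing or out-of-order step {REQUIRED_ORDER[pos]!r}"
    return True, "ok"

print(check_trajectory(["fetch_issue", "run_tests", "open_pr"]))  # passes
print(check_trajectory(["fetch_issue", "open_pr", "run_tests"]))  # order broken
```

Note that the second call fails even if the agent's final text output looks perfectly reasonable — that's exactly the gap answer-level evals miss.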

So I built a GitHub-native challenge around Trajectly to show this in a more concrete way.

When an agent fails, it doesn’t just say “failed.” It shows:

  • the exact witness step where it broke

  • which behavioral contract was violated

  • how to reproduce the run

  • and a minimized failing trace
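The "minimized failing trace" part is the same idea as test-case minimization: greedily drop steps while the trace still violates the contract, so the witness you debug is as small as possible. A toy sketch of that loop (this is a generic greedy minimizer, not Trajectly's actual algorithm):

```python
# Hypothetical sketch of trace minimization: greedily drop steps while the
# trajectory still fails, leaving a smaller witness. Not Trajectly's algorithm.

def minimize(trace, fails):
    """Shrink `trace` to a smaller list on which `fails` still returns True."""
    assert fails(trace), "start from a failing trace"
    i = 0
    while i < len(trace):
        candidate = trace[:i] + trace[i + 1:]
        if fails(candidate):
            trace = candidate  # step i was irrelevant to the failure
        else:
            i += 1             # step i is needed to reproduce the failure
    return trace

# Toy contract: the forbidden tool was used somewhere in the trace.
fails = lambda trace: "delete_repo" in trace
print(minimize(["fetch_issue", "run_tests", "delete_repo", "open_pr"], fails))
# → ['delete_repo']
```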

I wanted to make agent testing feel less abstract and more developer-native, especially for people building with frameworks like LangChain where multi-step behavior matters a lot more than just the final text.

Would genuinely love feedback from people here building agent systems:
How are you testing trajectory-level behavior today, beyond just checking the final response?

https://www.trajectly.dev/merge-or-die