[Proposal] Solving "Silent Failures" with a Causal Precedence Evaluator for Agent Trajectories

Hi team & @hwchase17 — I’ve been exploring LangSmith’s trajectory evaluation docs, especially the distinction between:

  • exact / strict trajectory matching
  • unordered / any-order matching
  • LLM-as-judge over the full trajectory

That framing already captures something important: for agents, sequence is often part of correctness, not just a logging detail. The docs also note the tradeoff clearly: strict matching is deterministic but rigid, while LLM-as-judge is more flexible but less deterministic and requires an LLM call.

The gap I keep running into is a middle case:

  • strict is often too rigid, because there can be multiple valid trajectories
  • unordered is often too loose, because some tool calls are order-sensitive
  • LLM-as-judge is useful, but for privacy / security / compliance-sensitive flows I often want a deterministic evaluator first, then optionally an LLM evaluator on top.

So I built a very small custom evaluator MRE using the standard LangSmith evaluate(...) flow plus custom evaluators that read the Run object and inspect child tool runs. This follows the documented pattern for evaluating intermediate steps / trajectories directly from the trace.

The scenario

I used a tiny order-sensitive workflow with three tools:

  • set_private
  • read_data
  • optional audit_access

The only causal rule is:

set_private must happen before read_data

Then I score three trajectories:

  1. set_private -> read_data: safe
  2. set_private -> audit_access -> read_data: also safe, but not an exact match
  3. read_data -> set_private: unsafe, even if the final answer text looks successful

Why I think this is interesting

This seems like a case where:

  • exact match would reject trajectory #2
  • unordered match could accept trajectory #3
  • a deterministic causal / precedence evaluator would accept #1 and #2, and reject #3
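To make the comparison concrete, here is a minimal sketch in plain Python of the three matching strategies applied to the three trajectories. It is an illustration rather than the gist code: the function names (exact_match, any_order_match, precedence_ok, compare) are hypothetical and not LangSmith APIs, and "any order" is assumed to mean "same multiset of tool names".

```python
from typing import Dict, List, Tuple

# Reference trajectory and the single precedence rule from the scenario.
EXPECTED: List[str] = ["set_private", "read_data"]
RULES: List[Tuple[str, str]] = [("set_private", "read_data")]  # (before, after)

TRAJECTORIES: Dict[str, List[str]] = {
    "#1": ["set_private", "read_data"],
    "#2": ["set_private", "audit_access", "read_data"],
    "#3": ["read_data", "set_private"],
}


def exact_match(actual: List[str], expected: List[str]) -> bool:
    """Strict matching: same tools, same order, nothing extra."""
    return actual == expected


def any_order_match(actual: List[str], expected: List[str]) -> bool:
    """Unordered matching (assumed here to be multiset equality)."""
    return sorted(actual) == sorted(expected)


def precedence_ok(actual: List[str], rules: List[Tuple[str, str]]) -> bool:
    """Precedence matching: both tools of each rule must appear, and the first
    call of `before` must come earlier than the first call of `after`.
    Extra tools such as audit_access are ignored."""
    for before, after in rules:
        if before not in actual or after not in actual:
            return False
        if actual.index(before) > actual.index(after):
            return False
    return True


def compare() -> Dict[str, Dict[str, bool]]:
    """Verdict of each strategy for each trajectory."""
    return {
        label: {
            "exact": exact_match(calls, EXPECTED),
            "any_order": any_order_match(calls, EXPECTED),
            "precedence": precedence_ok(calls, RULES),
        }
        for label, calls in TRAJECTORIES.items()
    }


if __name__ == "__main__":
    for label, verdicts in compare().items():
        print(label, verdicts)
```

Only the precedence check gives the intended verdict on all three trajectories: it tolerates the extra audit_access call in #2 and still rejects the inverted order in #3.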

In other words, there seems to be room for a middle layer between “exact sequence” and “any order”:

  • not exact path matching
  • not arbitrary any-order matching
  • not necessarily LLM judging
  • but deterministic partial-order / causal-constraint checking

Why this feels relevant to LangSmith

LangSmith already treats trajectory evaluation as a first-class part of agent evaluation, and the docs explicitly support custom code evaluators plus evaluators that inspect intermediate steps from traces. That makes this feel like a natural extension rather than a competing paradigm.

Minimal reference implementation

I put together a notebook-friendly / copy-paste-friendly Python example using:

  • Client.create_dataset(...)
  • @traceable(run_type="tool")
  • client.evaluate(...)
  • three evaluators:
    • trajectory_exact_match
    • trajectory_any_order_match
    • trajectory_logical_causality

The key evaluator is the last one: it checks required tools + precedence rules, rather than exact sequence or unordered set equivalence.

My question

Would LangSmith be open to a built-in or community-supported evaluator pattern like this for order-sensitive tool workflows?
I’m not proposing it as a replacement for LLM-as-judge, only as a deterministic complement for cases where tool order has causal meaning.
I’ve prepared a runnable MRE script using standard LangSmith evaluators to showcase this. The gist is langsmith_order_sensitive_mre.py, or I can post it below if there’s interest. @hwchase17