Hi team & @hwchase17 — I’ve been exploring LangSmith’s trajectory evaluation docs, especially the distinction between:
- exact / strict trajectory matching
- unordered / any-order matching
- LLM-as-judge over the full trajectory
That framing already captures something important: for agents, sequence is often part of correctness, not just a logging detail. The docs also note the tradeoff clearly: strict matching is deterministic but rigid, while LLM-as-judge is more flexible but less deterministic and requires an LLM call.
The gap I keep running into is a middle case:
- strict is often too rigid, because there can be multiple valid trajectories
- unordered is often too loose, because some tool calls are order-sensitive
- LLM-as-judge is useful, but for privacy / security / compliance-sensitive flows I often want a deterministic evaluator first, then optionally an LLM evaluator on top
So I built a very small custom evaluator MRE using the standard LangSmith evaluate(...) flow plus custom evaluators that read the Run object and inspect child tool runs. This follows the documented pattern for evaluating intermediate steps / trajectories directly from the trace.
The scenario
I used a tiny order-sensitive workflow:
- set_private
- read_data
- optional audit_access
The only causal rule is:
set_private must happen before read_data
Then I score three trajectories:
1. set_private -> read_data: safe
2. set_private -> audit_access -> read_data: also safe, but not an exact match
3. read_data -> set_private: unsafe, even if the final answer text looks successful
Why I think this is interesting
This seems like a case where:
- exact match would reject trajectory #2
- unordered match could accept trajectory #3
- a deterministic causal / precedence evaluator would accept #1 and #2, and reject #3
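To make the comparison concrete, here is a minimal pure-Python sketch of the three checks applied to the three trajectories. It is not the MRE itself: the helper names (exact_match, any_order_match, precedence_match) and the REQUIRED / PRECEDENCE constants are hypothetical, and "unordered" is assumed here to mean the same tool calls in any order.

```python
# Hypothetical constants for this sketch (not taken from the MRE).
REQUIRED = ["set_private", "read_data"]      # reference trajectory / required tools
PRECEDENCE = [("set_private", "read_data")]  # (before, after) pairs

def exact_match(actual, reference=REQUIRED):
    """Same tools, same order, nothing extra."""
    return list(actual) == list(reference)

def any_order_match(actual, reference=REQUIRED):
    """Same tool calls, order ignored (assumed meaning of 'unordered')."""
    return sorted(actual) == sorted(reference)

def precedence_match(actual, required=REQUIRED, precedence=PRECEDENCE):
    """All required tools present, and each 'before' tool first appears
    earlier than the first call to its 'after' tool. Extra tools are allowed."""
    actual = list(actual)
    if any(tool not in actual for tool in required):
        return False
    for before, after in precedence:
        if after in actual and (
            before not in actual or actual.index(before) > actual.index(after)
        ):
            return False
    return True

trajectories = {
    1: ["set_private", "read_data"],
    2: ["set_private", "audit_access", "read_data"],
    3: ["read_data", "set_private"],
}

for num, traj in trajectories.items():
    print(num, exact_match(traj), any_order_match(traj), precedence_match(traj))
```

Under these assumptions, only the precedence check accepts #1 and #2 while rejecting #3.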
In other words, there seems to be room for a middle layer between “exact sequence” and “any order”:
- not exact path matching
- not arbitrary any-order matching
- not necessarily LLM judging
- but deterministic partial-order / causal-constraint checking
Why this feels relevant to LangSmith
LangSmith already treats trajectory evaluation as a first-class part of agent evaluation, and the docs explicitly support custom code evaluators plus evaluators that inspect intermediate steps from traces. That makes this feel like a natural extension rather than a competing paradigm.
Minimal reference implementation
I put together a notebook-friendly / copy-paste-friendly Python example using:
- Client.create_dataset(...)
- @traceable(run_type="tool")
- client.evaluate(...)
- three evaluators:
  - trajectory_exact_match
  - trajectory_any_order_match
  - trajectory_logical_causality
The key evaluator is the last one: it checks required tools + precedence rules, rather than exact sequence or unordered set equivalence.
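For readers who want the shape of that evaluator without opening the script, here is a rough sketch. It is not the code from the MRE. It assumes the run object exposes child_runs whose items carry run_type and name, that child runs arrive in execution order, and that a custom evaluator can return a dict with "key" and "score". The helper tool_names and the constants are hypothetical, and a SimpleNamespace stands in for a real Run so the snippet executes without LangSmith installed.

```python
from types import SimpleNamespace
from typing import List

# Hypothetical rule set for this sketch.
REQUIRED_TOOLS = ["set_private", "read_data"]
PRECEDENCE_RULES = [("set_private", "read_data")]  # (before, after)

def tool_names(run) -> List[str]:
    """Names of direct child runs with run_type == 'tool'.
    Assumes child_runs is already in execution order; nested agents
    would need a recursive walk instead."""
    children = getattr(run, "child_runs", None) or []
    return [c.name for c in children if getattr(c, "run_type", None) == "tool"]

def trajectory_logical_causality(run, example=None) -> dict:
    """Score 1 if all required tools ran and every precedence rule holds, else 0."""
    names = tool_names(run)
    ok = all(tool in names for tool in REQUIRED_TOOLS)
    for before, after in PRECEDENCE_RULES:
        if after in names and (
            before not in names or names.index(before) > names.index(after)
        ):
            ok = False
    return {"key": "trajectory_logical_causality", "score": int(ok)}

def fake_run(*tools):
    """Stand-in for a traced run, used only to exercise the evaluator here."""
    return SimpleNamespace(
        child_runs=[SimpleNamespace(run_type="tool", name=t) for t in tools]
    )

print(trajectory_logical_causality(fake_run("set_private", "audit_access", "read_data")))
print(trajectory_logical_causality(fake_run("read_data", "set_private")))
```

In a real experiment, a function of this shape would go in the evaluators list passed to evaluate(...), alongside the exact and any-order evaluators.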
My question
Would LangSmith be open to a built-in or community-supported evaluator pattern like this for order-sensitive tool workflows?
I’m not proposing it as a replacement for LLM-as-judge — only as a deterministic complement for cases where tool order has causal meaning.
I’ve prepared a runnable MRE script using standard LangSmith evaluators to showcase this. The gist is langsmith_order_sensitive_mre.py, or I can post the script below if there’s interest. @hwchase17