Built llmsessioncontract on AgentMiddleware: runtime enforcement of tool-call protocols — feedback wanted

Hi all, sharing a third-party package built on top of AgentMiddleware, plus some empirical data, looking for feedback before pushing it further.

The gap I’m trying to close

create_agent lets you specify what tools the agent has, but not what order it must call them in, or what other events must happen between calls (user approval, options presented, etc.). For most agents that’s fine. For agents that touch payments, bookings, or anything irreversible, it isn’t - “agent forgot the approval step” becomes a real incident class.

What I built

llmsessioncontract: a runtime monitor based on session-type theory. The LangChain integration (llmsessioncontract[langchain], v0.3.1) is a drop-in AgentMiddleware subclass that:

  • Defines the protocol as an explicit FSM (ProtocolFSM + Transition), with optional per-edge guard and action callbacks
  • Derives tool refs from the @tool callables (ref(search) - no magic strings)
  • Mixes tool-call events (fired automatically by wrap_tool_call) with non-tool events like !PresentOptions or ?UserApproval (fired explicitly by the orchestrator via monitor.transition_event(...))
  • On violation, the user-supplied on_violation callback decides whether to log, raise, or surface a ToolMessage(status="error") so the agent self-corrects on its next turn
```python
from llmcontract.langchain import (
    ProtocolFSM, Transition, ProtocolMonitor,
    ProtocolEnforcerMiddleware, ref,
)

# Tool refs derived from the @tool callables - no magic strings
search_ref, book_ref = ref(search), ref(book)

fsm = (
    ProtocolFSM(initial="idle")
    .add_transition(Transition(source="idle", tool=search_ref, phase="send", target="searching"))
    .add_transition(Transition(source="searching", tool=search_ref, phase="recv", target="results"))
    .add_transition(Transition(source="results", phase="send", target="presented",
                               event_label="PresentOptions"))
    .add_transition(Transition(source="presented", phase="recv", target="approved",
                               event_label="UserApproval"))
    .add_transition(Transition(source="approved", tool=book_ref, phase="send", target="booking"))
    .add_transition(Transition(source="booking", tool=book_ref, phase="recv", target="done"))
    .mark_terminal("done")
)

monitor = ProtocolMonitor(fsm)
middleware = ProtocolEnforcerMiddleware(monitor=monitor, tool_refs=[...])
agent = create_agent(model=..., tools=[...], middleware=[middleware])
```

Repo: chrisbartoloburlo/llmcontract on GitHub (runtime monitor for LLM agent interaction protocols based on session-type theory)

Empirical data
I ran a small study on a flight-booking protocol (10 tasks × 3 models × 2 trials = 60 trajectories) with create_agent driving real ChatAnthropic against the FSM above:

| Model | Violation rate |
| --- | --- |
| Claude Haiku 4.5 | 15% |
| Claude Sonnet 4.6 | 10% |
| Claude Opus 4.7 | 0% |

A separate study on Playwright MCP showed the opposite gradient - larger models took more shortcuts. Different protocols, different “skip” semantics, but the monitor surfaces both. Full breakdown in the repo’s examples/langchain_booking/reports/findings.md.

What I’d like feedback on

  1. Is AgentMiddleware the right extension point for this? I read it as exactly what you’d want - wrap_tool_call for tool events, orchestrator-side firing for everything else - but I’d love to hear if there’s a better factoring I missed.
  2. Should non-tool events flow through middleware too? Right now !PresentOptions (agent text replies) and ?UserApproval (projected user replies) are fired by the orchestrator. I considered hooking wrap_model_call to fire !PresentOptions automatically, but that couples the middleware to text-reply semantics that aren’t always what the protocol means. Curious how others have handled this.
  3. Linking from langchain docs - would it be appropriate to add a pointer from the AgentMiddleware page to this and similar third-party middlewares as ecosystem examples? Happy to draft a PR to a docs page if there’s a natural home for it.
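On point 2 specifically, here is a stdlib-only sketch of the orchestrator-side event path as it works today. This is not the real ProtocolMonitor, just the shape of the dispatch logic, so the trade-off is concrete:

```python
# Simplified, self-contained sketch of explicit non-tool event firing.
# MiniMonitor stands in for ProtocolMonitor; the real one also tracks
# phases, guards, and tool events.

class ProtocolViolation(Exception):
    pass

class MiniMonitor:
    """Tracks the current FSM state; transitions map (source, event) -> target."""
    def __init__(self, initial, transitions):
        self.state = initial
        self.transitions = transitions

    def transition_event(self, event_label):
        key = (self.state, event_label)
        if key not in self.transitions:
            raise ProtocolViolation(
                f"event {event_label!r} not allowed in state {self.state!r}"
            )
        self.state = self.transitions[key]

monitor = MiniMonitor(
    initial="results",
    transitions={
        ("results", "PresentOptions"): "presented",
        ("presented", "UserApproval"): "approved",
    },
)

monitor.transition_event("PresentOptions")   # agent showed options -> "presented"
monitor.transition_event("UserApproval")     # human approved -> "approved"

try:
    monitor.transition_event("UserApproval")  # approving twice is a violation
except ProtocolViolation as e:
    print("caught:", e)
```

The cost of this explicitness is exactly the question above: the orchestrator has to remember to call transition_event at the right moments.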

Not asking for a PR into the main repo - per the contributing guide, third-party integrations stay on PyPI, which is where this lives. Just genuinely want eyes on the design before I scale up the case studies.

Thanks!

Feedback on llmsessioncontract (this is entirely my subjective analysis; people can disagree)

Question 1: Is AgentMiddleware the right extension point?

Yes, for the most part, but with one important gap.

Looking at the actual hooks available:

  • wrap_tool_call / awrap_tool_call — ideal for the send/recv tool phases. This is exactly right.
  • before_agent / after_agent — useful for initializing and finalizing FSM state per session.
  • wrap_model_call — available but tricky for your non-tool events (more on this below).

The gap is FSM state persistence. The AgentMiddleware base class exposes a state_schema class attribute (a custom AgentState subclass) that gets checkpointed by LangGraph automatically. If your ProtocolFSM’s current state is stored as a Python instance attribute on the middleware object rather than in state_schema, it will be lost on any graph resumption (e.g., after a HumanInTheLoopMiddleware interrupt). This is a significant correctness risk for the approval-gate use case you’re targeting, since those agents almost always use LangGraph checkpointing. You should store the FSM’s current state node in a field on a custom state_schema class.

A related concern: if callers reuse the same ProtocolEnforcerMiddleware instance across concurrent agent invocations, the FSM state becomes a race condition. The middleware should be instantiated per-agent-run, or the FSM state must be keyed by thread_id from the runtime context.
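To illustrate the thread-keyed variant, here is a minimal stdlib sketch (class and method names are hypothetical; in the real middleware the key would come from the runtime config’s thread_id, and checkpointed state would replace the dict):

```python
# Sketch: keying FSM state by thread_id so one middleware instance can
# serve concurrent runs. A plain dict stands in for checkpointed state;
# real concurrent access would also need a lock or per-thread storage.

class ThreadKeyedFSM:
    def __init__(self, initial):
        self.initial = initial
        self._states = {}  # thread_id -> current FSM state node

    def current(self, thread_id):
        # Unknown threads start at the FSM's initial state
        return self._states.get(thread_id, self.initial)

    def advance(self, thread_id, target):
        self._states[thread_id] = target

fsm = ThreadKeyedFSM(initial="idle")
fsm.advance("thread-a", "searching")

print(fsm.current("thread-a"))  # searching
print(fsm.current("thread-b"))  # idle - untouched by thread-a's run
```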


Question 2: Should non-tool events flow through middleware too?

The manual monitor.transition_event(...) approach is more correct than auto-firing from wrap_model_call, for the reason you stated, but there is a middle path worth considering.

The after_model hook (available on AgentMiddleware) receives the full AgentState after every model call. The last message in state["messages"] will be the AIMessage the model just produced. You could inspect ai_msg.content there to detect !PresentOptions — but the semantic fragility problem remains. Unless your protocol uses structured output (i.e., response_format) to force the model to emit a typed PresentOptionsResponse, auto-detection is brittle.

A cleaner design would be to use wrap_model_call only when a response_format is in play. When the model is constrained to emit a typed structured response, you can detect !PresentOptions reliably from model_response.structured_response rather than free-form text parsing.
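To make the type-dispatch idea concrete, here is a stdlib-only sketch. The response classes and the table are illustrative, not your API (real structured outputs would be Pydantic models bound via response_format):

```python
from dataclasses import dataclass

# Illustrative structured-response types the model could be constrained
# to emit via response_format (names are hypothetical).
@dataclass
class PresentOptionsResponse:
    options: list

@dataclass
class AskClarificationResponse:
    question: str

# Per-type table: which structured response fires which protocol event.
EVENT_BY_TYPE = {
    PresentOptionsResponse: "PresentOptions",
    AskClarificationResponse: "AskClarification",
}

def event_for(structured_response):
    """Deterministic detection: dispatch on the response type, never on text."""
    for typ, event in EVENT_BY_TYPE.items():
        if isinstance(structured_response, typ):
            return event
    return None  # no protocol event for this response

print(event_for(PresentOptionsResponse(options=["AA123", "BA456"])))  # PresentOptions
print(event_for("free-form text reply"))  # None - plain text fires nothing
```

The point of the table is that a free-form reply can never accidentally fire a protocol event.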

For UserApproval, look at how HumanInTheLoopMiddleware works in this codebase: it uses langgraph.types.interrupt() inside wrap_tool_call to pause execution and wait for a human decision. That is a first-class LangGraph primitive for exactly this event class. You could fire UserApproval automatically by hooking into before_model (i.e., “before the model’s next turn, require approval”) rather than having the orchestrator call it explicitly. This would make it impossible for the orchestrator to “forget” the approval step, which seems to be the core incident class you’re targeting.
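The gate invariant itself is simple. A stdlib sketch (in LangGraph the suspension would be langgraph.types.interrupt(); here a plain exception stands in so the logic is visible):

```python
# Sketch of a session-level approval gate enforced before each model turn.
# ApprovalRequired plays the role of the LangGraph interrupt: it stops the
# turn from starting until a human "resume" grants approval.

class ApprovalRequired(Exception):
    pass

GATED_STATES = {"presented"}  # FSM states that must not start a model turn

def before_model_gate(fsm_state, approvals):
    """Refuse to begin the next reasoning turn until the human has resumed."""
    if fsm_state in GATED_STATES and fsm_state not in approvals:
        raise ApprovalRequired(f"state {fsm_state!r} needs human approval")

approvals = set()
try:
    before_model_gate("presented", approvals)  # agent cannot take its next turn
except ApprovalRequired:
    approvals.add("presented")  # human resumes with an approval

before_model_gate("presented", approvals)  # now the turn may proceed
```

Because the check runs unconditionally before every model turn, there is no orchestrator code path that skips it.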


Question 3: Linking from LangChain docs

The contributing guide is explicit: third-party integrations stay on PyPI, not in the main repo. A PR to add a pointer on the AgentMiddleware docs page is reasonable as a community example, but the bar will be high: the team will likely want the library to be stable and the use case clearly general before linking from first-party docs. See the “Publish an integration” page in the LangChain docs.

Genuinely useful review - thank you. You named three issues; here’s what I did with each in v0.4.0, just shipped to PyPI.

1. State persistence - fixed

You were right: the 0.3.x middleware stored FSM state on the ProtocolMonitor instance, which gets lost on any checkpoint rehydration. Worse, the booking demo I shipped is exactly the approval-gate use case that breaks first. Replaced ProtocolEnforcerMiddleware with CheckpointedProtocolMiddleware, which uses a ProtocolState(AgentState) schema:

```python
class ProtocolState(AgentState):
    fsm_state: str
    fsm_trace: list[str]
```

State is checkpointed automatically and keyed by thread_id, so the concurrency race goes away too. Reads now happen via request.state in wrap_tool_call and the AgentState in before_model / after_model; writes flow back through Command(update={...}).

2. The interrupt() redesign - done, and it’s the biggest change

Your point about the orchestrator being able to “forget” the gate was the most important sentence in your review. Added an interrupt: bool flag on Transition:

```python
.add_transition(Transition(
    source="search_done", phase="recv", target="approved",
    event_label="UserApproval", interrupt=True,   # framework-enforced
))
```

When the FSM enters the source state, before_model calls langgraph.types.interrupt(payload) automatically; the orchestrator cannot bypass it. The resume value drives the transition; for ambiguous source states with multiple interrupt transitions, the resume payload uses an explicit event_label to disambiguate. This moved the booking example from “5-line orchestrator dance with a known leak” to “framework-enforced approval gate”.

3. Structured-output detection - added

Added match_structured_response=<type> on Transition. When the agent uses response_format=Type and the FSM’s source state has a matching transition, after_model fires it deterministically via an isinstance check on state["structured_response"]. I skipped the text-pattern auto-detection path entirely, per your warning about brittleness.

Linking from docs

Noted on the bar; I’ll let 0.4.0 stabilise before raising it again. The HITL-aware interrupt integration in particular needs a few real users before I’d argue for a docs pointer in good faith.

Two questions on the new design, if you have a minute:

  1. The middleware suspends in before_model for interrupt=True transitions. HumanInTheLoopMiddleware does it from after_model on the AIMessage’s tool calls. Different surface, different semantics - before_model interrupts apply to FSM state, not to any specific tool call. Does that factoring look right, or is there a reason HITL went the other way?
  2. For multiple interrupt=True transitions sharing one source state (legitimate when a state has multiple human-driven outcomes, “approve” / “modify” / “deny”), the resume value carries an explicit event_label. Is there a more idiomatic way to model “branching human choice” that I should mirror instead?

Thanks again for engaging with this.

Interesting direction — especially for irreversible flows. One design check I’d add is to treat non-tool events as first-class auditable protocol events even if they are emitted by the orchestrator/UI rather than middleware.

The hard part is not just “did the tool calls happen in order?” but “what exactly was shown, who/what approved it, and can we reconstruct that boundary later?” Anthropic’s Project Deal is a useful proof point here: once agents are buying/negotiating on behalf of humans, consent boundaries and receipts stop being nice-to-have plumbing.

For the LangChain factoring, your current split feels reasonable to me: middleware enforces tool edges, while a small generic event sink lets the app emit PresentOptions/UserApproval without coupling protocol semantics to arbitrary model text. If this becomes a docs example, a payment/booking walkthrough with the audit trail visible would probably make the value very concrete.

They are genuinely different, and your before_model placement is correct for what you’re doing. The distinction is semantic, not arbitrary.

HumanInTheLoopMiddleware fires from after_model because it needs to see the AIMessage.tool_calls the model just emitted: the interrupt payload contains those concrete tool-call args, the human reviews them, and the resume value approves, edits, or rejects those specific calls before they reach the tool node. It is tool-call-level approval: review what the model decided to do.

Your before_model interrupt is session-level approval: enforce that the agent cannot begin its next reasoning turn until the protocol says it may. By the time before_model fires, the previous tool results are already in state; the FSM knows the agent just received search results and is in the “presented” state. Suspending at before_model means the agent cannot form the next action at all (it cannot autonomously book) until the human resumes. That is stricter, and more correct for your use case.

If you had put the interrupt in after_model instead, you’d fire after the model’s text reply (after it says “here are your results”), which loses one full reasoning turn. The model could have emitted a booking tool call in that turn before you intercepted it. before_model closes that window entirely.

One concrete consequence to document: because before_model fires at the start of the next model turn, there is a one-turn lag between the triggering event (the FSM entering “presented”) and the interrupt. Any tool calls emitted in the turn that drove the FSM into “presented” will already have executed before the human is asked. This is by design: the search happens, results are delivered, then the gate fires. Worth making explicit in your docs so users don’t expect the gate to intercept the search itself.

It works, but it has one practical weakness: event_label is a string, so a typo or an integration that returns an unrecognized label silently fails to match any transition. You end up in an FSM dead-end with no diagnostic. HumanInTheLoopMiddleware avoids this with a typed discriminated union:

```python
Decision = ApproveDecision | EditDecision | RejectDecision | RespondDecision
# each has a Literal `type` field
```

The more idiomatic approach would mirror that pattern. Instead of a raw event_label string in the resume payload, define a typed choice type per branching state:

```python
from typing import Any, Literal, NotRequired, TypedDict

class ProtocolResume(TypedDict):
    event_label: Literal["approve", "modify", "deny"]
    # any branch-specific payload, e.g. modified args for "modify"
    payload: NotRequired[dict[str, Any]]
```

Then in your FSM’s interrupt handler, match on resume["event_label"] and raise ProtocolViolationError immediately if the value is not in the set of valid labels for that source state. This gives you the same string-based dispatch you already have, but with early failure and a clear error rather than a silent dead-end.
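Concretely, the guard could look like this (a sketch: ProtocolViolationError and the per-state label sets are from your design, the function name is mine):

```python
from typing import Any

class ProtocolViolationError(Exception):
    pass

# Valid resume labels per branching source state, taken from the FSM
# definition (here hard-coded for the "presented" state as an example).
VALID_LABELS = {
    "presented": {"approve", "modify", "deny"},
}

def apply_resume(source_state: str, resume: dict[str, Any]) -> str:
    """Validate the resume payload's event_label before dispatching on it."""
    label = resume.get("event_label")
    allowed = VALID_LABELS.get(source_state, set())
    if label not in allowed:
        raise ProtocolViolationError(
            f"resume label {label!r} invalid in state {source_state!r}; "
            f"expected one of {sorted(allowed)}"
        )
    return label

print(apply_resume("presented", {"event_label": "approve"}))  # approve

try:
    apply_resume("presented", {"event_label": "aprove"})  # typo fails loudly
except ProtocolViolationError as e:
    print("caught:", e)
```

Same string dispatch you have now, but a bad label dies at the boundary with a diagnostic instead of wedging the FSM.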

The deeper LangGraph-idiomatic answer is to look at how Command(goto=...) interacts with graph routing. If your branching human choices lead to meaningfully different graph paths (e.g., “modify” loops back to the search node, while “deny” goes to a cancellation node), you could model each branch as a separate Command(goto="<node>") returned from before_model, skipping FSM states entirely for those branches. But this only makes sense if the branches are graph-level divergences. If all three outcomes (“approve”, “modify”, “deny”) stay within the same agent loop and only differ in FSM state, your event_label dispatch is the right level — just add the typed guard.

One other thing worth noting: if ProtocolState.fsm_trace already records all transitions, you get an audit log of every interrupt and its resolution for free, which is exactly the evidence you’d want in a payment/booking incident review. That’s a good default to have built in.
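For illustration, reconstructing an incident timeline from such a trace could be as simple as the following (the (state, event) entry shape is my assumption about the trace format, not the actual 0.4.0 schema):

```python
# Sketch: fsm_trace as an append-only audit log. Each entry records the
# FSM state at the time and the event that fired; the report is just a
# readable rendering of that list.

trace = [
    ("idle", "send:search"),
    ("searching", "recv:search"),
    ("results", "PresentOptions"),
    ("presented", "interrupt"),          # gate fired
    ("presented", "resume:approve"),     # human approved
    ("approved", "send:book"),
]

def audit_report(trace):
    """Render one line per transition: state, then the event that fired."""
    return "\n".join(f"{state:>10} -- {event}" for state, event in trace)

print(audit_report(trace))
```

Every interrupt and its resolution shows up as an ordinary pair of entries, so the incident-review story needs no extra machinery.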

Again, this is my subjective analysis and the community can disagree! But if I was able to provide you with some useful feedback, I would appreciate a heart :wink: