Hello @simanggu Welcome to Langchain Community
Re: Would pre-inference routing help long-context agent workflows?
Short answer: yes, this is a sensible place to live, and LangChain v1 is actually built around exactly this assumption - the agent loop is a sequence of middleware steps wrapped around the model and tool calls, not a monolith. So a commit/hold/block controller like yours maps cleanly onto the existing extension points.
I took a look at the repo. The core mechanism (thesis check → slot carry → recovery recenter → risk gate → decision) is well-scoped, and the 100-turn benchmark showing 18 full executions vs 100 baseline with preserved thesis coherence (4.97 vs 4.52) is much more credible than a generic “we saved 83% tokens” claim. A few thoughts on each of your questions:
Yes:
-
before_model / after_model - cheap pre/post hooks around each model call. This is the natural home for your commit/hold/block decision. before_model can return a state patch and a jump_to of "tools" | "model" | "end" (declared via hook_config(can_jump_to=...) or the @before_model(can_jump_to=...) decorator). That’s exactly the “skip the big model” (commit with mini) or “pause and ask for clarification” (hold) or “refuse execution” (block) behavior you want.
-
wrap_model_call / awrap_model_call — wraps the model call itself with access to a ModelRequest you can mutate (swap model, swap messages, change tool_choice, change tools). This is where you’d route “commit” decisions to mini vs standard models, or where you’d implement the “block” veto by raising before handler(request) is called.
-
wrap_tool_call - same pattern, but around tools. Useful if “block” should also veto unsafe tool invocations (e.g., destructive file changes, API-breaking edits), not just model calls.
So your three verbs map roughly:
- commit (mini) →
wrap_model_call swaps request.model to a lightweight model and calls handler(request).
- commit (standard) →
wrap_model_call lets the standard model proceed as-is.
- hold →
before_model returns {"jump_to": "end", "messages": [AIMessage(content="...clarification request...")]}. Or compose with HumanInTheLoopMiddleware (InterruptOnConfig) if you want a real interrupt-and-resume flow instead of just emitting text.
- block →
before_model or wrap_tool_call raises or returns a refusal message. Pairs naturally with existing middlewares like PIIMiddleware (blocks PII) and ToolCallLimitMiddleware (blocks runaway tool use).
Slot carry + recovery recenter are really context management strategies, which means they overlap with:
SummarizationMiddleware - compresses message history when it gets long. Your “recovery recenter” is conceptually “detect drift and re-anchor around the original thesis,” which is a selective summarization triggered by your controller rather than length-based.
ContextEditingMiddleware (with ClearToolUsesEdit) - removes stale tool I/O from history. Your slot carry is the inverse: “keep these N key facts in every turn’s context.”
If you build it as a single JudgmentControlMiddleware(AgentMiddleware) that:
- In
before_model, runs your lightweight classifier on state["messages"] and writes state["jcm_decision"] (use a TypedDict extension of AgentState, that’s the documented pattern).
- For hold or block, returns
{"jump_to": "end", "messages": [...]} (declare can_jump_to=["end"]).
- For commit, returns nothing from
before_model and in wrap_model_call swaps request.model to “mini” or lets “standard” proceed.
- For recovery recenter, patches
state["messages"] (drop stale turns, re-inject thesis anchor) before the model call - or delegates to SummarizationMiddleware rather than reimplementing.
…then a v1 user can drop it into create_agent(..., middleware=[JudgmentControlMiddleware(...), SummarizationMiddleware(...)]) in one line, which is the bar for “would I actually try this.”
Your 100-turn benchmark (18 full executions vs 100 baseline, 4.97 vs 4.52 thesis score, −85% latency, −85% tokens) is already much better than generic “we saved tokens” claims. To make it land harder with the LangChain community:
-
Compare against the honest baseline: create_agent with SummarizationMiddleware + ContextEditingMiddleware(ClearToolUsesEdit) + ToolCallLimitMiddleware is what people are already running. If you beat that by 20–30% on tokens/latency at matched thesis coherence, that’s a real result. Beating naive “send the whole transcript every turn” isn’t surprising.
-
Use a public, reproducible task: τ-bench / τ²-bench (airline, retail), SWE-bench Verified subsets, or the LangChain-published agent eval traces are artifacts reviewers can re-run. A proprietary 100-turn transcript is hard to validate.
-
Report task success, not just tokens: “−85% tokens” is meaningless if success rate dropped. The interesting metric is tokens-per-successful-trajectory and latency-per-successful-trajectory at matched success rate. Add confidence intervals over ≥3 seeds.
-
Define “thesis score: 4.97” precisely: Who/what is the judge, what’s the rubric, what’s the inter-rater agreement? If it’s an LLM-as-judge, say which model and publish the prompt, otherwise that number reads as a vibe.
-
Publish controller overhead: The controller itself is a lightweight call, but people will immediately ask “what’s the latency and token cost of the judgment layer?” Show the break-even point - i.e., “above N turns of context, the controller pays for itself.” That single chart is probably the most persuasive artifact you could ship.
I’d start by composing on top of these instead of replacing them - your controller is a policy and they are the mechanisms it should drive:
-
SummarizationMiddleware - your “recovery recenter” is essentially “trigger summarization now, on this slice, around this thesis anchor.” Don’t reimplement the summarization loop; call into it or patch state["messages"] and let it run.
-
ContextEditingMiddleware with ClearToolUsesEdit - drop stale tool I/O when you decide a turn is safe and doesn’t need that history. Your slot carry is the inverse (keep key facts), but they compose cleanly.
-
ModelFallbackMiddleware / wrap_model_call, your “commit (mini)” vs “commit (standard)” is literally the model-swap pattern. Use request.override(model=mini_model) inside wrap_model_call.
-
HumanInTheLoopMiddleware (InterruptOnConfig) - your “hold” is much more useful if it can actually interrupt and resume, not just emit a clarification message and hope the user responds.
-
ToolCallLimitMiddleware and PIIMiddleware - your “block” overlaps with these; if your controller is going to veto, it should at least respect/surface the same signals (e.g., if PIIMiddleware would have blocked, your controller should show that in the reason).
Runtime security angle:
The “copied-session detection, replay resistance, permission-aware execution gates” you mention in the README is interesting but underspecified. If you mean “detect when state was forked/copied and refuse to execute because permissions may not transfer,” that’s a novel concern I haven’t seen middleware handle before. But it’s also the kind of thing that needs a threat model doc, because reviewers will immediately ask “what attack does this stop?” If it’s more about “detect when the agent’s internal state is inconsistent with the user’s intent,” that’s really just another flavor of drift detection (which is covered by the thesis check).
One non-obvious gotcha:
You say “Skipped does not mean ignored. It means low-risk turns were safely handled without full expensive execution.” That’s great framing, but you need to show how they were handled. Did the controller synthesize a response itself? Did it delegate to a mini model? Did it skip the model call entirely and return a cached/templated response? That distinction matters for the integration story - if you’re synthesizing responses outside the model, that’s a before_model + jump_to="end" + inject AIMessage. If you’re routing to a mini model, that’s wrap_model_call with model swap. The middleware API supports both, but they’re different hooks.
Bottom line:
This is a good direction. The LangChain v1 middleware surface was designed for exactly this kind of controller, and if you can show it beating SummarizationMiddleware + ContextEditingMiddleware on a public benchmark with reproducible numbers, people will pay attention. The “thesis coherence over long context” angle is the right framing - that’s the pain point summarization doesn’t fully solve.
Happy to look at a PR/branch if you put one up against the v1 middleware API. If you want a concrete starting point, I’d scaffold JudgmentControlMiddleware(AgentMiddleware[AgentState[ResponseT], ContextT, ResponseT]) with a before_model that writes state["jcm_decision"] and a wrap_model_call that reads it and routes accordingly. That’s ~50 lines and gives you something people can pip install and try in one line.
I hope this helps, spend alot of understanding your repo
and I love the overall concept.