Would pre-inference routing help long-context agent workflows?

simanggu · May 8, 2026, 1:21pm

Hi everyone,

I’m testing a small pre-inference controller for long-context LLM / agent sessions, and I’d love feedback from people building LangChain-style agent workflows.

The idea is to run a lightweight controller before the model call and route each input into one of:

FAST: answer with compact reference context
HOLD: wait / ask for clarification
BLOCK: prevent contradictory or unsafe execution
DEEP: run deeper reasoning
RECOVER: restore long-context consistency
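
As a rough illustration of the routing decision only (this is not the actual controller; the `TurnSignals` fields and the thresholds are placeholders made up for this sketch), the five routes could be modeled like this:

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    FAST = "fast"        # answer with compact reference context
    HOLD = "hold"        # wait / ask for clarification
    BLOCK = "block"      # prevent contradictory or unsafe execution
    DEEP = "deep"        # run deeper reasoning
    RECOVER = "recover"  # restore long-context consistency


@dataclass
class TurnSignals:
    """Hypothetical per-turn features a lightweight controller might compute."""
    is_ambiguous: bool = False
    violates_constraint: bool = False
    drift_score: float = 0.0   # 0.0 = on task, 1.0 = fully drifted
    complexity: float = 0.0    # 0.0 = trivial, 1.0 = needs deep reasoning


def route(signals: TurnSignals, drift_limit: float = 0.6, deep_limit: float = 0.7) -> Route:
    # Safety checks run first, so a violation is blocked even if the turn is also ambiguous.
    if signals.violates_constraint:
        return Route.BLOCK
    if signals.is_ambiguous:
        return Route.HOLD
    if signals.drift_score >= drift_limit:
        return Route.RECOVER
    if signals.complexity >= deep_limit:
        return Route.DEEP
    return Route.FAST
```

The ordering of the checks is a design choice in this sketch: cheap, conservative outcomes (BLOCK, HOLD) are decided before any route that spends tokens.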

Why I built it:
In long-running agent sessions, not every user input needs full reasoning or full context reprocessing. Some inputs are aligned with the current task, some conflict with prior constraints, and some require deeper reasoning.

Early local benchmark:
300-turn local test

tokens: 312,397 → 52,320 (-83.3%)
runtime: 1880s → 320s (-83.0%)
consistency score: 4.05 → 4.92 (+21.5%)
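
For anyone checking the arithmetic, the percentages above are plain relative changes. A minimal sketch (the helper name `pct_change` is just for illustration):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


tokens = pct_change(312_397, 52_320)
runtime = pct_change(1880, 320)
consistency = pct_change(4.05, 4.92)

print(f"tokens {tokens:+.1f}%, runtime {runtime:+.1f}%, consistency {consistency:+.1f}%")
# tokens -83.3%, runtime -83.0%, consistency +21.5%
```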

I’m not claiming generality yet — this is an early benchmark. I’m mainly looking for feedback on whether this kind of routing layer would make sense before LangChain model calls, tools, or agent execution.

GitHub:

Questions:

Would this fit as middleware before model/tool calls?
What LangChain benchmark or example would be most convincing?
Is there an existing LangChain pattern I should integrate with first?

keenborder786 · May 8, 2026, 2:14pm

Hello @simanggu, welcome to the LangChain community.

Re: Would pre-inference routing help long-context agent workflows?

Short answer: yes, this is a sensible place to live, and LangChain v1 is actually built around exactly this assumption - the agent loop is a sequence of middleware steps wrapped around the model and tool calls, not a monolith. So a commit/hold/block controller like yours maps cleanly onto the existing extension points.

I took a look at the repo. The core mechanism (thesis check → slot carry → recovery recenter → risk gate → decision) is well-scoped, and the 100-turn benchmark showing 18 full executions vs 100 baseline with preserved thesis coherence (4.97 vs 4.52) is much more credible than a generic “we saved 83% tokens” claim. A few thoughts on each of your questions:

On whether this would fit as middleware before model/tool calls: yes. The relevant hooks are:

before_model / after_model - cheap pre/post hooks around each model call. This is the natural home for your commit/hold/block decision. before_model can return a state patch and a jump_to of "tools" | "model" | "end" (declared via hook_config(can_jump_to=...) or the @before_model(can_jump_to=...) decorator). That’s exactly the “skip the big model” (commit with mini) or “pause and ask for clarification” (hold) or “refuse execution” (block) behavior you want.
wrap_model_call / awrap_model_call - wraps the model call itself with access to a ModelRequest you can mutate (swap model, swap messages, change tool_choice, change tools). This is where you’d route “commit” decisions to mini vs standard models, or where you’d implement the “block” veto by raising before handler(request) is called.
wrap_tool_call - same pattern, but around tools. Useful if “block” should also veto unsafe tool invocations (e.g., destructive file changes, API-breaking edits), not just model calls.

So your three verbs map roughly:

commit (mini) → wrap_model_call swaps request.model to a lightweight model and calls handler(request).
commit (standard) → wrap_model_call lets the standard model proceed as-is.
hold → before_model returns {"jump_to": "end", "messages": [AIMessage(content="...clarification request...")]}. Or compose with HumanInTheLoopMiddleware (InterruptOnConfig) if you want a real interrupt-and-resume flow instead of just emitting text.
block → before_model or wrap_tool_call raises or returns a refusal message. Pairs naturally with existing middlewares like PIIMiddleware (blocks PII) and ToolCallLimitMiddleware (blocks runaway tool use).

Slot carry + recovery recenter are really context management strategies, which means they overlap with:

SummarizationMiddleware - compresses message history when it gets long. Your “recovery recenter” is conceptually “detect drift and re-anchor around the original thesis,” which is a selective summarization triggered by your controller rather than length-based.
ContextEditingMiddleware (with ClearToolUsesEdit) - removes stale tool I/O from history. Your slot carry is the inverse: “keep these N key facts in every turn’s context.”

If you build it as a single JudgmentControlMiddleware(AgentMiddleware) that:

In before_model, runs your lightweight classifier on state["messages"] and writes state["jcm_decision"] (use a TypedDict extension of AgentState, that’s the documented pattern).
For hold or block, returns {"jump_to": "end", "messages": [...]} (declare can_jump_to=["end"]).
For commit, returns nothing from before_model and in wrap_model_call swaps request.model to “mini” or lets “standard” proceed.
For recovery recenter, patches state["messages"] (drop stale turns, re-inject thesis anchor) before the model call - or delegates to SummarizationMiddleware rather than reimplementing.

…then a v1 user can drop it into create_agent(..., middleware=[JudgmentControlMiddleware(...), SummarizationMiddleware(...)]) in one line, which is the bar for “would I actually try this.”
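
To make the control flow concrete, here is a framework-free sketch of the steps above. It deliberately does not import LangChain: the function names mirror the hooks, but the signatures, the plain-dict state and request, and the toy `classify` rules are simplifying assumptions, not the real middleware API.

```python
from typing import Any, Callable, Optional

State = dict  # plain dict standing in for the agent state


def classify(messages: list) -> str:
    """Toy stand-in for the controller; returns a decision label."""
    last = messages[-1].lower()
    if "delete everything" in last:
        return "block"
    if last.endswith("?") and len(last.split()) < 3:
        return "hold"
    return "commit_mini" if len(last.split()) < 12 else "commit_standard"


def before_model(state: State) -> Optional[dict]:
    """Record the decision; short-circuit to the end on hold/block."""
    decision = classify(state["messages"])
    state["jcm_decision"] = decision
    if decision == "hold":
        return {"jump_to": "end", "messages": ["Could you clarify what you mean?"]}
    if decision == "block":
        return {"jump_to": "end", "messages": ["Refused: conflicts with earlier constraints."]}
    return None  # commit: fall through to the model call


def wrap_model_call(state: State, request: dict, handler: Callable[[dict], Any]) -> Any:
    """Swap to the lightweight model for 'commit_mini', otherwise pass through."""
    if state.get("jcm_decision") == "commit_mini":
        request = {**request, "model": "mini"}
    return handler(request)


def run_turn(state: State, handler: Callable[[dict], Any]) -> Any:
    """Minimal driver showing the order in which the two hooks run."""
    patch = before_model(state)
    if patch is not None and patch.get("jump_to") == "end":
        return patch["messages"][-1]
    request = {"model": "standard", "messages": state["messages"]}
    return wrap_model_call(state, request, handler)
```

With a fake handler such as `lambda req: req["model"] + " answered"`, a short instruction is routed to the mini model, a terse question returns the clarification text without calling the handler, and a longer instruction goes to the standard model.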

Your 100-turn benchmark (18 full executions vs 100 baseline, 4.97 vs 4.52 thesis score, −85% latency, −85% tokens) is already much better than generic “we saved tokens” claims. To make it land harder with the LangChain community:

Compare against the honest baseline: create_agent with SummarizationMiddleware + ContextEditingMiddleware(ClearToolUsesEdit) + ToolCallLimitMiddleware is what people are already running. If you beat that by 20–30% on tokens/latency at matched thesis coherence, that’s a real result. Beating naive “send the whole transcript every turn” isn’t surprising.
Use a public, reproducible task: τ-bench / τ²-bench (airline, retail), SWE-bench Verified subsets, or the LangChain-published agent eval traces are artifacts reviewers can re-run. A proprietary 100-turn transcript is hard to validate.
Report task success, not just tokens: “−85% tokens” is meaningless if success rate dropped. The interesting metric is tokens-per-successful-trajectory and latency-per-successful-trajectory at matched success rate. Add confidence intervals over ≥3 seeds.
Define “thesis score: 4.97” precisely: Who/what is the judge, what’s the rubric, what’s the inter-rater agreement? If it’s an LLM-as-judge, say which model and publish the prompt, otherwise that number reads as a vibe.
Publish controller overhead: The controller itself is a lightweight call, but people will immediately ask “what’s the latency and token cost of the judgment layer?” Show the break-even point - i.e., “above N turns of context, the controller pays for itself.” That single chart is probably the most persuasive artifact you could ship.
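
A sketch of how the per-seed reporting could look. All run numbers below are invented placeholders, and the helper names are hypothetical; the only fixed facts are the two-sided 95% Student t critical values used for the small-sample interval.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student t critical values, keyed by number of seeds n (df = n - 1).
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776}


def tokens_per_success(total_tokens: int, successes: int) -> float:
    """Tokens spent per successful trajectory; infinite if nothing succeeded."""
    return float("inf") if successes == 0 else total_tokens / successes


def mean_ci95(values: list) -> tuple:
    """Mean and 95% confidence half-width across seeds (2 to 5 seeds supported here)."""
    n = len(values)
    half_width = T_CRIT_95[n] * stdev(values) / sqrt(n)
    return mean(values), half_width


# Placeholder data: (total_tokens, successful_trajectories) for three seeds each.
baseline_runs = [(120_000, 40), (130_200, 42), (113_100, 39)]
controller_runs = [(60_000, 40), (67_200, 42), (54_600, 39)]

baseline = [tokens_per_success(t, s) for t, s in baseline_runs]
controller = [tokens_per_success(t, s) for t, s in controller_runs]

for name, values in (("baseline", baseline), ("controller", controller)):
    m, hw = mean_ci95(values)
    print(f"{name}: {m:.0f} ± {hw:.0f} tokens per successful trajectory")
```

Note that the success counts are matched between the two arms in this toy data, which is the condition under which the comparison is meaningful.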

I’d start by composing on top of these instead of replacing them - your controller is a policy and they are the mechanisms it should drive:

SummarizationMiddleware - your “recovery recenter” is essentially “trigger summarization now, on this slice, around this thesis anchor.” Don’t reimplement the summarization loop; call into it or patch state["messages"] and let it run.
ContextEditingMiddleware with ClearToolUsesEdit - drop stale tool I/O when you decide a turn is safe and doesn’t need that history. Your slot carry is the inverse (keep key facts), but they compose cleanly.
ModelFallbackMiddleware / wrap_model_call - your “commit (mini)” vs “commit (standard)” is literally the model-swap pattern. Use request.override(model=mini_model) inside wrap_model_call.
HumanInTheLoopMiddleware (InterruptOnConfig) - your “hold” is much more useful if it can actually interrupt and resume, not just emit a clarification message and hope the user responds.
ToolCallLimitMiddleware and PIIMiddleware - your “block” overlaps with these; if your controller is going to veto, it should at least respect/surface the same signals (e.g., if PIIMiddleware would have blocked, your controller should show that in the reason).

Runtime security angle:

The “copied-session detection, replay resistance, permission-aware execution gates” you mention in the README is interesting but underspecified. If you mean “detect when state was forked/copied and refuse to execute because permissions may not transfer,” that’s a novel concern I haven’t seen middleware handle before. But it’s also the kind of thing that needs a threat model doc, because reviewers will immediately ask “what attack does this stop?” If it’s more about “detect when the agent’s internal state is inconsistent with the user’s intent,” that’s really just another flavor of drift detection (which is covered by the thesis check).

One non-obvious gotcha:

You say “Skipped does not mean ignored. It means low-risk turns were safely handled without full expensive execution.” That’s great framing, but you need to show how they were handled. Did the controller synthesize a response itself? Did it delegate to a mini model? Did it skip the model call entirely and return a cached/templated response? That distinction matters for the integration story - if you’re synthesizing responses outside the model, that’s a before_model + jump_to="end" + inject AIMessage. If you’re routing to a mini model, that’s wrap_model_call with model swap. The middleware API supports both, but they’re different hooks.

Bottom line:

This is a good direction. The LangChain v1 middleware surface was designed for exactly this kind of controller, and if you can show it beating SummarizationMiddleware + ContextEditingMiddleware on a public benchmark with reproducible numbers, people will pay attention. The “thesis coherence over long context” angle is the right framing - that’s the pain point summarization doesn’t fully solve.

Happy to look at a PR/branch if you put one up against the v1 middleware API. If you want a concrete starting point, I’d scaffold JudgmentControlMiddleware(AgentMiddleware[AgentState[ResponseT], ContextT, ResponseT]) with a before_model that writes state["jcm_decision"] and a wrap_model_call that reads it and routes accordingly. That’s ~50 lines and gives you something people can pip install and try in one line.

I hope this helps. I spent a lot of time understanding your repo, and I love the overall concept.

simanggu · May 10, 2026, 1:09pm

Hi Mohammad Mohtashim Khan,

Thanks for taking the time to read the project carefully. Your mapping to before_model, wrap_model_call, wrap_tool_call, and human-in-the-loop is exactly the integration direction I wanted to validate.

I agree that the right first artifact is not a new agent framework, but a small LangChain-native middleware layer.

My current plan is to keep the LangChain side intentionally thin:

before_model calls an external vPUF-CFE Gateway and writes state["jcm_decision"].
commit continues to the model call.
hold returns an approval-required response or routes into human-in-the-loop.
block stops before model invocation.
wrap_model_call can route fast vs. deep decisions to different models.
wrap_tool_call can veto unsafe tool execution or require approval before continuing.

The production trust-anchor logic, session-integrity checks, replay resistance, permission-aware execution gate, and STCU audit ledger would remain inside the external vPUF-CFE Gateway. The LangChain package would only be the adapter.
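
A minimal sketch of what the adapter boundary could look like, with the transport injected so it can be tested without a running gateway. The request/response shape (`{"messages": ...}` in, `{"decision": ...}` out) and the fail-closed fallback to "hold" are assumptions for this sketch, not the actual vPUF-CFE Gateway protocol.

```python
import json
from typing import Any, Callable

# Takes a JSON request body, returns a JSON response body.
# In a real adapter this would wrap an HTTP POST to the local gateway.
Transport = Callable[[str], str]

ALLOWED_DECISIONS = {"commit", "hold", "block"}


def gateway_decision(state: dict, transport: Transport) -> str:
    """Ask the external gateway for a decision and record it on the state.

    Assumption: any transport error or unexpected reply is treated as "hold"
    (fail closed) instead of letting the model call proceed.
    """
    body = json.dumps({"messages": state["messages"]})
    try:
        reply: Any = json.loads(transport(body))
        decision = reply.get("decision")
    except Exception:
        decision = None
    if decision not in ALLOWED_DECISIONS:
        decision = "hold"
    state["jcm_decision"] = decision
    return decision
```

Keeping the adapter this small means the LangChain side only needs to know the decision label; everything else stays behind the gateway.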

Since my original post, I also hardened the Python V1 gateway into a more product-like PoC:

External red-team: 1000 / 1000 pass
False negatives: 0
False positives: 0
Unsafe LLM calls: 0
HTTP errors: 0
Native on-prem smoke: 10 / 10 pass
Current gateway decision p95: around 11.56ms in the B15 run

I do not want to overclaim that number before testing it inside a LangChain-native middleware benchmark. My next step is to measure the extra middleware overhead and compare against the honest baseline you suggested: SummarizationMiddleware, ContextEditingMiddleware, ToolCallLimitMiddleware, and human-in-the-loop.

If this direction looks reasonable, I’ll prepare a minimal vpuf-langchain prototype with:

a thin JudgmentControlMiddleware
a small HTTP client for the local vPUF-CFE Gateway
a toy demo controller for open-source use
one coding-agent safety example
a benchmark harness for controller overhead and end-to-end latency

I’d appreciate any guidance on the most stable v1 middleware pattern to target first, especially for jump_to="end", tool-call review, and LangSmith trace metadata.

Best,
Jason Sim

simanggu · May 10, 2026, 6:32pm

Hi Mohammad Mohtashim Khan,

I prepared a small public-safe PoC repo based on your feedback:

https://github.com/simanggu/langchain-external-control-plane-poc

It keeps the LangChain side intentionally thin and demonstrates four tracks:

state-bound runtime security,
runtime safety and authorization,
controller overhead / break-even,
long-context thesis continuity.

The production vPUF-CFE core is not included. The repo only exposes the adapter boundary, toy gateway, public schemas, mock anchor proof, STCU-style demo chain, and reproducible local benchmarks.

The strongest signal so far is the long-context track: at the same model-call budget as a summary-memory baseline, the external phase-memory controller improved pass rate from 70% to 100% and reduced total token usage by 40.39% in the local fake-model benchmark.

I’d be interested in which track feels most relevant to LangChain/LangGraph users: state-bound runtime security, long-context thesis continuity, runtime safety/tool authorization, or break-even overhead.

Best,
Jason Sim

keenborder786 · May 11, 2026, 12:18pm

Okay, please give me some time as I will have a detailed look at it.

keenborder786 · May 11, 2026, 2:30pm

Direct answer to your question: Track 3 (long-context thesis continuity) is by far the most relevant to current LangChain/LangGraph users, and it’s also your strongest result.

Here’s why each track lands differently:

Track 3: Long-context thesis continuity (MOST RELEVANT)

Why it matters:

This is the pain point SummarizationMiddleware + ContextEditingMiddleware don’t fully solve. Those middlewares compress context and drop stale tool I/O, but they don’t actively detect drift or preserve task thesis across 100+ turns.
Your 70% → 100% pass rate improvement at the same model-call budget (60 calls each) is the killer result. It’s not just “we saved tokens by skipping calls”: you improved the success rate while also reducing tokens by 40.39% vs summary memory.
The 58.27% improvement in tokens-per-successful-trajectory is a clean, interpretable metric that addresses the “−85% tokens” skepticism from the original post.
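
The 58.27% figure follows directly from the two headline numbers, assuming both arms attempted the same number of trajectories (that assumption is mine, not something stated in the report). A quick check:

```python
pass_before, pass_after = 0.70, 1.00
token_ratio = 1 - 0.4039  # controller total tokens / summary-memory total tokens

# With the same number of attempted trajectories in both arms,
# tokens-per-success is proportional to total_tokens / pass_rate.
tps_ratio = token_ratio * pass_before / pass_after
improvement = (1 - tps_ratio) * 100

print(f"{improvement:.2f}%")  # 58.27%
```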

What makes it credible:

You’re comparing against summary memory, not a naive “no memory” baseline. That’s the honest baseline I recommended.
The behaviors you’re surfacing (recenter_count: 10, recovery_recenter_count: 30, decay_applied_count: 19) are genuinely novel. SummarizationMiddleware does length-triggered compression, but it doesn’t do thesis-triggered recenter or slot decay. That’s a real architectural distinction.
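
To illustrate the distinction, here is one possible reading of “slot decay” as a toy data structure. This is my guess at the general idea, not the repo’s implementation; the class name, the `decay` and `floor` parameters, and the refresh rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SlotMemory:
    """Carried facts with weights that decay unless the fact is mentioned again."""
    decay: float = 0.5   # multiplier applied each turn a fact goes unmentioned
    floor: float = 0.2   # facts whose weight drops below this are evicted
    slots: dict = field(default_factory=dict)  # fact -> weight

    def carry(self, fact: str) -> None:
        self.slots[fact] = 1.0

    def step(self, mentioned: set) -> None:
        """Advance one turn: refresh mentioned facts, decay and evict the rest."""
        for fact in list(self.slots):
            if fact in mentioned:
                self.slots[fact] = 1.0
            else:
                self.slots[fact] *= self.decay
                if self.slots[fact] < self.floor:
                    del self.slots[fact]

    def context(self) -> list:
        """Facts to inject into the next model call, strongest first."""
        return sorted(self.slots, key=self.slots.get, reverse=True)
```

The point of the sketch is that eviction is driven by relevance over time rather than by message length, which is what separates it from length-triggered compression.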

Next step to make it bulletproof:

Run this same benchmark with real models (even just gpt-4o-mini / gpt-4o) and show actual token billing + latency. The fake-model numbers are fine for development, but LangChain users will want to see real API costs. You could add a --real-model flag and publish both results.
Add a comparison against the actual LangChain middlewares. Something like: create_agent(..., middleware=[SummarizationMiddleware(trigger_tokens=4000), ContextEditingMiddleware(ClearToolUsesEdit)]) and show your controller beats that on the same 100-turn benchmark. That’s the comparison that will get attention.

Track 0: State-bound runtime security (NICHE)

Why it’s less relevant:

Most LangChain users aren’t thinking about device anchors, vPUF proofs, nonce replay attacks, or STCU ledgers. This is a novel security model, but it’s solving a threat model (session forgery, copied-session attacks, permission drift) that hasn’t been articulated in the LangChain community yet.
The phrase “state-bound runtime security” is abstract. When I read the report, I understand the mechanics (anchor proof validation, actor continuity, permission continuity), but I don’t immediately see the attack scenario this prevents. Without a concrete threat story, reviewers will skip this.

How to make it more accessible:

Lead with a concrete attack scenario instead of the mechanism. Something like: “Track 0 prevents an attacker who copied your LangSmith session ID from hijacking your agent mid-conversation, even if they have your API key.” Or: “Track 0 detects when an agent session was forked and prevents the forked copy from executing with the original permissions.” That’s a threat model people can visualize.
Right now, the vPUF / STCU / anchor-proof language makes this feel like a cryptographic research prototype rather than a practical security layer. If you want this to land with the LangChain community, frame it as “session integrity for long-running agents” and show a before/after attack demo.

That said: If you’re pitching this to enterprise LLMOps teams (not open-source LangChain users), Track 0 might be your most differentiated feature. But for the forum thread you posted, it’s too abstract.

Track 1: Runtime safety and authorization (OVERLAP)

Why it’s less compelling:

This overlaps heavily with existing LangChain middlewares: PIIMiddleware, ToolCallLimitMiddleware, HumanInTheLoopMiddleware. Your 100/100 pass rate with zero unsafe pre-model calls and zero unsafe tool executions is good, but LangChain users can already achieve this by composing the existing middlewares.
The “30/30 tool-gate successes” and “15 recenter decisions” are interesting, but without more detail on what those decisions were, it’s hard to evaluate novelty.

Next step:

Show a case where your external control plane catches something the native middlewares miss, or where composition of native middlewares would be verbose/brittle and your controller handles it cleanly in one place. Right now, the report shows you matched the baseline, not that you beat it.

Track 2: Efficiency / overhead / break-even (TABLE STAKES)

Why it matters but isn’t a feature:

Controller overhead is something people will ask about, but “we matched the baseline on model calls and added minimal latency” isn’t a compelling value proposition — it’s table stakes. If your overhead was 500ms+ per call, that would be a blocker. But “http_p50_ms: 0.32ms, http_max_ms: 8.87ms” is fast enough that people won’t care.
The efficiency results (50% → 100% pass rate, 100 calls → 50 calls) are good, but they’re better framed as Track 3 results (long-context token reduction) rather than a standalone overhead story.

Next step:

I’d merge this into Track 3’s narrative: “The controller adds <1ms median latency and pays for itself by turn N” is a footnote, not a headline.
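
A toy model of that break-even point, under assumptions I’m making up for illustration: the cost of a full call grows linearly with the turn number, skipped turns pay a fixed compact cost, and every turn pays the controller overhead. None of the parameter values come from the report.

```python
from typing import Optional


def break_even_turn(
    overhead: float,      # controller cost paid on every turn
    skip_rate: float,     # fraction of turns handled without a full call
    base_cost: float,     # cost of a full call with empty context
    growth: float,        # extra cost of a full call per turn of accumulated context
    compact_cost: float,  # cost of handling a skipped turn
    max_turns: int = 1000,
) -> Optional[int]:
    """First turn at which cumulative expected cost with the controller
    is no higher than without it, or None if that never happens."""
    baseline_total = 0.0
    controlled_total = 0.0
    for turn in range(1, max_turns + 1):
        full = base_cost + growth * turn
        baseline_total += full
        controlled_total += overhead + skip_rate * compact_cost + (1 - skip_rate) * full
        if controlled_total <= baseline_total:
            return turn
    return None


print(break_even_turn(overhead=310, skip_rate=0.5, base_cost=200, growth=50, compact_cost=100))
# 20
```

Plotting the two cumulative curves from the same loop would give the “pays for itself by turn N” chart.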

I hope this gives you some insights. If it did, a heart would be appreciated.

simanggu · May 11, 2026, 7:25pm

Hi Mohammad Mohtashim Khan,

I pushed a follow-up repo with a real-model GPT-5.4-mini randomized robustness track.

The most interesting result is not the fixed-template 100% pass rate, but the randomized stress loop:

v1 exposed paraphrase coverage gaps,
I patched the private controller from exact phrase rules to intent-family rules,
then reran unseen seeds.

Across five unseen 100-turn randomized runs:

pass rate: 42.2% baseline → 99.6% private controller
tokens: -72.6%
estimated API cost: -66.1%
model calls: 79.0 → 32.4
model p95 latency: -16.2%
total model latency: -57.2%
model latency per successful turn: -81.9%

Caveat: this is still a stress-style PoC, not a production traffic benchmark. p99 worsened due to one outlier, so I’m treating p95 as the primary tail-latency metric and p99 as diagnostic.
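
To show why a single outlier can move p99 while leaving p95 untouched, here is a small sketch using the nearest-rank percentile definition. The latency values are invented, and the nearest-rank method is my choice for the illustration, not necessarily what the benchmark harness uses.

```python
def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need at least one sample and 0 < p <= 100")
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


# Invented latencies in ms: 49 ordinary calls plus one slow outlier.
latencies = [100.0 + i for i in range(49)] + [5000.0]

p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(p95, p99)  # 147.0 5000.0
```

With 50 samples, p99 lands on the slowest call, so it reports the outlier itself, while p95 still reflects the ordinary calls.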

Best,
Jason Sim

keenborder786 · May 11, 2026, 8:11pm

Will have a look.
