I’ve been spending time on multi-agent workflows lately, and I keep coming back to the same question:
once multiple agents or tool calls can mutate the same file / DB row / shared state, what is the expected safety model?
The failure mode I’m seeing is pretty mundane, but nasty:
agent A reads shared state
agent B reads the same shared state
both make reasonable updates
one write lands after the other
the final state is syntactically fine, but part of the work is gone
For example, I’ve seen this when two agents are editing the same repo or updating the same task list. Both changes look valid, but one silently wipes part of the other.
So this doesn’t really feel like “the model got confused.” It looks like a plain read-modify-write race.
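A minimal sketch of that interleaving, with no model involved. The `store`, `read`, and `write` names are hypothetical stand-ins for any shared resource with whole-value writes; the interleaving is scripted by hand so the result is deterministic.

```python
import copy

# Hypothetical shared resource: a task list stored as a single value.
store = {"tasks": ["write docs"]}

def read(store):
    # Each agent works on its own snapshot of the shared state.
    return copy.deepcopy(store["tasks"])

def write(store, tasks):
    # Blind write: replaces the whole value, with no version check.
    store["tasks"] = tasks

a_view = read(store)                 # agent A reads shared state
b_view = read(store)                 # agent B reads the same shared state
a_view.append("add auth routes")     # A makes a reasonable update
b_view.append("fix login bug")       # B makes a reasonable update
write(store, a_view)                 # A's write lands first
write(store, b_view)                 # B's write lands second and replaces A's

final_tasks = store["tasks"]
print(final_tasks)  # ['write docs', 'fix login bug']
```

The final list is well-formed, yet A's task is gone and nothing raised an error.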
And I don’t think this stays an edge case for long. The moment multiple agents intentionally collaborate on the same resource, this pattern seems unavoidable.
My current view is that orchestration frameworks help with sequencing, but don’t guarantee correctness once multiple workers can mutate the same resource.
I’ve been working on an early project called Klock around this exact problem, so take that with the right bias. The thing we’re testing is a coordination layer around shared mutable resources so conflicting writes don’t silently overwrite each other. Still early, not posting this as a polished launch. I’m mostly trying to sanity-check the problem first.
A few concrete questions for people here:
Are you seeing this in real multi-agent setups, or mostly in demos?
When you do see it, how are you handling it today?
Do you treat it as an application concern, or do you think the ecosystem should have a standard safety pattern for it?
If useful, I can share a tiny repro that shows two workers silently stepping on each other’s updates, and the same flow with coordination added.
Hello! This class of problems is why State exists in LangGraph - updates are done by returning from a node, and parallelism is handled via a reducer. The underlying executor prevents data races in that way.
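To make the reducer idea concrete, here is a rough pure-Python sketch. It does not use LangGraph and is not how its executor is implemented; `REDUCERS`, `apply_updates`, `node_a`, and `node_b` are hypothetical names. It only shows why returning updates and merging them through a reducer avoids the lost write.

```python
import operator

# Hypothetical per-key reducers: how an existing value combines with an update.
# Keys without a reducer are simply overwritten.
REDUCERS = {"notes": operator.add}

def apply_updates(state, updates):
    """Merge the partial updates returned by nodes that ran in the same step."""
    new_state = dict(state)
    for update in updates:
        for key, value in update.items():
            reducer = REDUCERS.get(key)
            if reducer is not None and key in new_state:
                new_state[key] = reducer(new_state[key], value)
            else:
                new_state[key] = value
    return new_state

def node_a(state):
    return {"notes": ["A: added auth routes"]}

def node_b(state):
    return {"notes": ["B: fixed login bug"]}

state = {"notes": ["start"]}
# Both nodes see the same input snapshot and return updates instead of mutating it.
updates = [node_a(state), node_b(state)]
merged = apply_updates(state, updates)
print(merged["notes"])  # ['start', 'A: added auth routes', 'B: fixed login bug']
```

In LangGraph itself the reducer is declared on the state schema rather than in a separate table like this; the current docs have the exact form.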
If you’re trying to mutate an untracked object, modify an external DB, etc., then you’ll have to rely on that external resource’s own concurrency mechanism (transactions, etc.).
A common pattern at the agent level if you’re going that way is to track the last read time and only commit the write if the agent has read from the particular file/object since the last edit. Depending on your object type, other collaboration primitives exist, of course. For instance, CRDTs are useful for making a collaborative doc, but they really only guarantee that the different agents have a consistent view of the world rather than ensuring that the view is “correct”.
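A small sketch of the read-before-write guard described above, assuming an in-memory stand-in for the file and a logical clock instead of wall time. `GuardedFile` is a hypothetical name, and the check and the write are not atomic here, so a real version would need to run them under a lock or a transaction.

```python
import itertools

class GuardedFile:
    """In-memory stand-in for a file that rejects writes based on stale reads."""

    def __init__(self, content):
        self.content = content
        self._clock = itertools.count(1)   # logical clock, not wall time
        self.last_edit = next(self._clock)
        self.last_read = {}                # agent id -> logical time of last read

    def read(self, agent):
        self.last_read[agent] = next(self._clock)
        return self.content

    def write(self, agent, new_content):
        # Commit only if this agent has read the file since the last edit.
        if self.last_read.get(agent, 0) < self.last_edit:
            return False
        self.content = new_content
        self.last_edit = next(self._clock)
        return True

f = GuardedFile("v1")
a_text = f.read("A")
b_text = f.read("B")
a_ok = f.write("A", a_text + " + A's change")   # accepted
b_ok = f.write("B", b_text + " + B's change")   # rejected: B's read is stale
# B re-reads and retries on top of A's change.
b_retry_ok = f.write("B", f.read("B") + " + B's change")
print(a_ok, b_ok, b_retry_ok, f.content)
```

The rejected write is the useful part: the agent gets an explicit signal to re-read instead of silently replacing A's work.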
Would love to hear more about what you’re doing, however. We’ve debated adding some more advanced channel types/APIs that would allow other update patterns in the past.
State + reducer makes sense to me for handling concurrency inside the graph.
The cases I keep running into are just outside that though, like two agents editing the same repo, writing to the same table/queue, or touching the same shared file where the resource itself isn’t really part of the graph state.
In those situations it feels like you end up back at whatever the underlying system gives you (transactions, version checks, etc.), or you rebuild some form of that logic at the app layer.
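As an illustration of the "version check" option, here is a sketch of optimistic concurrency on a single row using the standard-library `sqlite3` module. The `tasks` table, its `version` column, and the helper names are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
conn.execute("INSERT INTO tasks VALUES (1, 'draft', 1)")
conn.commit()

def read_task(conn, task_id):
    # Return the row content together with the version it was read at.
    return conn.execute(
        "SELECT body, version FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()

def update_task(conn, task_id, new_body, expected_version):
    # The write only applies if nobody bumped the version since our read.
    cur = conn.execute(
        "UPDATE tasks SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_body, task_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1

a_body, a_ver = read_task(conn, 1)
b_body, b_ver = read_task(conn, 1)
a_ok = update_task(conn, 1, a_body + " + A", a_ver)   # True: version still 1
b_ok = update_task(conn, 1, b_body + " + B", b_ver)   # False: version is now 2
final_body, final_version = read_task(conn, 1)
print(a_ok, b_ok, final_body, final_version)  # True False draft + A 2
```

The second writer learns that its snapshot was stale and can re-read and retry, which is the behaviour the blind-write flow lacks.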
That’s the part I’ve been exploring with Klock: not really how to merge state inside the graph, but what the safety model looks like once multiple agents can mutate the same external resource.
Interesting that you’ve thought about more advanced channel types here. Curious how you think about that boundary: should this stay outside LangGraph, or is it something the ecosystem might eventually handle more directly?
This maps to something I keep seeing too. Orchestrators hand you correctness on state that lives inside the graph (reducers, channel semantics, wfh’s point above), but the moment the agents reach outside the graph to a repo, a row, or a file, you’re back in plain read-modify-write territory and have to rebuild version checks, CAS, or leasing at the app layer. The nastiest part is exactly what you said: the final state is syntactically fine, the diff looks reasonable, and one side’s work is just gone.
Genuinely curious: in the cases you ran into (two agents editing the same repo or the same task list), how did you actually notice the silent overwrite? Was it a user complaint, a diff review, or did you catch it because one of the agents’ own checks started failing downstream?
The first way we noticed it was not from a crash. The build still passed, the repo state looked normal, but when we ran the app, part of the feature was just missing.
After that, we started comparing what each agent said it had completed against what actually survived in the repo. That made the failure obvious: an agent would report success, but a later valid write had overwritten the same surface.
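That comparison can be sketched roughly as follows, assuming each agent reports the file it touched plus a marker string that should still be present afterwards. The function name, the report format, and the sample data are all hypothetical and are not taken from the actual runs.

```python
def find_lost_changes(claims, final_files):
    """Return (agent, path, marker) for every claimed change missing from the final state."""
    lost = []
    for agent, items in claims.items():
        for path, marker in items:
            if marker not in final_files.get(path, ""):
                lost.append((agent, path, marker))
    return lost

# Made-up example data: what each agent said it completed.
claims = {
    "agent-1": [("routes/auth.js", "router.post('/login'")],
    "agent-2": [("routes/auth.js", "router.get('/health'")],
}
# Made-up example data: what actually survived in the repo.
final_files = {
    "routes/auth.js": "router.get('/health', handler)\n",
}

lost = find_lost_changes(claims, final_files)
print(lost)  # [('agent-1', 'routes/auth.js', "router.post('/login'")]
```

A substring check is crude, but it is enough to turn "the agent reported success" into something that can be verified against the final state.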
We then tested it more directly with 5 Claude Code agents on the same Express.js repo. In the concurrent run, about 24% of intended changes disappeared while the build still passed, and one agent’s auth routes were completely lost.
So I think this lines up with @wfh’s distinction above: LangGraph state/reducers give you a safety model inside the graph, but once agents mutate external shared resources like repo files, you’re back in plain read-modify-write territory.
Orchestration and scoped tasks help reduce overlap, but they don’t fully solve that boundary if two workers can still touch the same shared surface. That’s what pushed us toward Klock: pre-write coordination for cooperative file-mutating agents. Not as a replacement for LangGraph state, but as a separate layer for shared resources outside the graph.
Important limitation: OSS v1 is still narrow. It works for cooperative agents that call the SDK or wrapped tools before mutation. It’s not transparent filesystem-level enforcement yet.
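For readers unfamiliar with cooperative pre-write coordination in general, here is a generic sketch of the idea. It is not Klock's API or implementation; `LeaseRegistry` and its methods are hypothetical, and it only protects agents that actually ask before writing.

```python
import threading

class LeaseRegistry:
    """Tracks which agent currently holds the right to mutate each resource."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._holders = {}   # resource path -> agent id

    def acquire(self, agent, resource):
        # Grant the lease only if the resource is free or already held by this agent.
        with self._mutex:
            holder = self._holders.get(resource)
            if holder is not None and holder != agent:
                return False
            self._holders[resource] = agent
            return True

    def release(self, agent, resource):
        # Only the current holder can release the lease.
        with self._mutex:
            if self._holders.get(resource) == agent:
                del self._holders[resource]

registry = LeaseRegistry()
a_got = registry.acquire("A", "routes/auth.js")      # True: resource was free
b_got = registry.acquire("B", "routes/auth.js")      # False: A holds the lease
registry.release("A", "routes/auth.js")
b_retry = registry.acquire("B", "routes/auth.js")    # True: A has released
print(a_got, b_got, b_retry)  # True False True
```

An agent that writes without calling `acquire` bypasses this entirely, which is the same limitation as any cooperative scheme. A production version would also need lease expiry so a crashed agent cannot hold a resource forever.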
The race condition that causes the most subtle damage is not in execution but in retrieval. When two agents query the same vector store concurrently and one of them writes back a modified chunk while the other is mid-retrieval, the second agent can get a mix of pre-update and post-update chunks in the same retrieval set. The agent has no way to detect this because each chunk looks valid on its own.
The fix is not locking the vector store. The fix is chunk versioning: each chunk carries a version hash (a content hash taken at write time), and the retrieval set records which versions were returned. If a governance check later finds that two chunks in the same retrieval set carry different versions for the same source document, that retrieval set is flagged as potentially inconsistent. This is a data integrity problem disguised as a concurrency problem.
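A minimal sketch of that check, under one assumption the post leaves open: the version hash is computed over the whole source document at write time, so all chunks from one ingest share it. `make_chunks`, `inconsistent_docs`, and the sample documents are hypothetical.

```python
import hashlib
from collections import defaultdict

def make_chunks(doc_id, doc_text, size):
    # Assumption: one version hash per document write, stamped on every chunk.
    doc_version = hashlib.sha256(doc_text.encode("utf-8")).hexdigest()[:12]
    return [
        {"doc_id": doc_id, "doc_version": doc_version, "text": doc_text[i:i + size]}
        for i in range(0, len(doc_text), size)
    ]

def inconsistent_docs(retrieval_set):
    """Return doc ids whose chunks in this retrieval set come from different versions."""
    versions = defaultdict(set)
    for chunk in retrieval_set:
        versions[chunk["doc_id"]].add(chunk["doc_version"])
    return sorted(doc_id for doc_id, seen in versions.items() if len(seen) > 1)

old = make_chunks("policy", "refunds within 30 days. contact support.", 24)
new = make_chunks("policy", "refunds within 14 days. contact billing.", 24)

mixed = [new[0], old[1]]    # retrieval raced with the write-back
clean = [new[0], new[1]]    # all chunks from the same version

flagged = inconsistent_docs(mixed)
ok = inconsistent_docs(clean)
print(flagged, ok)  # ['policy'] []
```

The check runs after retrieval and needs no lock on the store; it only requires that the version travels with each chunk and is recorded with the retrieval set.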