Hi @gouveags
Going through your four questions with the relevant source/docs pointers:
1. Does /threads/{thread_id}/copy copy the full checkpoint/write chain?
Yes, by contract. BaseCheckpointSaver.copy_thread is explicit that implementations must copy the complete parent chain (all ancestor checkpoints and their checkpoint_writes) - copying only the head would silently break DeltaChannel reconstruction. That’s why there’s no built-in “copy depth” knob. The platform handler (libs/langgraph-api/src/api/threads.mts → storage/ops.mts) is a thin wrapper that awaits checkpointer.copy(...) synchronously, and the REST schema confirms: no request body, no ?async, no progress endpoint.
One amplifier worth checking: if you’re on BYOC with a custom checkpointer that hasn’t implemented acopy_thread, the platform falls back to a slow path that “re-inserts checkpoints one by one” (custom-checkpointer docs). On a tool-heavy long thread that’s exactly the regime where you hit double-digit minutes.
2. Is there a supported shallow-copy pattern?
Yes - not via copy, but via threads.create(supersteps=...). The SDK docstring literally says “Used for copying a thread between deployments.” and the Prepopulated state doc covers the same shape. It’s O(1) in history size and avoids the DeltaChannel issue because channels get a fresh snapshot through __start__:
state = await client.threads.get_state(thread_id=source_thread_id)
forked = await client.threads.create(
graph_id="agent",
metadata={"forked_from": source_thread_id},
supersteps=[{
"updates": [{"values": state["values"], "as_node": "__start__"}]
}],
)
Trade-off: you lose time-travel history and mid-interrupt context. For most “duplicate this conversation” UX that’s fine; for “rewind to any past turn” it isn’t.
3. Recommended production pattern
Two paths, picked by what the user actually needs:
- Fork latest state (default, user-facing):
threads.create(supersteps=...) as above. Single roundtrip, no Lambda timeout risk.
- Fork with full history: move it off Lambda. Click → enqueue (SQS / Step Functions / Fargate) → return a
copy_job_id immediately → worker calls threads.copy(...) with a long HTTP timeout → UI polls. A 12-min synchronous call inside a 15-min Lambda will eventually lose that race no matter how you budget it.
For an “estimate before firing” signal, threads.get_history(thread_id, limit=...) + the size of values is a decent heuristic - flip the UI to “Fork latest state” automatically past a checkpoint-count / payload-size threshold.
Also worth knowing: if you only want to branch from a past point on the same thread, threads.update_state(thread_id, values, checkpoint_id=...) does that without any copy.
4. Feasibility of async API / depth option
Both look feasible. An async mode is straightforward (enqueue + poll, like runs). A “latest only” depth flag would essentially need to go through the same write path as supersteps (write a fresh head, not slice the chain) to stay safe with DeltaChannel. Until that lands, supersteps ≈ mode=latest_state and a worker queue ≈ async=true, both supported and stable.
Hope this helps!