When is it actually a failure? Diagnosing agent behavior beyond LangGraph traces

Agents often:

  • return plausible but incorrect answers
  • continue after tools return no data
  • quietly fall back to general knowledge

LangGraph + tracing tools (LangSmith, etc.) make it easy to see what happened.

But in practice:

it’s still hard to tell whether the behavior is actually a failure.


Example (see screenshot)

In this run:

  • the tool returned no data
  • the agent acknowledged the gap
  • and still produced a general answer

The system evaluates it as:

  • no failure detected
  • risk: LOW → because the agent explicitly disclosed the lack of grounding
  • no fix needed

Failure vs Acceptable (explicit definition)

In this system:

  • Acceptable
    → no data + uncertainty acknowledged

  • Failure
    → no data + confident answer without grounding


Approach

A lightweight diagnostic layer on top of LangGraph:

from adapters.callback_handler import watch

graph = watch(workflow.compile(), auto_diagnose=True)
result = graph.invoke({"messages": [...]})

No changes to the workflow.


How the classification works

Currently rule-based (no LLM) for:

  • deterministic behavior
  • easier debugging
  • no additional model cost

Using runtime signals such as:

  • tool outputs (data vs empty/error)
  • uncertainty patterns (e.g. “I couldn’t find”, “no data available”)
  • execution flow (tool usage vs direct answer)

Example signals:

  • tool_provided_data = False → no grounding
  • uncertainty_acknowledged = True/False
  • answer without data → fallback or hallucination
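
A minimal sketch of how rules like these could be written. It is not the toolchain's actual code: the phrase list, the `RunSignals` container, and the function names `extract_signals` and `classify` are assumptions made for illustration, and tool errors are not handled (only empty outputs count as "no data").

```python
import re
from dataclasses import dataclass

# Hypothetical uncertainty phrases; a real list would be longer and tuned.
UNCERTAINTY_PATTERNS = [
    r"\bI couldn[’']t find\b",
    r"\bno data available\b",
    r"\bI don[’']t have (?:any )?(?:data|information)\b",
]


@dataclass
class RunSignals:
    tool_provided_data: bool
    uncertainty_acknowledged: bool


def extract_signals(tool_outputs: list[str], answer: str) -> RunSignals:
    """Derive the two signals from raw tool outputs and the final answer."""
    # Assumption: empty or whitespace-only outputs mean "no data".
    has_data = any(out.strip() for out in tool_outputs)
    acknowledged = any(
        re.search(p, answer, flags=re.IGNORECASE) for p in UNCERTAINTY_PATTERNS
    )
    return RunSignals(has_data, acknowledged)


def classify(signals: RunSignals) -> str:
    """Apply the Acceptable / Failure definition given earlier in the post."""
    if signals.tool_provided_data:
        return "grounded"
    if signals.uncertainty_acknowledged:
        return "acceptable_fallback"
    return "failure_ungrounded_answer"
```

Because the rules are plain pattern checks, the same inputs always produce the same verdict, which is the property the rule-based design is after.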

What this adds beyond tracing

LangGraph / LangSmith:
→ execution traces

This layer adds:

  • structural interpretation
  • failure classification
  • risk assessment
  • action recommendation

In short: trace → signals → interpretation → risk → action

Example actions:

  • block answer generation
  • retry with different tool
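
A rough sketch of the signals → interpretation → risk → action chain for the fallback case. The LOW rating for a disclosed fallback follows the screenshot example; the HIGH rating and the choice of "block answer generation" for an undisclosed one are assumptions, as are the `Risk` enum and the `assess` function.

```python
from enum import Enum


class Risk(Enum):
    LOW = "LOW"
    HIGH = "HIGH"


def assess(tool_provided_data: bool, uncertainty_acknowledged: bool) -> dict:
    """Walk signals -> interpretation -> risk -> action for one run."""
    if tool_provided_data:
        interpretation, risk, action = "grounded answer", Risk.LOW, "none"
    elif uncertainty_acknowledged:
        # Matches the example run: the gap was disclosed, so risk stays LOW.
        interpretation, risk, action = "disclosed fallback", Risk.LOW, "none"
    else:
        # Assumed mapping: an undisclosed, ungrounded answer gets blocked.
        interpretation, risk, action = (
            "possible hallucination",
            Risk.HIGH,
            "block answer generation",
        )
    return {"interpretation": interpretation, "risk": risk.value, "action": action}
```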

Scope

This example focuses on fallback vs hallucination,
but the same framework extends to:

  • hallucination after tool failure
  • tool call loops
  • retrieval drift in RAG systems

Limitations

This is intentionally heuristic and imperfect:

  • may misclassify nuanced phrasing
  • depends on pattern detection (not full semantic understanding)

Designed as a debugging signal, not strict evaluation.


Code

If you want to see how this works in practice:

Minimal example (~10 sec, no API key)

Full system:


Closing

The goal is not to be perfectly correct,
but to make failure modes explicit and debuggable.


Looking for feedback

Interested in real-world cases like:

  • hallucination after tool failure
  • silent tool loops
  • confident but irrelevant outputs

How are you distinguishing
“failure” vs “acceptable behavior” in production?

The three failure modes you listed — plausible but wrong, continuing after empty tool returns, falling back to general knowledge — have a common root cause that traces don’t surface. In most cases: the retrieved chunks were technically present but informationally insufficient. The agent didn’t fail because the retriever missed. It failed because what was retrieved looked relevant but lacked the specific content needed to answer correctly.

LangGraph traces show you the retrieval happened. They don’t show you whether the retrieved chunk was semantically dense enough to be useful, or whether the critical information was split across two chunks during ingestion and neither half contains a complete thought. The diagnostic question that’s missing from most agent observability setups:

For each retrieval step that preceded a wrong answer, what was the intrinsic quality of the chunks returned?
Not “were they the most similar?” but “did they actually contain answerable information?” Once you score the chunks themselves, the pattern usually becomes obvious. Certain document sources or certain chunking boundaries produce consistently low-quality chunks.
Those are the specific areas to fix — not the agent architecture, not the prompt, not the retriever parameters.

The agent is often behaving correctly given what it was given. The problem is what it was given!

Good point — I agree that chunk quality is often the underlying issue. In many cases, the agent is behaving correctly given its inputs, but those inputs are insufficient or fragmented.

This is intentionally outside the current scope. The toolchain focuses on agent behavioral signals (e.g., whether the agent used tool outputs, acknowledged uncertainty, or entered loops) rather than evaluating the quality of retrieved content itself. The design is deliberately deterministic and embedding-free: it can detect “the agent answered without grounding”, but not “the grounding existed but was informationally insufficient.”

What you’re describing — scoring chunks based on intrinsic answerability rather than similarity — is a complementary upstream layer.

In that sense, the separation becomes:

  • Upstream (your approach): Diagnose input quality
    → “Did the retrieved chunks actually contain the information needed to answer?”

  • Downstream (this toolchain): Diagnose agent behavior
    → “Given what it received, did the agent respond appropriately?”

Both are necessary.
This layer tells you that a failure occurred after retrieval.
Your layer explains why — the chunks appeared relevant but lacked sufficient informational density.

You’ve articulated the separation perfectly, and it’s exactly the architecture we’ve been building toward.
The upstream layer you describe is what we built with ChunkScore (ragprep.com). The core thesis: similarity-based retrieval answers “did we find something that looks relevant?” but not “does this chunk actually contain enough informational density to answer the question?” A chunk can score 0.95 cosine similarity and still be informationally hollow: it contains the right vocabulary but not the right knowledge.

What we found in practice is that most RAG failures that get blamed on the agent or the LLM are actually chunk quality failures in disguise. The agent behaves rationally given its inputs. The inputs are the problem. Your toolchain correctly diagnoses that the failure occurred post-retrieval. Ours diagnoses why: the chunks passed the similarity gate but failed the answerability test.

The practical stack becomes:

  1. ChunkScore / RAGPrep (upstream): Score chunks on intrinsic answerability, structural coherence, and informational density before they enter the vector store. Bad chunks never reach retrieval.
  2. Your toolchain (downstream): Monitor agent behavioral signals — did it ground its answer, acknowledge uncertainty, or enter loops? Diagnose failures that survive the upstream filter.
  3. The feedback loop (where it gets interesting): When your toolchain detects “agent answered without grounding” or “agent entered a retrieval loop,” that signal feeds back upstream to flag which chunks or chunk boundaries caused the failure. The downstream diagnosis becomes the upstream training signal.

The deterministic, embedding-free approach you’ve taken is very smart: it means the behavioral layer doesn’t inherit the same blind spots as the retrieval layer. Different failure modes, different detection mechanisms, complementary coverage.

Would be interested to explore whether there’s a clean integration point between the two, particularly the feedback loop from behavioral diagnosis to chunk quality scoring.

That feedback loop is where things get really interesting.

Right now the two layers diagnose different projections of the same failure — upstream catches whether the input was actually answerable, downstream captures how the agent behaved given that input. The missing piece is attribution: when the behavioral layer detects “answered without grounding” or a retrieval loop, how do we trace that back to specific chunks that were insufficient, rather than just saying the agent failed?

On this side, the toolchain emits structured signals on grounding failure — whether tools returned usable data, whether the agent disclosed the gap, and the ratio between source data and generated response. These are aggregated at the run level to capture the overall behavioral trajectory rather than individual call-level traces.

Would be interested to see how ChunkScore represents chunk-level signals on that side. If it exposes a surface that can accept runtime attribution rather than just pre-index scoring, that feels like a natural place where the loop could close.

Happy to share how the diagnostic output is structured if useful.

The attribution gap you are describing is precisely where ChunkScore’s current surface ends and where the next layer needs to begin. Right now ChunkScore operates pre-ingestion: it scores semantic density, completeness, faithfulness prediction, conflict flags, and freshness signals at the chunk level before any embedding occurs. The signals exist at chunk resolution. What does not exist yet is a runtime attribution surface that accepts a behavioral failure signal and maps it back to the specific chunks that were insufficient.

The gap you identified is real: when the behavioral layer detects “answered without grounding” or a retrieval loop, the current ChunkScore architecture has no way to receive that signal and use it to update the upstream quality model. The loop is open.

The chunk-level signals that would be most useful for your attribution use case are faithfulness prediction score per chunk, semantic density score per chunk, and conflict flag status. These are currently output at pre-ingestion time. Exposing them as a queryable API surface at runtime — keyed by chunk ID — is the architectural change that would allow your grounding failure signals to close the loop upstream.

The structured diagnostic output you described — grounding failure type, tool return quality, source-to-response ratio — is exactly the signal that would make pre-ingestion scoring more accurate over time if it could be fed back as labeled outcome data.

Would be very useful to see how your diagnostic output is structured. Specifically interested in whether you are keying failures to chunk IDs or document IDs, and what the schema looks like for a grounding failure event. That determines what the attribution API surface on the ChunkScore side needs to look like.

That’s a very clear breakdown of the chunk-level signals.

On the diagnostic side, the current output is not keyed to chunk IDs or document IDs. The grounding failure signals are aggregated at the run level — they capture whether any tool returned usable data, the ratio between source data and generated response, and whether error conditions were present in tool outputs. Per-run summaries, not per-chunk.

The gap you’re pointing to feels less like a missing field and more like a representation mismatch: sequence-level post-transformation behavior on one side, unit-level pre-retrieval quality on the other. In principle, if chunk identity is available in the retrieval trace, a linking layer could associate a failure event with candidate chunks from that run. But attribution becomes ambiguous once multiple chunks contribute partially or the agent mixes information across steps.

Curious how you’re handling chunk identity at runtime — especially in cases where the mapping isn’t one-to-one.

The honest answer is that ChunkScore does not currently track chunk identity at runtime. The quality signals are computed pre-ingestion and stored as metadata at the chunk level, but once chunks enter the vector store there is no runtime hook that links a retrieval event back to those pre-computed scores.

Your framing of it as a representation mismatch is more precise than how I had been thinking about it. The problem is not a missing field in either system — it is that the two representations are not in the same coordinate space. Pre-ingestion quality is a property of a chunk in isolation. Run-level behavioral signals are properties of a sequence of events. Connecting them requires something that neither system currently produces: a retrieval trace that preserves chunk identity through the transformation pipeline.

The partial contribution problem you raised is the hard case. If three chunks each contribute partially to an answer and the behavioral layer detects a grounding failure, the failure is a function of the combination, not of any individual chunk. Attribution at the chunk level in that case is not just ambiguous; it may be the wrong unit of analysis entirely. What you actually want to know is whether the retrieved set collectively contained sufficient grounding signal, not whether any individual chunk was deficient.

That suggests the linking layer needs to operate at the retrieval set level, not the chunk level. A retrieval set quality score (the aggregate grounding potential of the chunks that were actually retrieved together for a given query) would be a meaningful unit that sits between your run-level behavioral signals and the individual chunk quality scores ChunkScore currently produces.

The open question on the ChunkScore side is whether the vector store preserves enough chunk identity information in the retrieval trace to reconstruct which pre-ingestion quality scores correspond to which retrieved passages. That depends heavily on how the retrieval pipeline is instrumented. In most production deployments the answer is no: chunk identity is lost between embedding and retrieval unless someone explicitly built provenance tracking into the pipeline.

Would it be useful to sketch what minimum instrumentation the retrieval pipeline would need to make the linking layer feasible — what specifically needs to be preserved at retrieval time on both sides for the representation mismatch to be bridgeable?

The retrieval-set framing feels like a much more natural unit than chunk-level attribution — it matches how the failure actually manifests.

On the instrumentation question, the minimal requirement seems less about adding new scoring signals and more about preserving alignment between three things at runtime:

  • what was retrieved (identity / grouping of chunks as a set)
  • how that information was transformed across steps (tool outputs, intermediate states)
  • the final behavioral outcome (grounding success or failure)

If those three can be kept in the same trace, attribution becomes analyzable rather than inferred.

From the behavioral side, this likely just requires exposing identifiers when they’re available in the execution trace, rather than introducing new signal types.

The harder part is on the retrieval pipeline side, and you’ve already identified it — chunk identity is lost in most deployments unless provenance tracking is explicitly built in. If you sketch what minimum provenance the retrieval side would need to preserve, I can map the corresponding interface on the behavioral side. That gives both sides a concrete spec rather than an abstract integration idea.

Hi K,

Here is a sketch of the minimum provenance the retrieval side would need to preserve for the linking layer to work.

Three things must survive from retrieval through to the behavioral trace:

  1. CHUNK SET ID
    A unique identifier for the specific set of chunks retrieved for a given query. Not individual chunk IDs, but the set as a unit. This is what your behavioral signals would key against. When your system detects a grounding failure on run X, it needs to reference “the chunks that were retrieved for query Y in run X” as a single addressable unit. Format: a hash of the sorted chunk IDs retrieved for that query. Deterministic, reproducible, and independent of retrieval order.

  2. PER-CHUNK QUALITY METADATA
    Each chunk in the retrieved set carries its pre-ingestion quality scores as metadata fields in the vector store. At minimum:
      • semantic_density (0-1)
      • completeness (0-1)
      • faithfulness_prediction (0-1)
      • freshness_status (fresh/aging/stale/rot)
      • conflict_flag (boolean)
      • chunk_source_document_id (string)
    These are computed once at ingestion time and stored as payload metadata alongside the embedding. No runtime computation required: they travel with the chunk.

  3. RETRIEVAL EVENT LOG
    A structured log entry emitted at retrieval time containing:
      • query_text or query_hash
      • chunk_set_id (from point 1)
      • individual chunk_ids in the set
      • retrieval_scores (similarity scores per chunk)
      • aggregate_quality_score (mean of per-chunk quality scores weighted by retrieval rank)
      • timestamp

This log entry is the join key. Your behavioral trace references the same chunk_set_id. A failure event on the behavioral side can be looked up against the retrieval event log to see exactly what was retrieved and what its pre-computed quality profile was.
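
A sketch of what the retrieval-side wrapper might emit, under stated assumptions: SHA-256 as the hash, 1/rank as the rank weighting (the thread does not fix either), and a single quality number per chunk. The function names are illustrative only.

```python
import hashlib
import time


def chunk_set_id(chunk_ids: list[str]) -> str:
    """Hash of the sorted chunk IDs: same set gives the same ID in any order."""
    joined = "\n".join(sorted(chunk_ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def rank_weighted_mean(quality_scores: list[float]) -> float:
    """Mean of per-chunk quality, weighted by retrieval rank.

    Assumes scores are listed in rank order and uses 1/rank weights.
    """
    if not quality_scores:
        return 0.0
    weights = [1.0 / rank for rank in range(1, len(quality_scores) + 1)]
    return sum(w * q for w, q in zip(weights, quality_scores)) / sum(weights)


def retrieval_event(query_text: str, ranked_chunks: list[tuple]) -> dict:
    """Build one log entry.

    ranked_chunks: (chunk_id, similarity_score, quality_score) tuples
    in rank order.
    """
    ids = [c[0] for c in ranked_chunks]
    return {
        "query_hash": hashlib.sha256(query_text.encode("utf-8")).hexdigest(),
        "chunk_set_id": chunk_set_id(ids),
        "chunk_ids": ids,
        "retrieval_scores": [c[1] for c in ranked_chunks],
        "aggregate_quality_score": rank_weighted_mean([c[2] for c in ranked_chunks]),
        "timestamp": time.time(),
    }
```

Sorting before hashing is what makes the ID independent of retrieval order, while `chunk_ids` in the log entry still preserves the rank order.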

The aggregate quality score at the retrieval set level is the signal that closes the loop. If your behavioral layer detects a grounding failure and the corresponding retrieval set has an aggregate quality score below a threshold, the failure is attributable to upstream data quality. If the aggregate score is high but the behavioral signal still shows failure, the failure is in the transformation or generation layer, not the data. That distinction is what neither system can currently make on its own.

The implementation cost on the retrieval side is low — it requires adding metadata fields at ingestion time (ChunkScore already computes these) and emitting a structured log at retrieval time (a wrapper around the vector store query). The chunk_set_id is a derived value, not a new entity.

What does the corresponding interface on the behavioral side need to look like to consume this? Specifically — does your system need to receive the chunk_set_id in the execution trace passively, or does it need to query the retrieval event log actively when a failure is detected?

That’s a clean spec — the chunk_set_id as a hash of sorted chunk IDs is a good design choice. Deterministic and order-independent.

On your question: passive. The behavioral side consumes whatever is available in the execution trace rather than querying external systems.

If something like a chunk_set_id is present in the tool output or retrieval trace, it can be carried through and attached to the grounding signal, making failure events joinable against the retrieval event log. If not, the system continues to operate as it does today.

So the linking effectively happens outside both systems — using the identifier as a join key in a shared analysis layer — which keeps both sides loosely coupled and independently deployable.
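
As a sketch of that shared analysis layer, the join could be as simple as a dictionary lookup. The event shapes (`run_id`, `chunk_set_id`, `aggregate_quality_score`) and the 0.5 threshold are assumptions for illustration, and the verdicts should be read as hints, not conclusions.

```python
QUALITY_THRESHOLD = 0.5  # hypothetical cut-off, to be tuned on real data


def attribute_failures(failure_events: list[dict], retrieval_log: list[dict]) -> list[dict]:
    """Join behavioral failure events to retrieval events on chunk_set_id."""
    # If the same set was retrieved more than once, the latest entry wins.
    by_set = {entry["chunk_set_id"]: entry for entry in retrieval_log}
    results = []
    for event in failure_events:
        # The identifier is optional: runs without it stay unattributed.
        retrieval = by_set.get(event.get("chunk_set_id"))
        if retrieval is None:
            verdict = "unattributed"
        elif retrieval["aggregate_quality_score"] < QUALITY_THRESHOLD:
            verdict = "likely upstream data quality"
        else:
            verdict = "likely transformation or generation"
        results.append({"run_id": event["run_id"], "verdict": verdict})
    return results
```

Neither system imports the other here: each only has to write its own records with the shared key.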

One thing worth flagging: the aggregate quality score is a useful signal, but it assumes chunk contributions are roughly independent. In practice, a high-quality retrieval set can still produce a grounding failure depending on how information is selected or combined downstream. So the “data problem vs transformation problem” distinction is real, but may not be cleanly separable by a single aggregate score alone.

Here is the minimum provenance the retrieval pipeline would need to preserve for the linking layer to work.

Three fields per retrieval event, carried through the execution trace:

  1. CHUNK IDENTITY
    Each retrieved passage needs a stable identifier that maps back to its pre-ingestion record.
    Minimum: a hash of the original chunk content at ingestion time, stored as metadata in the
    vector store alongside the embedding. This survives re-embedding and index rebuilds as long as the
    content is unchanged.

    Format: chunk_id (string, content hash)
    Set at: ingestion time
    Preserved through: vector store metadata, retrieval response, execution trace

  2. RETRIEVAL SET ENVELOPE
    The set of chunks retrieved for a single query step needs to be grouped as a unit. Not just the
    individual chunks but the query that produced them, the retrieval method used, and the rank order
    returned.

    Format:
    {
      retrieval_id: string (unique per retrieval event),
      query_text: string,
      method: string (dense / sparse / hybrid / reranked),
      chunks: [
        {
          chunk_id: string,
          rank: integer,
          similarity_score: float,
          pre_ingestion_quality: {
            semantic_density: float,
            completeness: float,
            faithfulness: float,
            freshness_status: string,
            conflict_flag: boolean
          }
        }
      ]
    }

    The pre_ingestion_quality block is what ChunkScore produces at ingestion time.
    Attaching it to the retrieval envelope means the behavioral layer can see both what was retrieved
    and what the upstream quality assessment was for each chunk in the set.

  3. TRANSFORMATION ANCHOR
    When the agent transforms retrieved content across steps, the retrieval_id needs to propagate. If
    step 3 uses output from step 2 which used retrieval set R1, then step 3 should carry R1 as an
    ancestor reference.
    This does not require tracking every intermediate transformation. It requires a lineage chain: step
    N references which retrieval sets contributed to its input.

    Format:
    {
      step_id: string,
      ancestor_retrievals: [retrieval_id, …],
      transformation_type: string
    }
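
To make the content-hash ID and the lineage chain concrete, here is a small sketch. SHA-256 is an assumed hash choice, and the `Step` fields `own_retrievals` and `inputs` are hypothetical bookkeeping used to derive the ancestor list; they are not part of the spec above.

```python
import hashlib
from dataclasses import dataclass, field


def content_chunk_id(chunk_text: str) -> str:
    """Stable chunk ID: a hash of the chunk content at ingestion time."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


@dataclass
class Step:
    step_id: str
    transformation_type: str
    # Retrieval sets this step read directly (empty for pure transforms).
    own_retrievals: list[str] = field(default_factory=list)
    # IDs of earlier steps whose output this step consumed.
    inputs: list[str] = field(default_factory=list)


def ancestor_retrievals(step_id: str, steps: dict[str, Step]) -> list[str]:
    """Collect every retrieval_id that fed into a step, following its inputs."""
    seen_steps, found = set(), []
    stack = [step_id]
    while stack:
        current = steps[stack.pop()]
        if current.step_id in seen_steps:
            continue
        seen_steps.add(current.step_id)
        for rid in current.own_retrievals:
            if rid not in found:
                found.append(rid)
        stack.extend(current.inputs)
    return found
```

With step 2 reading retrieval set R1 and step 3 consuming step 2's output, `ancestor_retrievals("s3", steps)` yields R1 without recording what the intermediate transformation did.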

The critical constraint: none of this requires the retrieval pipeline to compute anything new. It requires the pipeline to preserve what it already knows and pass it forward instead of discarding it.

Most vector stores already return chunk metadata and similarity scores. The gap is that this information is consumed and discarded at the retrieval step rather than being forwarded into the execution trace. The instrumentation change is a passthrough, not a computation.

The partial contribution problem you raised becomes analyzable with this structure. When the behavioral layer detects a grounding failure on step N, it can look at ancestor_retrievals, pull the retrieval set envelopes, and see whether the failure correlates with low pre_ingestion_quality scores across the contributing chunks, or whether the chunks were individually adequate but the combination was insufficient.

That distinction tells you whether the fix is upstream (re-chunk or re-score) or downstream (change the agent’s retrieval strategy or combination logic).

This is the spec from the retrieval side. Interested to see what the corresponding interface looks like on the behavioral side for consuming these three fields.

The passive consumption model is the right call architecturally. Both systems remain independently deployable, the linking layer is just a join on chunk_set_id in a shared trace store, and neither side needs to know the other exists to function.

Your flag on aggregate quality assuming independence is the most important point in this thread. You are right that a retrieval set where every chunk scores 0.85+ on quality can still produce a grounding failure. The chunks are individually sound but the combination fails, either because they cover overlapping ground without adding new information, or because they address adjacent aspects of the query without any single chunk fully grounding the answer.

That is a different failure mode from low individual chunk quality. It is a set composition failure: the retrieval set lacks coverage diversity even though it has quality density. This means a single aggregate score is insufficient. The retrieval set quality signal needs at least two dimensions:

  1. Quality density: the average pre-ingestion quality of the chunks in the set (what the aggregate score captures today)

  2. Coverage diversity: whether the chunks in the set address distinct aspects of the query or redundantly cover the same ground

A retrieval set with high quality density but low coverage diversity is the specific case you are describing: individually strong chunks that collectively fail to ground the answer because they all say the same thing from slightly different angles.

Computing coverage diversity at the retrieval set level is feasible without additional API calls. Pairwise semantic similarity within the retrieved set gives you a proxy: if all chunks in the set are 0.9+ similar to each other, coverage is low regardless of individual quality. If chunks span 0.3 to 0.7 similarity to each other, coverage is higher because they are addressing different facets of the query.

So the retrieval set quality signal becomes:

{
  chunk_set_id: string,
  quality_density: float,
  coverage_diversity: float,
  set_size: integer,
  redundancy_ratio: float
}

Where redundancy_ratio is the proportion of chunk pairs with similarity above 0.85 within the set, directly measuring how much of the retrieval budget was spent on redundant content. The data-vs-transformation distinction then becomes:

  • High quality density + high coverage diversity + grounding failure = transformation problem (the data was sufficient, the agent misused it)
  • High quality density + low coverage diversity + grounding failure = retrieval composition problem (the data was individually good but collectively insufficient)
  • Low quality density + grounding failure = data problem (fix upstream)
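
The set-level signals and the three-way split above can be sketched as follows. Only the 0.85 redundancy cut-off comes from the text; defining coverage_diversity as one minus the mean pairwise cosine similarity is an assumed proxy, and the `diagnose` thresholds are placeholders. The sketch expects a non-empty set and uses plain lists as embeddings so it has no dependencies.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def set_quality(embeddings, quality_scores, redundant_above=0.85):
    """Compute set-level signals from chunk embeddings and quality scores."""
    pairs = [cosine(a, b) for a, b in combinations(embeddings, 2)]
    mean_sim = sum(pairs) / len(pairs) if pairs else 0.0
    return {
        "quality_density": sum(quality_scores) / len(quality_scores),
        # Assumed proxy: diversity falls as mean pairwise similarity rises.
        "coverage_diversity": 1.0 - mean_sim,
        "set_size": len(embeddings),
        "redundancy_ratio": (
            sum(s > redundant_above for s in pairs) / len(pairs) if pairs else 0.0
        ),
    }


def diagnose(signals, quality_min=0.7, diversity_min=0.3):
    """Three-way split for a run where grounding already failed."""
    if signals["quality_density"] < quality_min:
        return "data problem"
    if signals["coverage_diversity"] < diversity_min:
        return "retrieval composition problem"
    return "transformation problem"
```

The pairwise step is quadratic in set size, which is negligible for typical top-k retrieval sets of a handful of chunks.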

That three-way distinction is cleaner than a binary data-vs-transformation split and directly actionable on both sides. Does that coverage diversity dimension map to anything you are already observing in the behavioral signals? Specifically, can you distinguish cases where the agent received diverse evidence versus redundant evidence from the behavioral trace alone?

To consolidate the two spec sketches into a single reference — here is the canonical version:

PROVENANCE SPEC v1.1

  1. CHUNK IDENTITY
    chunk_id: string
    — Content hash of the original chunk at ingestion time
    — Stored as vector store metadata
    — Deterministic, survives re-embedding

  2. RETRIEVAL SET ENVELOPE
    {
      retrieval_set_id: string (unique per event),
      chunk_set_hash: string (hash of sorted chunk_ids; deterministic, same chunks always produce same hash),
      query_text: string,
      method: string (dense/sparse/hybrid/reranked),
      timestamp: string,
      chunks: [
        {
          chunk_id: string,
          rank: integer,
          similarity_score: float,
          source_document_id: string,
          pre_ingestion_quality: {
            semantic_density: float,
            completeness: float,
            faithfulness: float,
            freshness_status: string,
            conflict_flag: boolean
          }
        }
      ],
      set_quality: {
        quality_density: float,
        coverage_diversity: float,
        redundancy_ratio: float
      }
    }

    retrieval_set_id = unique per event (for logging)
    chunk_set_hash = deterministic per chunk combination (for joining across events)

  3. TRANSFORMATION ANCHOR
    {
      step_id: string,
      ancestor_retrieval_set_ids: [string],
      transformation_type: string
    }

The chunk_set_hash is the join key your behavioral layer would use: it is deterministic, so the same retrieval composition always produces the same hash regardless of when or how many times it was retrieved. The retrieval_set_id is for event-level logging only.

This supersedes both earlier sketches.
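
A short sketch of the two-key design, assuming a UUID for retrieval_set_id and SHA-256 for chunk_set_hash (the spec only requires "unique per event" and "deterministic"). `make_envelope` is an illustrative helper, not an existing API.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def chunk_set_hash(chunk_ids):
    """Deterministic per chunk combination: hash of the sorted chunk_ids."""
    return hashlib.sha256("\n".join(sorted(chunk_ids)).encode("utf-8")).hexdigest()


def make_envelope(query_text, method, chunks, set_quality):
    """Assemble a v1.1 retrieval set envelope from already-known values.

    chunks: list of dicts shaped like the `chunks` entries in the spec.
    """
    return {
        "retrieval_set_id": str(uuid.uuid4()),  # unique per event, for logging
        "chunk_set_hash": chunk_set_hash([c["chunk_id"] for c in chunks]),
        "query_text": query_text,
        "method": method,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "chunks": chunks,
        "set_quality": set_quality,
    }
```

Two retrievals that return the same chunks in a different rank order get different retrieval_set_id values but the same chunk_set_hash, which is what makes the hash usable as a join key across events.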

Clean spec — the two-key design (chunk_set_hash for joining, retrieval_set_id for logging) and the consolidated provenance model make sense as a reference point.

The three-way distinction between data problem, retrieval composition problem, and transformation problem is a meaningful improvement over a binary split. It’s also directly actionable on both sides.

On your question: the behavioral layer cannot currently distinguish diverse evidence from redundant evidence. The closest signals are based on variation across tool outputs, but those capture surface-level differences rather than true semantic coverage. A retrieval set where every chunk says the same thing in different words would still appear diverse from the behavioral trace, while your coverage diversity metric would correctly identify it as redundant.

So that distinction is only visible from the retrieval side, which reinforces the case for the two-dimensional set quality signal you proposed. The behavioral layer tells you that grounding failed; the retrieval side explains whether that failure was due to quality density or coverage diversity. That asymmetry is actually useful — each system contributes something the other cannot produce.

One thing worth noting: the spec treats the retrieval set as a relatively static object. From the behavioral side, failures often depend not just on what was retrieved, but on how parts of that set are selected, ignored, or recombined across steps. A high-quality, high-diversity set can still fail depending on how the agent navigates it. So even with full provenance, attribution doesn’t fully collapse to properties of the retrieved data — the transformation structure matters as well.

The asymmetry confirmation is useful — it means the two systems are not redundant, they are genuinely complementary. Coverage diversity is only visible from the retrieval side. Grounding failure detection is only visible from the behavioral side. Neither can do the other’s job.

Your point about the retrieval set not being static is the right next problem to name. The spec as written treats the retrieval set as a fixed input. But in practice the agent does not consume the full set uniformly: it selects, weighs, ignores, and recombines across steps. A five-chunk retrieval set where the agent only uses two chunks is a different effective input than the same five chunks where all five contribute.

That means the three-way diagnostic needs a fourth dimension: utilisation.

The retrieval set has properties (quality density, coverage diversity). The agent’s use of that set also has properties — specifically, what fraction of the available grounding signal was actually consumed.

If the behavioral layer can emit which tool outputs or intermediate states were derived from which parts of the retrieved content — even approximately — then utilisation becomes measurable:

{
  chunk_set_hash: string,
  quality_density: float,
  coverage_diversity: float,
  utilisation_ratio: float,
  redundancy_ratio: float
}

Where utilisation_ratio is the proportion of chunks in the retrieval set that demonstrably contributed to the agent’s output versus those that were retrieved but ignored.

The diagnostic then becomes:

  • High quality + high diversity + high utilisation + failure = genuine transformation problem. The agent had good data, used it, and still failed. Fix the generation or reasoning layer.
  • High quality + high diversity + low utilisation + failure = navigation problem. The data was sufficient but the agent did not use what was available. Fix the agent’s selection logic.
  • High quality + low diversity + failure = retrieval composition problem. Fix the retrieval strategy.
  • Low quality + failure = data problem. Fix upstream.
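
Written as a decision function, the diagnostic above might look like this. The check order mirrors the list from bottom to top, and every threshold is a placeholder assumption that would need calibration.

```python
def diagnose_failure(quality_density, coverage_diversity, utilisation_ratio,
                     quality_min=0.7, diversity_min=0.3, utilisation_min=0.5):
    """Classify a detected grounding failure. All thresholds are placeholders."""
    if quality_density < quality_min:
        return "data problem: fix upstream"
    if coverage_diversity < diversity_min:
        return "retrieval composition problem: fix the retrieval strategy"
    if utilisation_ratio < utilisation_min:
        return "navigation problem: fix the agent's selection logic"
    return "transformation problem: fix the generation or reasoning layer"
```

Utilisation is checked last because it only carries meaning once the set itself is known to be both good and diverse.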

The utilisation signal is the one that can only come from the behavioral side. Your system sees what the agent actually used. The retrieval side only sees what was offered.

Is utilisation ratio something your behavioral layer could approximate from the execution trace — even coarsely?

The four-dimensional matrix lands well — adding utilisation as a behavioral-side dimension to your retrieval-side quality and coverage signals captures a class of failures that neither layer can describe alone. Navigation problems (high quality, high diversity, low utilisation) are the case I had in mind when flagging that the retrieval set is not consumed uniformly. Naming that as a fourth axis rather than as a refinement of one of the existing three feels right.

To your direct question: yes, the behavioral layer can approximate utilisation_ratio from the execution trace, coarsely. Implementation shipped in Atlas v0.1.5 and Debugger v0.4.0 (today).

The mechanism is a text-overlap proxy: distinctive tokens (4+ chars and numerics, minus stop words) are extracted from each retrieved chunk’s content and from the final response, and chunks with overlap above a 0.30 threshold (chunk-token side) are flagged as used_chunk_ids. The adapter emits this alongside retrieved_ids; the Debugger aggregates it into summary.execution_quality.utilisation as {ratio, used_count, retrieved_count, method: "text_overlap_proxy"}. The proxy method is carried explicitly in the output so consumers know not to treat it as ground truth.
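
For readers who want the shape of the idea, here is an independent sketch of a text-overlap proxy. It is not the Atlas or Debugger code: the tokenizer regex, the stop-word list, and the function names are assumptions, and `used_chunk_ids` is folded into the returned dict only for convenience.

```python
import re

# Hypothetical stop-word list; only words of 4+ characters matter here.
STOP_WORDS = {"that", "this", "with", "from", "have", "were", "which", "their"}


def distinctive_tokens(text: str) -> set[str]:
    """Tokens of 4+ characters plus numeric tokens, minus stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower())
    return {
        t for t in tokens
        if (len(t) >= 4 or t[0].isdigit()) and t not in STOP_WORDS
    }


def utilisation(chunks: dict[str, str], response: str, threshold: float = 0.30) -> dict:
    """Flag chunks whose distinctive tokens overlap the response enough."""
    response_tokens = distinctive_tokens(response)
    used = []
    for chunk_id, content in chunks.items():
        chunk_tokens = distinctive_tokens(content)
        if not chunk_tokens:
            continue
        # Overlap is measured on the chunk-token side.
        overlap = len(chunk_tokens & response_tokens) / len(chunk_tokens)
        if overlap > threshold:
            used.append(chunk_id)
    return {
        "ratio": len(used) / len(chunks) if chunks else 0.0,
        "used_count": len(used),
        "retrieved_count": len(chunks),
        "method": "text_overlap_proxy",
        "used_chunk_ids": used,
    }
```

Dividing by the chunk's token count rather than the response's means a long response cannot dilute the score of a short chunk that was fully used.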

Two limitations are worth naming up front rather than burying in fine print:

  1. Redundant retrieval sets defeat the proxy. When all retrieved chunks share vocabulary (a coverage-diversity failure on your side), the proxy flags all of them as “used” because they all overlap the response on the same tokens. The behavioral layer cannot distinguish “used” from “incidentally overlaps” in this case. Coverage-diversity is genuinely only visible from the retrieval side.

  2. Heavy paraphrase causes false negatives. When the agent rewrites retrieved content with different vocabulary, the proxy underestimates utilisation. This is the opposite failure mode and would require attribution from intermediate LLM steps to address — fragile and prompt-dependent.

A working PoC with two contrasting scenarios is in the Atlas repository:

  • examples/rag_chunk_diagnosis/raw_log.json — five redundant chunks, all describe the same query facet. Proxy returns ratio=1.00 (the limitation case above).
  • examples/rag_chunk_diagnosis/raw_log_navigation.json — five chunks split between numeric data and explanatory analyst notes. The agent uses only the numeric chunks. Proxy returns ratio=0.40 — the navigation case is detected.

Both are constructed scenarios with controlled logs. Worth flagging: when I ran the navigation scenario through gpt-4o-mini against the same context, the model used the analyst notes correctly and produced a grounded answer. The proxy detects navigation failures when they happen, but the failures themselves appear strongly model- and prompt-dependent. The diagnostic is most useful as a regression signal — when an agent that previously navigated well starts ignoring evidence — rather than as a universal failure detector.
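Used as a regression signal, the check compares a run's ratio against the same agent's earlier runs rather than against a fixed bar. A minimal sketch, assuming a stored list of baseline ratios; the function name utilisation_regressed and the 0.25 drop tolerance are placeholders, not Atlas or Debugger settings:

```python
from statistics import mean


def utilisation_regressed(
    baseline_ratios: list[float],
    current_ratio: float,
    max_drop: float = 0.25,  # placeholder tolerance, tune per agent
) -> bool:
    """Return True when utilisation falls well below the agent's own baseline."""
    if not baseline_ratios:
        return False  # no history yet, so nothing to regress against
    return mean(baseline_ratios) - current_ratio > max_drop
```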

Field-level mapping between your Provenance Spec v1.1 and the current Atlas/Debugger output is in examples/rag_chunk_diagnosis/spec_v1_1_mapping.md. The honest summary: chunk_id, retrieval_scores, and used_chunk_ids are now joinable on chunk_id. chunk_set_hash is one line of derivation away. The transformation-anchor block (step_id, ancestor_retrieval_set_ids) is the substantive gap — current telemetry is run-aggregated, not step-keyed, so step-level lineage would be a structural change rather than a new field.

The remaining open question follows from the static-object point I raised earlier: utilisation as I’ve implemented it inherits the same constraint — overlap with the final response, not “how the agent navigated through the set across steps.” A high-utilisation ratio with a final grounding failure could mean the agent used most chunks but combined them incorrectly (transformation problem), or it could mean the proxy is over-counting on a redundant set (proxy artifact). The four-dimensional matrix surfaces these cases, but doesn’t yet cleanly separate them. Is step-level utilisation tracking — where the proxy runs against intermediate states rather than just the final response — a direction you’d want the behavioral layer to extend toward, or does that cross into territory where the retrieval-side instrumentation should carry the lineage instead?

PyPI: pip install -U llm-failure-atlas agent-failure-debugger
GitHub: llm-failure-atlas/examples/rag_chunk_diagnosis at main · kiyoshisasano/llm-failure-atlas · GitHub

Wow, this is substantial. Shipping the implementation with honest limitation documentation and a spec mapping moves this from architecture discussion to integration specification. The text-overlap proxy is the right first-pass approach and the two limitation cases are exactly what I would expect.

On the redundant-set limitation: this is the case where the two-dimensional signal is strictly necessary rather than merely useful. When coverage_diversity is low, the behavioral layer should discount the utilisation_ratio because the proxy cannot distinguish “used” from “incidentally overlaps on shared vocabulary.”
The diagnostic logic becomes:

if coverage_diversity < threshold:
    utilisation_ratio is unreliable
    classify as retrieval composition problem regardless of proxy utilisation value
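The rule above, as a small Python sketch. The function name gated_utilisation, the output keys, and the 0.5 default threshold are hypothetical; neither spec fixes a threshold in this thread. The classification applies to runs that have already failed.

```python
def gated_utilisation(
    utilisation_ratio: float,
    coverage_diversity: float,
    diversity_threshold: float = 0.5,  # placeholder, not a spec value
) -> dict:
    """Trust the proxy only when the retrieval set is diverse enough."""
    trusted = coverage_diversity >= diversity_threshold
    return {
        "utilisation_ratio": utilisation_ratio,
        "utilisation_trusted": trusted,
        # Below the threshold the proxy value is ignored for classification.
        "classification": None if trusted else "retrieval_composition_problem",
    }
```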

That keeps the proxy useful for the cases where it works (diverse retrieval sets) and prevents it from producing false confidence on the cases where it does not (redundant sets). The coverage_diversity signal from the retrieval side effectively gates whether the utilisation signal from the behavioral side should be trusted. Each system tells the other when its signal is valid.

On the paraphrase limitation: this is structurally harder to solve without intermediate-step attribution, which brings us to your question.

My view: step-level lineage should be carried by the retrieval side, not the behavioral side. The reasoning is about where the information originates. The retrieval pipeline knows which chunks were fetched at each step because it executed the retrieval. It can tag each retrieval event with a step_id at the point of execution with minimal structural change: one additional field on the retrieval log entry. The behavioral side would then consume step_id passively, same as it consumes chunk_set_hash.

If the behavioral side tries to infer step-level lineage by running the overlap proxy against intermediate states, it encounters the same paraphrase problem at every step boundary, compounding the error. Each intermediate state is a transformation of the previous one, progressively diverging from the original retrieved vocabulary. The proxy degrades with each step. The retrieval side does not have this problem because it is recording facts (which chunks were fetched at which step) rather than inferring attribution from text similarity. Facts do not degrade across steps.

So the division becomes:

Retrieval side provides:

  • chunk_id (fact)
  • retrieval_set_envelope with step_id (fact)
  • pre_ingestion_quality (fact)
  • coverage_diversity (computed once from facts)

Behavioral side provides:

  • grounding success/failure (observed)
  • utilisation_ratio (proxy — gated by coverage_diversity)
  • error conditions (observed)

The transformation_anchor gap you identified is real but it is a retrieval-side gap, not a behavioral-side gap. Adding step_id to the retrieval event log is a smaller structural change than building step-level attribution into the behavioral trace.
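A sketch of what consuming step_id passively could look like. The RetrievalEvent shape and the used_chunks_by_step helper are hypothetical; only step_id, chunk_id, and used_chunk_ids come from the discussion. The retrieval side records which chunks were fetched at which step, and the behavioral side only joins its used_chunk_ids against that record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalEvent:
    step_id: str                # recorded by the retrieval side at execution time
    chunk_ids: tuple[str, ...]  # chunks fetched in this step


def used_chunks_by_step(
    events: list[RetrievalEvent],
    used_chunk_ids: list[str],
) -> dict[str, list[str]]:
    """Join behavioral-side used_chunk_ids onto retrieval-side step records."""
    used = set(used_chunk_ids)
    return {
        e.step_id: [c for c in e.chunk_ids if c in used]
        for e in events
    }
```

The join does no inference of its own: step membership is a recorded fact, and only the used/unused flag carries the proxy's uncertainty.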

The PoC scenarios and spec mapping are useful. I will update Provenance Spec to v1.2 incorporating the trust-gating logic for utilisation_ratio and the step-level lineage ownership decision. I will reference the Atlas field mapping directly. Is the spec_v1_1_mapping.md stable enough to treat as the integration contract, or is there structural change expected in the next Atlas release?

The trust-gating logic is the right framing. Treating coverage_diversity as a validity gate for utilisation_ratio inverts the usual relationship between two signal sources — instead of combining them, each one tells the other when its measurement is meaningful. That maps cleanly onto the asymmetry we already established: each side has visibility the other does not, and now each side can also tell the other when its visibility is sufficient. I will document this gating logic on the Atlas side so consumers downstream of summary.execution_quality.utilisation know when to discount it.

On step-level lineage ownership: agreed, retrieval-side is the right home. The facts-vs-inference distinction you draw is the cleanest argument for it. The behavioral side running an overlap proxy against intermediate states would compound the paraphrase problem at every step boundary, exactly as you describe. The proxy degrades; recorded facts do not. Carrying step_id through the retrieval event log as a passive field that the behavioral side can join against is structurally smaller and semantically cleaner.

That settles the responsibility split:

  • Retrieval side: chunk_id, step_id, retrieval_set envelope, pre_ingestion_quality, coverage_diversity (facts and computations over facts)
  • Behavioral side: grounding success/failure, utilisation_ratio (gated by coverage_diversity), error conditions (observations and gated proxies)

The transformation_anchor block I flagged as a behavioral-side gap is, by this division, not a behavioral-side concern at all. I will note this in the next mapping document update so it is not misread as a missing feature.

On stability of spec_v1_1_mapping.md as integration contract: the field names, types, and semantics for the retrieval and grounding sections are stable. The names (retrieved_ids, retrieval_scores, mean_retrieval_score, used_chunk_ids, utilisation_method) and their meanings will not change in v0.2.x — additions are possible but breaking changes are not. chunk_set_hash is a planned non-breaking addition (one-line derivation from sorted retrieved_ids); when it lands, the contract extends rather than changes. So: stable enough to treat as the integration contract, with the caveat that future additions may give Spec v1.2 more to reference, not less.
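For reference, one way the derivation from sorted retrieved_ids could look. The hash function (SHA-256) and the newline separator are assumptions here; the final encoding has not been fixed in this thread, so both sides would need to agree on it before joining on the value.

```python
import hashlib


def chunk_set_hash(retrieved_ids: list[str]) -> str:
    """Order-independent digest of a retrieval set (SHA-256 is an assumed choice)."""
    canonical = "\n".join(sorted(retrieved_ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting first makes the hash independent of retrieval order, so two runs that fetch the same chunks in a different order get the same value.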

On a related note — the layered separation you have arrived at (facts on the retrieval side, observations and gated proxies on the behavioral side) maps onto a two-stage observation model I explore as a parent framework called PLD (Phase-Loop Dynamics), where stability is defined over trajectory statistics rather than per-step state. The adapter / matcher separation in Atlas is the runtime instantiation of that boundary; the facts-vs-inference division you drew is the same boundary expressed in integration terms. Not directly relevant to the integration spec — treat it as optional conceptual grounding. Working notes if useful: agent-pld-metrics/docs/theory/control_theory_formulation.md at main · kiyoshisasano/agent-pld-metrics · GitHub

Looking forward to v1.2.

This settles the spec cleanly. The responsibility split is now unambiguous:

Retrieval side: facts and computations over facts

Behavioral side: observations and gated proxies

The stability confirmation on spec_v1_1_mapping.md is the key practical outcome: it means both sides can build against a stable contract without coordination overhead. Non-breaking additions extending the contract rather than changing it is the right versioning discipline.

The text-overlap proxy implementation is a solid first pass. The two limitation cases you named (redundant sets defeating the proxy and heavy paraphrase causing false negatives) are exactly the cases where the trust-gating from coverage_diversity earns its place. The proxy works when the retrieval set is diverse. The retrieval side tells the behavioral side when that condition holds. Clean separation.

The regression signal framing is important and worth emphasising: utilisation_ratio is most valuable as a drift detector for agents that previously navigated well, not as a universal failure classifier. That scopes the signal correctly and avoids overpromising.

I will study the PLD working notes — the trajectory-statistics framing for stability sounds like it formalises the intuition behind the four-way diagnostic in a way that could generalise beyond this specific integration.

Provenance Spec v1.2 will incorporate:

  • Trust-gating logic (coverage_diversity gates utilisation_ratio)
  • Step-level lineage as retrieval-side responsibility
  • Settled responsibility split
  • Atlas v0.1.5 field mapping as integration contract reference
  • chunk_set_hash as planned non-breaking addition

This has been a genuinely productive thread, thank you. The spec exists because two systems built independently needed a bridge — and now the bridge has a stable contract on both sides. Worth revisiting when either side ships the next substantive change against it.