When is it actually a failure? Diagnosing agent behavior beyond LangGraph traces

Agents often:

  • return plausible but incorrect answers
  • continue after tools return no data
  • quietly fall back to general knowledge

LangGraph + tracing tools (LangSmith, etc.) make it easy to see what happened.

But in practice:

it’s still hard to tell whether the behavior is actually a failure.


Example (see screenshot)

In this run:

  • the tool returned no data
  • the agent acknowledged the gap
  • and still produced a general answer

The system evaluates it as:

  • no failure detected
  • risk: LOW → because the agent explicitly disclosed the lack of grounding
  • no fix needed

Failure vs Acceptable (explicit definition)

In this system:

  • Acceptable
    → no data + uncertainty acknowledged

  • Failure
    → no data + confident answer without grounding
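
In code form, the definition above is roughly this (a minimal sketch using the signal names described further down, not the toolchain's actual API):

def classify(tool_provided_data: bool, uncertainty_acknowledged: bool) -> str:
    if tool_provided_data:
        return "acceptable"   # grounded answer, nothing to flag
    if uncertainty_acknowledged:
        return "acceptable"   # no data, but the gap was disclosed
    return "failure"          # no data, confident answer anyway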


Approach

A lightweight diagnostic layer on top of LangGraph:

from adapters.callback_handler import watch

# wrap the compiled graph; diagnosis runs automatically on each invoke
graph = watch(workflow.compile(), auto_diagnose=True)
result = graph.invoke({"messages": [...]})

No changes to the workflow.


How the classification works

Currently rule-based (no LLM) for:

  • deterministic behavior
  • easier debugging
  • no additional model cost

Using runtime signals such as:

  • tool outputs (data vs empty/error)
  • uncertainty patterns (e.g. “I couldn’t find”, “no data available”)
  • execution flow (tool usage vs direct answer)

Example signals:

  • tool_provided_data = False → no grounding
  • uncertainty_acknowledged = True/False
  • answer without data → fallback or hallucination
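
As a rough illustration of how these signals can be derived (the phrase list and input shapes below are assumptions made for the sketch, not the actual implementation):

import re

UNCERTAINTY_PATTERNS = [
    r"couldn't find",
    r"no data (is )?available",
    r"don't have (enough )?information",
]

def extract_signals(tool_outputs: list[str], final_answer: str) -> dict:
    # tool grounding: at least one tool call returned something non-empty and non-error
    tool_provided_data = any(
        out.strip() not in ("", "[]", "null")
        and not out.strip().lower().startswith("error")
        for out in tool_outputs
    )
    # uncertainty: does the final answer contain a hedging phrase?
    answer = final_answer.lower()
    uncertainty_acknowledged = any(re.search(p, answer) for p in UNCERTAINTY_PATTERNS)
    return {
        "tool_provided_data": tool_provided_data,
        "uncertainty_acknowledged": uncertainty_acknowledged,
    }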

What this adds beyond tracing

LangGraph / LangSmith:
→ execution traces

This layer adds:

  • structural interpretation
  • failure classification
  • risk assessment
  • action recommendation

trace → signals → interpretation → risk → action

Example actions:

  • block answer generation
  • retry with different tool
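
As a sketch of how the last steps of that chain could fit together (the labels and action strings are illustrative, not the layer's actual output schema):

from dataclasses import dataclass

@dataclass
class Diagnosis:
    interpretation: str   # e.g. "grounded", "fallback_disclosed", "hallucination"
    risk: str             # e.g. "LOW" / "MEDIUM" / "HIGH"
    action: str

def interpret(signals: dict) -> Diagnosis:
    if signals["tool_provided_data"]:
        return Diagnosis("grounded", "LOW", "none")
    if signals["uncertainty_acknowledged"]:
        # the screenshot case: no data, gap disclosed -> acceptable, no fix needed
        return Diagnosis("fallback_disclosed", "LOW", "none")
    # no data, confident answer -> treat as a failure
    return Diagnosis("hallucination", "HIGH", "block answer generation, retry with a different tool")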

Scope

This example focuses on fallback vs hallucination,
but the same framework extends to:

  • hallucination after tool failure
  • tool call loops
  • retrieval drift in RAG systems
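
A tool-call-loop signal, for example, can start as something as simple as counting repeated identical calls in the trace (a sketch under the same rule-based approach; the tuple shape and threshold are assumptions):

from collections import Counter

def tool_call_loop(tool_calls: list[tuple[str, str]], threshold: int = 3) -> bool:
    # tool_calls: (tool_name, serialized_arguments) pairs in execution order.
    # The signal fires when the same call repeats with identical arguments.
    return any(count >= threshold for count in Counter(tool_calls).values())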

Limitations

This is intentionally heuristic and imperfect:

  • may misclassify nuanced phrasing
  • depends on pattern detection (not full semantic understanding)

Designed as a debugging signal, not a strict evaluation.


Code

If you want to see how this works in practice:

Minimal example (~10 sec, no API key)

Full system:


Closing

The goal is not to be perfectly correct,
but to make failure modes explicit and debuggable.


Looking for feedback

Interested in real-world cases like:

  • hallucination after tool failure
  • silent tool loops
  • confident but irrelevant outputs

How are you distinguishing
“failure” vs “acceptable behavior” in production?

The three failure modes you listed — plausible but wrong, continuing after empty tool returns, falling back to general knowledge — have a common root cause that traces don’t surface. In most cases, the retrieved chunks were technically present but informationally insufficient. The agent didn’t fail because the retriever missed. It failed because what was retrieved looked relevant but lacked the specific content needed to answer correctly.

LangGraph traces show you the retrieval happened. They don’t show you whether the retrieved chunk was semantically dense enough to be useful, or whether the critical information was split across two chunks during ingestion and neither half contains a complete thought. The diagnostic question that’s missing from most agent observability setups:

For each retrieval step that preceded a wrong answer - what was the intrinsic quality of the chunks returned?
Not “were they the most similar?” but “did they actually contain answerable information?” Once you score the chunks themselves, the pattern usually becomes obvious. Certain document sources or certain chunking boundaries produce consistently low-quality chunks.
Those are the specific areas to fix — not the agent architecture, not the prompt, not the retriever parameters.
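
One crude way to start scoring chunks on answerability rather than similarity (purely illustrative; a real scorer would probably need an LLM judge or a trained model):

def chunk_quality(chunk: str, question_terms: set[str]) -> dict:
    # question_terms: lowercased content words from the user question
    text = chunk.strip()
    words = set(text.lower().split())
    return {
        # does the chunk even mention what the question asks about?
        "term_coverage": len(question_terms & words) / max(len(question_terms), 1),
        # very short chunks rarely carry a complete thought
        "too_short": len(text.split()) < 40,
        # a chunk cut mid-sentence hints that ingestion split the information
        "ends_mid_sentence": not text.endswith((".", "!", "?")),
    }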

The agent is often behaving correctly given what it was given. The problem is what it was given!


Good point — I agree that chunk quality is often the underlying issue. In many cases, the agent is behaving correctly given its inputs, but those inputs are insufficient or fragmented.

This is intentionally outside the current scope. The toolchain focuses on agent behavioral signals (e.g., whether the agent used tool outputs, acknowledged uncertainty, or entered loops) rather than evaluating the quality of retrieved content itself. The design is deliberately deterministic and embedding-free: it can detect “the agent answered without grounding”, but not “the grounding existed but was informationally insufficient.”

What you’re describing — scoring chunks based on intrinsic answerability rather than similarity — is a complementary upstream layer.

In that sense, the separation becomes:

  • Upstream (your approach): Diagnose input quality
    → “Did the retrieved chunks actually contain the information needed to answer?”

  • Downstream (this toolchain): Diagnose agent behavior
    → “Given what it received, did the agent respond appropriately?”

Both are necessary.
This layer tells you that a failure occurred after retrieval.
Your layer explains why — the chunks appeared relevant but lacked sufficient informational density.