Agents often:
- return plausible but incorrect answers
- continue after tools return no data
- quietly fall back to general knowledge
LangGraph + tracing tools (LangSmith, etc.) make it easy to see what happened.
But in practice:
it’s still hard to tell whether the behavior is actually a failure.
Example (see screenshot)
In this run:
- the tool returned no data
- the agent acknowledged the gap
- and still produced a general answer
The system evaluates it as:
- no failure detected
- risk: LOW → because the agent explicitly disclosed the lack of grounding
- no fix needed
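For illustration, the structured verdict behind that evaluation might look roughly like this (field names and values here are hypothetical, not the exact schema the debugger emits):

```python
# Hypothetical shape of a per-run diagnosis (illustrative only, not the actual schema)
diagnosis = {
    "failure_detected": False,
    "classification": "acknowledged_fallback",  # vs. "ungrounded_confident_answer"
    "risk": "LOW",                               # agent disclosed the lack of grounding
    "recommended_action": None,                  # no fix needed
    "signals": {
        "tool_provided_data": False,
        "uncertainty_acknowledged": True,
    },
}
```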
Failure vs Acceptable (explicit definition)
In this system:
- Acceptable → no data + uncertainty acknowledged
- Failure → no data + confident answer without grounding
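A minimal sketch of that decision rule, assuming the two boolean signals described later in this post (names are illustrative):

```python
def classify_run(tool_provided_data: bool, uncertainty_acknowledged: bool) -> str:
    """Illustrative rule: grounded answers are fine; ungrounded answers are
    acceptable only if the agent explicitly acknowledged the missing data."""
    if tool_provided_data:
        return "acceptable"   # answer is grounded in tool output
    if uncertainty_acknowledged:
        return "acceptable"   # fallback, but explicitly disclosed
    return "failure"          # confident answer without grounding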
Approach
A lightweight diagnostic layer on top of LangGraph:
```python
from adapters.callback_handler import watch

graph = watch(workflow.compile(), auto_diagnose=True)
result = graph.invoke({"messages": [...]})
```
No changes to the workflow.
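Conceptually, a wrapper like `watch` can sit entirely outside the graph. The sketch below shows that idea in simplified form; it is not the actual implementation from the repo, and the injected `diagnose` callable is a placeholder:

```python
from typing import Any, Callable, Dict


class DiagnosedGraph:
    """Illustrative wrapper: run the compiled graph as-is, then attach a diagnosis."""

    def __init__(self, graph: Any, diagnose: Callable[[Dict], Dict], auto_diagnose: bool = True):
        self._graph = graph
        self._diagnose = diagnose
        self._auto_diagnose = auto_diagnose

    def invoke(self, inputs: Dict, **kwargs) -> Dict:
        result = self._graph.invoke(inputs, **kwargs)
        if self._auto_diagnose:
            # Diagnosis is attached alongside the result; the workflow itself is untouched.
            result["diagnosis"] = self._diagnose(result)
        return result
```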
How the classification works
Currently rule-based (no LLM) for:
- deterministic behavior
- easier debugging
- no additional model cost
Using runtime signals such as:
- tool outputs (data vs empty/error)
- uncertainty patterns (e.g. “I couldn’t find”, “no data available”)
- execution flow (tool usage vs direct answer)
Example signals:
- tool_provided_data = False → no grounding
- uncertainty_acknowledged = True/False
- answer without data → fallback or hallucination
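A sketch of how such signals could be pulled from a run's messages. The phrase list and the flat `{"role", "content"}` message shape are simplified assumptions, not the detector's real pattern set:

```python
import re

# Simplified uncertainty phrases; a real detector would use a larger pattern set.
UNCERTAINTY_PATTERNS = [
    r"i couldn'?t find",
    r"no data (?:was )?available",
    r"i (?:do not|don't) have (?:enough )?information",
]


def extract_signals(messages: list[dict]) -> dict:
    """Derive the boolean signals from a flat list of {'role', 'content'} messages."""
    tool_outputs = [m["content"] for m in messages if m.get("role") == "tool"]
    final_answer = messages[-1]["content"] if messages else ""

    tool_provided_data = any(
        out and out.strip() not in ("", "[]", "null") for out in tool_outputs
    )
    uncertainty_acknowledged = any(
        re.search(p, final_answer, re.IGNORECASE) for p in UNCERTAINTY_PATTERNS
    )
    return {
        "tool_provided_data": tool_provided_data,
        "uncertainty_acknowledged": uncertainty_acknowledged,
    }
```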
What this adds beyond tracing
LangGraph / LangSmith → execution traces
This layer adds:
- structural interpretation
- failure classification
- risk assessment
- action recommendation
trace → signals → interpretation → risk → action
Example actions:
- block answer generation
- retry with different tool
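A sketch of the last step of that pipeline, mapping classification and risk to a suggested action (labels are illustrative):

```python
def recommend_action(classification: str, risk: str) -> str | None:
    """Illustrative mapping from diagnosis to a suggested intervention."""
    if classification == "failure" and risk == "HIGH":
        return "block_answer"               # stop ungrounded output from reaching the user
    if classification == "failure":
        return "retry_with_different_tool"
    return None                             # acceptable behavior: no fix needed
```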
Scope
This example focuses on fallback vs hallucination,
but the same framework extends to:
- hallucination after tool failure
- tool call loops
- retrieval drift in RAG systems
Limitations
This is intentionally heuristic and imperfect:
- may misclassify nuanced phrasing
- depends on pattern detection (not full semantic understanding)
Designed as a debugging signal, not strict evaluation.
Code
If you want to see how this works in practice:
Minimal example (~10 sec, no API key)
Full system:
- Atlas (detection): kiyoshisasano/llm-failure-atlas — a graph-based failure modeling and deterministic detection system for LLM agent runtimes
- Debugger (interpretation + fixes): kiyoshisasano/agent-failure-debugger — a deterministic pipeline that diagnoses, explains, and safely auto-fixes failures in LLM agent systems
Closing
The goal is not to be perfectly correct,
but to make failure modes explicit and debuggable.
Looking for feedback
Interested in real-world cases like:
- hallucination after tool failure
- silent tool loops
- confident but irrelevant outputs
How are you distinguishing "failure" from "acceptable behavior" in production?
