Hey everyone,
I built a CLI tool for testing LangGraph agents with statistical rigor. Sharing here because I think it might be useful for others dealing with the “my agent works sometimes” problem.
**The problem I was trying to solve**
I have a LangGraph agent with a few tools. It works most of the time, but occasionally picks the wrong tool or hallucinates parameters. Running the same test twice gives different results. Standard unit tests don’t really work because the system is non-deterministic.
**What agentrial does**
It runs each test N times (default 10) and gives you:
- Pass rate with confidence intervals (Wilson score)
- Step-level failure attribution (which step diverged between passing and failing runs)
- Real cost tracking from `usage_metadata`
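For anyone curious about the stats: the Wilson score interval is a standard formula for binomial proportions that behaves well at small n. A minimal sketch (not agentrial's actual implementation) reproduces the intervals shown in the example table further down:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (95% for z=1.96)."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

# 9 passes out of 10 trials -> wide interval despite a 90% point estimate
lo, hi = wilson_interval(9, 10)
print(f"{lo:.1%}-{hi:.1%}")  # 59.6%-98.2%
```

The width of that interval is the whole point: 10 trials at 90% pass rate can't distinguish a solid agent from a coin-flippy one.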
**LangGraph integration**
There’s a native adapter that uses LangChain’s callback system to capture trajectories:
```python
from agentrial.runner.adapters import wrap_langgraph_agent
agent = wrap_langgraph_agent(your_compiled_graph)
```
It captures tool calls, observations, and LLM responses with token counts automatically.
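Conceptually, trajectory capture just means recording each tool invocation as a step. A simplified, dependency-free sketch of the idea (hypothetical names; the real adapter hooks into LangChain's callback system rather than wrapping functions):

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class TrajectoryRecorder:
    """Records each tool invocation as a {tool, args, result} step."""
    steps: list = field(default_factory=list)

    def wrap_tool(self, name: str, fn: Callable) -> Callable:
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            result = fn(*args, **kwargs)
            self.steps.append({"tool": name, "args": args, "result": result})
            return result
        return wrapped

recorder = TrajectoryRecorder()
multiply = recorder.wrap_tool("calculate", lambda a, b: a * b)
multiply(6, 7)
print(recorder.steps[0]["tool"], recorder.steps[0]["result"])  # calculate 42
```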
**Example output**
```
┌─────────────────┬────────┬──────────────┬──────────┐
│ Test Case       │ Pass   │ 95% CI       │ Avg Cost │
├─────────────────┼────────┼──────────────┼──────────┤
│ easy-multiply   │ 100.0% │ 72.2%-100.0% │ $0.0005  │
│ tool-selection  │  90.0% │ 59.6%-98.2%  │ $0.0006  │
│ multi-step-task │  70.0% │ 39.7%-89.2%  │ $0.0011  │
└─────────────────┴────────┴──────────────┴──────────┘
Failures: tool-selection (90% pass rate)
  Step 0: called 'calculate' instead of 'lookup_country_info'
```
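Failure attribution works by comparing tool-call sequences across runs and flagging the first divergent step. A hedged sketch of the core idea (not the tool's exact logic):

```python
def first_divergence(passing: list, failing: list):
    """Index of the first step where two tool-call sequences differ, else None."""
    for i, (p, f) in enumerate(zip(passing, failing)):
        if p != f:
            return i
    # One trajectory is a prefix of the other: the divergence is where it ends.
    if len(passing) != len(failing):
        return min(len(passing), len(failing))
    return None

passing = ["lookup_country_info", "final_answer"]
failing = ["calculate", "final_answer"]
print(first_divergence(passing, failing))  # 0
```

That's how the `Step 0` line in the output above gets produced: the failing run picked a different tool at the very first step.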
**Test cases in YAML**
```yaml
suite: my-langgraph-agent
agent: my_module.agent
trials: 10
threshold: 0.85
cases:
  - name: country-lookup
    input:
      query: "What is the capital of Japan?"
    expected:
      output_contains: ["Tokyo"]
      tool_calls:
        - tool: lookup_country_info
```
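The `expected` block boils down to simple membership checks over the final output and the recorded tool calls. Roughly (a hypothetical helper to illustrate the semantics, not the library's API):

```python
def check_case(output: str, tool_calls: list, expected: dict) -> bool:
    """Evaluate an 'expected' block against one run's output and tool calls."""
    # Every required substring must appear in the final output.
    for needle in expected.get("output_contains", []):
        if needle not in output:
            return False
    # Every required tool must have been called at some point.
    for spec in expected.get("tool_calls", []):
        if spec["tool"] not in tool_calls:
            return False
    return True

expected = {
    "output_contains": ["Tokyo"],
    "tool_calls": [{"tool": "lookup_country_info"}],
}
print(check_case("The capital of Japan is Tokyo.", ["lookup_country_info"], expected))  # True
```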
**Links**
- GitHub: https://github.com/alepot55/agentrial
- PyPI: `pip install agentrial`
Tested it with Claude 3 Haiku on a 3-tool agent; 100 trials cost about $0.06.
Would love feedback, especially on what metrics would be useful for your LangGraph workflows. Also happy to answer questions about the implementation.