agentrial: statistical testing for LangGraph agents (open source)

Hey everyone,

I built a CLI tool for testing LangGraph agents with statistical rigor. Sharing here because I think it might be useful for others dealing with the “my agent works sometimes” problem.

**The problem I was trying to solve**

I have a LangGraph agent with a few tools. It works most of the time, but occasionally picks the wrong tool or hallucinates parameters. Running the same test twice gives different results. Standard unit tests don’t really work because the system is non-deterministic.

**What agentrial does**

It runs each test N times (default 10) and gives you:

- Pass rate with confidence intervals (Wilson score)

- Step-level failure attribution (which step diverged between passing and failing runs)

- Real cost tracking from `usage_metadata`
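
The Wilson score interval is what keeps a perfect 10/10 run from being reported as "definitely 100%". For the curious, here is a minimal stdlib sketch of the computation (not agentrial's actual code, just the standard formula):

```python
import math

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a pass rate."""
    if trials == 0:
        return (0.0, 0.0)
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

low, high = wilson_interval(9, 10)
print(f"{low:.1%}-{high:.1%}")  # → 59.6%-98.2%
```

Note how wide the interval is at 10 trials: 9/10 passes only tells you the true pass rate is somewhere between ~60% and ~98%, which is exactly why a single green test run is misleading.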

**LangGraph integration**

There’s a native adapter that uses LangChain’s callback system to capture trajectories:

```python
from agentrial.runner.adapters import wrap_langgraph_agent

agent = wrap_langgraph_agent(your_compiled_graph)
```

It captures tool calls, observations, and LLM responses with token counts automatically.
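
Conceptually, the adapter works through LangChain's callback hooks (`on_tool_start`, `on_tool_end`, `on_llm_end`). Here is a stdlib-only sketch of that idea, not agentrial's actual handler; a real implementation would subclass `langchain_core.callbacks.BaseCallbackHandler`:

```python
class TrajectorySketch:
    """Sketch of a trajectory-capturing callback handler.

    Method names mirror LangChain's callback hooks; a real version would
    subclass langchain_core.callbacks.BaseCallbackHandler.
    """

    def __init__(self):
        self.steps = []        # ordered tool calls and observations
        self.total_tokens = 0  # accumulated from usage metadata

    def on_tool_start(self, serialized, input_str, **kwargs):
        # record which tool the agent chose and with what input
        self.steps.append(("tool_call", serialized.get("name"), input_str))

    def on_tool_end(self, output, **kwargs):
        # record the observation returned to the agent
        self.steps.append(("observation", str(output)))

    def on_llm_end(self, response, **kwargs):
        # accumulate token counts from the response's usage metadata, if present
        usage = getattr(response, "usage_metadata", None) or {}
        self.total_tokens += usage.get("total_tokens", 0)
```

Recording tool calls and observations as an ordered list is what makes the step-level failure attribution possible later: two runs can be diffed step by step.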

**Example output**

```
┌──────────────────┬────────┬──────────────┬──────────┐
│ Test Case        │ Pass   │ 95% CI       │ Avg Cost │
├──────────────────┼────────┼──────────────┼──────────┤
│ easy-multiply    │ 100.0% │ 72.2%-100.0% │ $0.0005  │
│ tool-selection   │  90.0% │ 59.6%-98.2%  │ $0.0006  │
│ multi-step-task  │  70.0% │ 39.7%-89.2%  │ $0.0011  │
└──────────────────┴────────┴──────────────┴──────────┘

Failures: tool-selection (90% pass rate)
  Step 0: called 'calculate' instead of 'lookup_country_info'
```
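
The step attribution shown above can be approximated by comparing each failing trajectory against the most common passing trajectory and reporting the first step that differs. A sketch of that idea (function name is mine, not agentrial's API):

```python
from collections import Counter

def first_divergent_step(passing_runs, failing_run):
    """Return (step index, expected tool, actual tool) where a failing
    trajectory first diverges from the modal passing trajectory, or None."""
    # find the most common tool sequence among the passing runs
    modal = Counter(tuple(run) for run in passing_runs).most_common(1)[0][0]
    for i, (expected, actual) in enumerate(zip(modal, failing_run)):
        if expected != actual:
            return (i, expected, actual)
    return None

passing = [["lookup_country_info", "final_answer"]] * 9
failing = ["calculate", "final_answer"]
print(first_divergent_step(passing, failing))
# → (0, 'lookup_country_info', 'calculate')
```

Using the modal passing trajectory as the reference keeps the attribution robust when a few passing runs take slightly different (but still successful) paths.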

**Test cases in YAML**

```yaml
suite: my-langgraph-agent
agent: my_module.agent
trials: 10
threshold: 0.85

cases:
  - name: country-lookup
    input:
      query: "What is the capital of Japan?"
    expected:
      output_contains: ["Tokyo"]
      tool_calls:
        - tool: lookup_country_info
```
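
The per-trial check this schema implies is deliberately simple. A hedged sketch (field names taken from the YAML above; the function itself is mine, not agentrial's code):

```python
def trial_passes(output: str, tools_called: list[str], expected: dict) -> bool:
    """Sketch of one trial's pass/fail check against an 'expected' block."""
    # every required substring must appear in the final output
    if not all(s in output for s in expected.get("output_contains", [])):
        return False
    # every expected tool must have been called at least once
    wanted = [t["tool"] for t in expected.get("tool_calls", [])]
    return all(t in tools_called for t in wanted)

expected = {
    "output_contains": ["Tokyo"],
    "tool_calls": [{"tool": "lookup_country_info"}],
}
print(trial_passes("The capital of Japan is Tokyo.", ["lookup_country_info"], expected))
# → True
```

Each trial produces one boolean; the pass rate and Wilson interval are then computed over the `trials: 10` repetitions.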

**Links**

- GitHub: https://github.com/alepot55/agentrial

- PyPI: `pip install agentrial`

Tested it with Claude 3 Haiku on a 3-tool agent: 100 trials cost about $0.06.

Would love feedback, especially on what metrics would be useful for your LangGraph workflows. Also happy to answer questions about the implementation.
