agentrial: statistical testing for LangGraph agents (open source)

Hey everyone,

I built a CLI tool for testing LangGraph agents with statistical rigor. Sharing here because I think it might be useful for others dealing with the “my agent works sometimes” problem.

**The problem I was trying to solve**

I have a LangGraph agent with a few tools. It works most of the time, but occasionally picks the wrong tool or hallucinates parameters. Running the same test twice gives different results. Standard unit tests don’t really work because the system is non-deterministic.

**What agentrial does**

It runs each test N times (default 10) and gives you:

- Pass rate with confidence intervals (Wilson score)

- Step-level failure attribution (which step diverged between passing and failing runs)

- Real cost tracking from `usage_metadata`
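
The Wilson score interval is what keeps a perfect 10/10 run from being reported as "definitely 100%". For the curious, here is a minimal stdlib sketch of the computation (not agentrial's actual code, just the standard formula):

```python
import math

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a pass rate."""
    if trials == 0:
        return (0.0, 0.0)
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

low, high = wilson_interval(9, 10)
print(f"{low:.1%}-{high:.1%}")  # → 59.6%-98.2%
```

Note how wide the interval is at 10 trials: 9/10 passes only tells you the true pass rate is somewhere between ~60% and ~98%, which is exactly why a single green test run is misleading.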

**LangGraph integration**

There’s a native adapter that uses LangChain’s callback system to capture trajectories:

```python
from agentrial.runner.adapters import wrap_langgraph_agent

agent = wrap_langgraph_agent(your_compiled_graph)
```

It captures tool calls, observations, and LLM responses with token counts automatically.
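
Conceptually, the adapter works through LangChain's callback hooks (`on_tool_start`, `on_tool_end`, `on_llm_end`). Here is a stdlib-only sketch of that idea, not agentrial's actual handler; a real implementation would subclass `langchain_core.callbacks.BaseCallbackHandler`:

```python
class TrajectorySketch:
    """Sketch of a trajectory-capturing callback handler.

    Method names mirror LangChain's callback hooks; a real version would
    subclass langchain_core.callbacks.BaseCallbackHandler.
    """

    def __init__(self):
        self.steps = []        # ordered tool calls and observations
        self.total_tokens = 0  # accumulated from usage metadata

    def on_tool_start(self, serialized, input_str, **kwargs):
        # record which tool the agent chose and with what input
        self.steps.append(("tool_call", serialized.get("name"), input_str))

    def on_tool_end(self, output, **kwargs):
        # record the observation returned to the agent
        self.steps.append(("observation", str(output)))

    def on_llm_end(self, response, **kwargs):
        # accumulate token counts from the response's usage metadata, if present
        usage = getattr(response, "usage_metadata", None) or {}
        self.total_tokens += usage.get("total_tokens", 0)
```

Recording tool calls and observations as an ordered list is what makes the step-level failure attribution possible later: two runs can be diffed step by step.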

**Example output**

```
┌──────────────────┬────────┬──────────────┬──────────┐
│ Test Case        │ Pass   │ 95% CI       │ Avg Cost │
├──────────────────┼────────┼──────────────┼──────────┤
│ easy-multiply    │ 100.0% │ 72.2%-100.0% │ $0.0005  │
│ tool-selection   │  90.0% │ 59.6%-98.2%  │ $0.0006  │
│ multi-step-task  │  70.0% │ 39.7%-89.2%  │ $0.0011  │
└──────────────────┴────────┴──────────────┴──────────┘

Failures: tool-selection (90% pass rate)
  Step 0: called 'calculate' instead of 'lookup_country_info'
```
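
The step attribution shown above can be approximated by comparing each failing trajectory against the most common passing trajectory and reporting the first step that differs. A sketch of that idea (function name is mine, not agentrial's API):

```python
from collections import Counter

def first_divergent_step(passing_runs, failing_run):
    """Return (step index, expected tool, actual tool) where a failing
    trajectory first diverges from the modal passing trajectory, or None."""
    # find the most common tool sequence among the passing runs
    modal = Counter(tuple(run) for run in passing_runs).most_common(1)[0][0]
    for i, (expected, actual) in enumerate(zip(modal, failing_run)):
        if expected != actual:
            return (i, expected, actual)
    return None

passing = [["lookup_country_info", "final_answer"]] * 9
failing = ["calculate", "final_answer"]
print(first_divergent_step(passing, failing))
# → (0, 'lookup_country_info', 'calculate')
```

Using the modal passing trajectory as the reference keeps the attribution robust when a few passing runs take slightly different (but still successful) paths.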

**Test cases in YAML**

```yaml
suite: my-langgraph-agent
agent: my_module.agent
trials: 10
threshold: 0.85

cases:
  - name: country-lookup
    input:
      query: "What is the capital of Japan?"
    expected:
      output_contains: ["Tokyo"]
      tool_calls:
        - tool: lookup_country_info
```
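
The per-trial check this schema implies is deliberately simple. A hedged sketch (field names taken from the YAML above; the function itself is mine, not agentrial's code):

```python
def trial_passes(output: str, tools_called: list[str], expected: dict) -> bool:
    """Sketch of one trial's pass/fail check against an 'expected' block."""
    # every required substring must appear in the final output
    if not all(s in output for s in expected.get("output_contains", [])):
        return False
    # every expected tool must have been called at least once
    wanted = [t["tool"] for t in expected.get("tool_calls", [])]
    return all(t in tools_called for t in wanted)

expected = {
    "output_contains": ["Tokyo"],
    "tool_calls": [{"tool": "lookup_country_info"}],
}
print(trial_passes("The capital of Japan is Tokyo.", ["lookup_country_info"], expected))
# → True
```

Each trial produces one boolean; the pass rate and Wilson interval are then computed over the `trials: 10` repetitions.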

**Links**

- GitHub: https://github.com/alepot55/agentrial

- PyPI: `pip install agentrial`

Tested it with Claude 3 Haiku on a 3-tool agent: 100 trials cost about $0.06.

Would love feedback, especially on what metrics would be useful for your LangGraph workflows. Also happy to answer questions about the implementation.
