LLM Evaluation Tooling

agenteval

Behavioral eval for agents: replaces brittle exact-match asserts with repeated-run pass-rate scoring for CI gates.

Pass-rate: statistical scoring, not assert-equals

AsyncIO · OpenAI SDK · Anthropic SDK · LangChain · Typer

GitHub ↗ · PyPI ↗

Run a test N times; pass if the success rate clears a threshold
Collect-then-raise behavioral assertions on the agent's trace
Traces every tool call — name, args, result, timing, steps
Typer CLI with JSON reports + exit codes for CI gates

The problem

Exact-match assertions don't survive non-determinism

Given the same prompt, an agent can produce different tool sequences, wording, and conclusions from one run to the next. So the moment you write assert result == "expected", you've already lost: the test is flaky by construction.

Real agents are reliable statistically: right 85% of the time, say, rather than every time. agenteval tests for exactly that, a pass rate over repeated runs, which turns “it felt worse this week” into a number you can gate a CI pipeline on.
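The gate itself is simple arithmetic. Here is a minimal sketch of it, assuming the threshold comparison is inclusive; GateResult and score are hypothetical names used for illustration, not part of agenteval's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """Outcome of scoring one test over repeated runs (hypothetical type)."""
    passed_runs: int
    total_runs: int
    threshold: float

    @property
    def pass_rate(self) -> float:
        return self.passed_runs / self.total_runs

    @property
    def passed(self) -> bool:
        # Assumption: meeting the threshold exactly counts as a pass.
        return self.pass_rate >= self.threshold


def score(outcomes: list[bool], threshold: float) -> GateResult:
    """Turn per-run pass/fail outcomes into a single gate decision."""
    if not outcomes:
        raise ValueError("need at least one run to compute a pass rate")
    return GateResult(sum(outcomes), len(outcomes), threshold)


# 17 passing runs and 3 failing runs, gated at 0.85.
result = score([True] * 17 + [False] * 3, threshold=0.85)
print(f"{result.passed_runs}/{result.total_runs} = {result.pass_rate:.2f}, passed: {result.passed}")
```

One more failing run (16 of 20, or 0.80) would drop the same test below the gate, which is the kind of regression a single exact-match assert cannot express.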

How it works

Trace, run N times, score, report

instrument

Tracer

Wrap tools with tracer.wrap() / @tracer.tool — records name, args, result, timing, exceptions per call.

repeat

Runner

Executes the test function N times concurrently, capturing an AgentTrace for each run.

judge

Assertions

A fluent chain collects every failure before raising — behavioral checks on the trace, not the string.

gate

Reporter

Terminal summary + JSON export; pass rate vs. threshold becomes a CI exit code.
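The first two stages can be sketched in a few lines of asyncio. This is an illustration of the idea only, not agenteval's implementation: SketchTracer, ToolCall, and run_n are hypothetical names, and the real Tracer and Runner record more (steps, inputs, outputs) than this toy version does.

```python
import asyncio
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Optional


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical record type)."""
    name: str
    args: tuple
    kwargs: dict
    result: Any = None
    error: Optional[str] = None
    seconds: float = 0.0


@dataclass
class SketchTracer:
    """Toy stand-in for a tracer: records every call made through wrap()."""
    calls: list[ToolCall] = field(default_factory=list)

    def wrap(self, fn: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(fn)
        async def wrapped(*args: Any, **kwargs: Any) -> Any:
            call = ToolCall(fn.__name__, args, kwargs)
            start = time.perf_counter()
            try:
                call.result = await fn(*args, **kwargs)
                return call.result
            except Exception as exc:
                call.error = repr(exc)  # keep the failure on the trace, then re-raise
                raise
            finally:
                call.seconds = time.perf_counter() - start
                self.calls.append(call)
        return wrapped


async def run_n(test: Callable[[SketchTracer], Awaitable[None]], n: int) -> list[bool]:
    """Run `test` n times concurrently, each with a fresh tracer."""
    async def one_run() -> bool:
        try:
            await test(SketchTracer())
            return True
        except AssertionError:
            return False  # a failed assertion fails this run, not the whole suite
    return list(await asyncio.gather(*(one_run() for _ in range(n))))


async def web_search(query: str) -> str:
    """Toy tool standing in for a real search call."""
    await asyncio.sleep(0)
    return f"results for {query}"


async def demo_test(tracer: SketchTracer) -> None:
    search = tracer.wrap(web_search)
    await search("query")
    assert [c.name for c in tracer.calls] == ["web_search"]


outcomes = asyncio.run(run_n(demo_test, n=5))
print(f"pass rate: {sum(outcomes) / len(outcomes):.2f}")
```

Giving each run its own tracer keeps the traces independent, so concurrent runs cannot interleave their tool calls in one record.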

The API

A test is a behavior and a threshold

a pass-rate test

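import agenteval
from agenteval import Tracer

# web_search (a tool) and my_agent (the agent under test) are your own code, defined elsewhere.

@agenteval.test(n=20, threshold=0.85, tags=["search"])
async def test_agent(tracer: Tracer) -> None:
    # Wrap the tool so every call to it is recorded on the trace.
    search = tracer.wrap(web_search)
    async with tracer.run(input="query") as run:
        result = await my_agent("query", search=search)
        run.set_output(result)
    # Behavioral checks on the recorded trace; check() raises if any of them failed.
    tracer.assert_that().called_tool("web_search").no_errors().check()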

behavioral assertions are chainable + collected

tracer.assert_that() \
    .called_tool("search") \
    .tool_call_count("search", min=1, max=3) \
    .completed_within_steps(8) \
    .completed_within_seconds(15.0) \
    .response_matches_schema(MyPydanticModel) \
    .no_errors() \
    .check()

collect-then-raise

The chain doesn't fail on the first error — it gathers all of them, so a single run tells you every expectation that broke, not just the first one.
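The collect-then-raise pattern is easy to sketch on its own. The class below is a hypothetical, much-reduced version that works on a plain list of tool names; it borrows the method names from the chain above but is not agenteval's code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchAssertions:
    """Toy fluent chain: each check records a failure instead of raising."""
    tool_calls: list[str]
    failures: list[str] = field(default_factory=list)

    def called_tool(self, name: str) -> "SketchAssertions":
        if name not in self.tool_calls:
            self.failures.append(f"expected a call to {name!r}, saw none")
        return self  # returning self is what makes the chain fluent

    def tool_call_count(
        self, name: str, min: int = 0, max: Optional[int] = None
    ) -> "SketchAssertions":
        count = self.tool_calls.count(name)
        if count < min or (max is not None and count > max):
            self.failures.append(f"{name!r} called {count} times, outside [{min}, {max}]")
        return self

    def check(self) -> None:
        # Only here do the collected failures turn into a single exception.
        if self.failures:
            raise AssertionError("; ".join(self.failures))


chain = SketchAssertions(tool_calls=["search"] * 4)
try:
    chain.called_tool("fetch").tool_call_count("search", min=1, max=3).check()
except AssertionError as exc:
    print(exc)  # both the missing "fetch" call and the excess "search" calls are reported
```

Because nothing raises until check(), the order of the checks does not decide which failures you get to see.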

See it run

A pass-rate gate in CI

Twenty runs, a 0.85 threshold. Three runs fail an assertion, but 17 of 20 is a pass rate of exactly 0.85, so the suite still passes the gate, and you get the failing traces to debug.

$ agenteval run tests/ --n 20 --threshold 0.85

In CI

Run, gate, report

cli

$ agenteval run tests/ --n 10 --threshold 0.9 --traces --output report.json
$ agenteval report report.json

Exit codes: 0 pass · 1 fail · 2 error — drop it straight into a CI step.
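How a pass rate might map onto those three codes can be sketched as follows. The function and constant names are hypothetical, and treating "error" as "the suite itself could not run" is an assumption about what exit code 2 means, not something the CLI documents here.

```python
import sys
from typing import Callable

EXIT_PASS, EXIT_FAIL, EXIT_ERROR = 0, 1, 2


def exit_code_for(run_suite: Callable[[], float], threshold: float) -> int:
    """Map a suite's pass rate to a process exit code."""
    try:
        pass_rate = run_suite()
    except Exception as exc:
        # Assumption: a crash in the harness is an error, distinct from a failed gate.
        print(f"error: {exc}", file=sys.stderr)
        return EXIT_ERROR
    return EXIT_PASS if pass_rate >= threshold else EXIT_FAIL


def broken_suite() -> float:
    raise RuntimeError("could not collect tests")


codes = [
    exit_code_for(lambda: 0.95, threshold=0.9),  # clears the gate
    exit_code_for(lambda: 0.80, threshold=0.9),  # below the gate
    exit_code_for(broken_suite, threshold=0.9),  # the suite never produced a rate
]
print(codes)
```

Keeping "fail" and "error" on separate codes lets a CI step tell a real reliability regression apart from a broken test setup.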

What I built

Testing built for how agents actually behave

Pass-rate, not pass/fail

An agent that's right 85% of the time is a known quantity; a single run that happens to pass is luck. Tests assert a reliability threshold over N runs, so you can track regressions instead of chasing flakes.

Collect-then-raise assertions

A fluent assert_that() chain gathers every failure across the run before raising — so one failed expectation doesn't hide the other three. You see the whole picture per run, not just the first crash.

Behavioral, not string-matching

Assert on what the agent did: called_tool, tool_call_count(min/max), completed_within_steps/seconds, response_matches_schema, no_errors — the trace, not the exact wording.

CI-native, framework-agnostic

Typer CLI with JSON reports and exit codes (0 pass / 1 fail / 2 error). OpenAI, Anthropic, and LangChain adapters wrap existing tools without changing the agent.

Get started

Install from PyPI

pip

$ pip install agenteval-py
$ pip install "agenteval-py[all]"

Python 3.11+ · adapters for OpenAI, Anthropic, and LangChain.
