LLM Evaluation Tooling
agenteval
Behavioral eval for agents: replaces brittle exact-match asserts with repeated-run pass-rate scoring for CI gates.
pass-rate
statistical scoring, not assert-equals
- Run a test N times; pass if the success rate clears a threshold
- Collect-then-raise behavioral assertions on the agent's trace
- Traces every tool call — name, args, result, timing, steps
- Typer CLI with JSON reports + exit codes for CI gates
The problem
Exact-match assertions don't survive non-determinism
The same prompt gives an agent different tool sequences, wording, and conclusions on every run. So the moment you write assert result == "expected", you've already lost — the test is flaky by construction.
Real agents are reliable statistically: right 85% of the time, not always. agenteval tests for exactly that, a pass rate over repeated runs, which turns “it felt worse this week” into a number you can gate a CI pipeline on.
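As a rough illustration of pass-rate scoring, here is a minimal sketch that uses only the standard library. The names pass_rate, clears_gate, and flaky_agent_test are hypothetical and are not part of agenteval; the inclusive comparison against the threshold is also an assumption.

```python
import random
from typing import Callable


def pass_rate(test: Callable[[], bool], n: int) -> float:
    """Run `test` n times and return the fraction of runs that passed."""
    if n <= 0:
        raise ValueError("n must be positive")
    passed = sum(1 for _ in range(n) if test())
    return passed / n


def clears_gate(rate: float, threshold: float) -> bool:
    # Assumption: the comparison is inclusive, so a rate equal to the threshold passes.
    return rate >= threshold


# Hypothetical stand-in for a non-deterministic agent: succeeds about 90% of the time.
rng = random.Random(0)  # seeded so the sketch is reproducible


def flaky_agent_test() -> bool:
    return rng.random() < 0.9


rate = pass_rate(flaky_agent_test, n=20)
print(f"pass rate: {rate:.2f}, gate passed: {clears_gate(rate, 0.85)}")
```

A single failing run no longer fails the test on its own; only the aggregate rate is compared against the threshold.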
How it works
Trace, run N times, score, report
Tracer
Wrap tools with tracer.wrap() / @tracer.tool — records name, args, result, timing, exceptions per call.
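To make the idea of wrapping a tool concrete, here is a small sketch of a recording wrapper. MiniTracer and ToolCall are hypothetical stand-ins written for this example, not agenteval's Tracer, and the real implementation may record more or differently.

```python
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class ToolCall:
    name: str
    args: tuple
    kwargs: dict
    result: Any = None
    error: Optional[str] = None
    seconds: float = 0.0


@dataclass
class MiniTracer:
    """Hypothetical stand-in for a tracer; records one ToolCall per invocation."""
    calls: List[ToolCall] = field(default_factory=list)

    def wrap(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            call = ToolCall(name=fn.__name__, args=args, kwargs=kwargs)
            start = time.perf_counter()
            try:
                call.result = fn(*args, **kwargs)
                return call.result
            except Exception as exc:
                call.error = repr(exc)  # record the failure, then re-raise it
                raise
            finally:
                call.seconds = time.perf_counter() - start
                self.calls.append(call)
        return wrapper


def web_search(query: str) -> List[str]:
    # Hypothetical tool used only for this example.
    return [f"result for {query}"]


tracer = MiniTracer()
search = tracer.wrap(web_search)
search("agent testing")
print(tracer.calls[0].name, tracer.calls[0].args)
```

The agent keeps calling the wrapped function as before; the recording happens as a side effect of each call.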
Runner
Executes the test function N times concurrently, capturing an AgentTrace for each run.
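One plausible way to run a test function N times concurrently is asyncio.gather. The sketch below assumes that approach; run_n_times and sample_test are hypothetical names, and agenteval's actual runner may be built differently.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def run_n_times(
    test: Callable[[int], Awaitable[None]], n: int
) -> List[Optional[BaseException]]:
    """Run `test` n times concurrently; None marks a passing run."""
    # return_exceptions=True returns failures as values instead of
    # propagating the first exception out of gather().
    results = await asyncio.gather(
        *(test(i) for i in range(n)), return_exceptions=True
    )
    return [r if isinstance(r, BaseException) else None for r in results]


async def sample_test(run_index: int) -> None:
    # Hypothetical test body: every fifth run fails, to mimic a flaky agent.
    await asyncio.sleep(0)
    if run_index % 5 == 0:
        raise AssertionError(f"run {run_index} missed an expectation")


outcomes = asyncio.run(run_n_times(sample_test, n=20))
passed = sum(1 for outcome in outcomes if outcome is None)
print(f"{passed}/{len(outcomes)} runs passed")
```

Keeping each run's exception as a value means the failing runs stay available for inspection after the pass rate is computed.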
Assertions
A fluent chain collects every failure before raising — behavioral checks on the trace, not the string.
Reporter
Terminal summary + JSON export; pass rate vs. threshold becomes a CI exit code.
The API
A test is a behavior and a threshold
import agenteval
from agenteval import Tracer

@agenteval.test(n=20, threshold=0.85, tags=["search"])
async def test_agent(tracer: Tracer) -> None:
    search = tracer.wrap(web_search)
    async with tracer.run(input="query") as run:
        result = await my_agent("query", search=search)
        run.set_output(result)
    tracer.assert_that().called_tool("web_search").no_errors().check()

tracer.assert_that() \
    .called_tool("search") \
    .tool_call_count("search", min=1, max=3) \
    .completed_within_steps(8) \
    .completed_within_seconds(15.0) \
    .response_matches_schema(MyPydanticModel) \
    .no_errors() \
    .check()
collect-then-raise
The chain doesn't fail on the first error — it gathers all of them, so a single run tells you every expectation that broke, not just the first one.
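The collect-then-raise pattern can be sketched in a few lines. TraceChecks below is a hypothetical miniature that only looks at tool names; it borrows the method names called_tool, tool_call_count, and check from the chain above but is not agenteval's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TraceChecks:
    """Hypothetical mini assertion chain over the tool names seen in one run."""
    tool_names: List[str]
    failures: List[str] = field(default_factory=list)

    def called_tool(self, name: str) -> "TraceChecks":
        if name not in self.tool_names:
            self.failures.append(f"expected a call to {name!r}")
        return self  # returning self is what makes the chain fluent

    def tool_call_count(self, name: str, min: int, max: int) -> "TraceChecks":
        count = self.tool_names.count(name)
        if not (min <= count <= max):
            self.failures.append(
                f"{name!r} called {count} times, expected {min}..{max}"
            )
        return self

    def check(self) -> None:
        # Nothing is raised until here, so every failure has been collected.
        if self.failures:
            raise AssertionError("; ".join(self.failures))


checks = TraceChecks(tool_names=["search", "search", "search", "search"])
message = ""
try:
    checks.called_tool("summarize").tool_call_count("search", min=1, max=3).check()
except AssertionError as exc:
    message = str(exc)
print(message)
```

Both broken expectations appear in the single AssertionError, because each method records its failure and moves on rather than raising immediately.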
See it run
A pass-rate gate in CI
Twenty runs, a 0.85 threshold. Three runs fail an assertion — but the suite still passes the gate, and you get the failing traces to debug.
agenteval run tests/ --n 20 --threshold 0.85
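The arithmetic behind that scenario is worth spelling out, since 17 passing runs out of 20 lands exactly on the threshold. The sketch below assumes the gate uses an inclusive comparison, which the description implies but does not state.

```python
from fractions import Fraction

runs = 20
failed = 3
threshold = Fraction("0.85")

# Exact rational arithmetic avoids any floating-point rounding at the boundary.
rate = Fraction(runs - failed, runs)
passes_gate = rate >= threshold  # assumption: inclusive comparison

print(f"{runs - failed}/{runs} = {float(rate):.2f}, passes gate: {passes_gate}")
```

A fourth failing run would drop the rate to 16/20 = 0.80 and fail the gate.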
In CI
Run, gate, report
$ agenteval run tests/ --n 10 --threshold 0.9 --traces --output report.json
$ agenteval report report.json
Exit codes: 0 pass · 1 fail · 2 error — drop it straight into a CI step.
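The mapping from a report to those three exit codes might look like the sketch below. The report fields (runs, passed, threshold) and the function exit_code_for are assumptions made for illustration; the actual schema of report.json is not shown here.

```python
import json
from enum import IntEnum


class ExitCode(IntEnum):
    PASS = 0
    FAIL = 1
    ERROR = 2


def exit_code_for(report_json: str) -> ExitCode:
    """Map a report to an exit code. The report fields here are hypothetical."""
    try:
        report = json.loads(report_json)
        rate = report["passed"] / report["runs"]
        threshold = report["threshold"]
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        return ExitCode.ERROR  # unreadable or incomplete report
    return ExitCode.PASS if rate >= threshold else ExitCode.FAIL


code = exit_code_for('{"runs": 10, "passed": 9, "threshold": 0.9}')
print(int(code))
# A CLI entry point would typically finish with: sys.exit(int(code))
```

Separating "fail" from "error" lets a CI step distinguish an agent that regressed from a run that could not be evaluated at all.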
What I built
Testing built for how agents actually behave
Pass-rate, not pass/fail
An agent that's right 85% of the time is a known quantity; a single run that happens to pass is luck. Tests assert a reliability threshold over N runs, so you can track regressions instead of chasing flakes.
Collect-then-raise assertions
A fluent assert_that() chain gathers every failure across the run before raising — so one failed expectation doesn't hide the other three. You see the whole picture per run, not just the first crash.
Behavioral, not string-matching
Assert on what the agent did: called_tool, tool_call_count(min/max), completed_within_steps/seconds, response_matches_schema, no_errors — the trace, not the exact wording.
CI-native, framework-agnostic
Typer CLI with JSON reports and exit codes (0 pass / 1 fail / 2 error). OpenAI, Anthropic, and LangChain adapters wrap existing tools without changing the agent.
Get started
Install from PyPI
$ pip install agenteval-py
$ pip install "agenteval-py[all]"
Python 3.11+ · adapters for OpenAI, Anthropic, and LangChain.