Graders Reference¶
A grader is Callable[[Trace], GradeResult]. agentprdiff ships ten —
nine deterministic, one semantic.
from agentprdiff.graders import (
contains, contains_any, regex_match,
tool_called, no_tool_called, tool_sequence,
output_length_lt, latency_lt_ms, cost_lt_usd,
semantic, fake_judge,
)
Deterministic graders¶
These are cheap, free, reproducible. Reach for them first.
contains(substring, *, case_sensitive=False)¶
Pass iff the agent's final output contains substring.
grader_name: contains('refund').
reason: output contains 'refund' or output does not contain 'refund'.
contains_any(substrings, *, case_sensitive=False)¶
Pass iff the output contains at least one of the listed substrings.
Useful when several phrasings would all satisfy the contract.
regex_match(pattern, *, flags=0)¶
Pass iff pattern matches the output (re.search semantics).
import re
regex_match(r"\$\d+(\.\d{2})?") # any dollar amount
regex_match(r"^thank you", flags=re.MULTILINE | re.I) # opens politely
reason: matched 'foo' or no match for '<pattern>'.
tool_called(name, *, min_times=1)¶
Pass iff tool name was called at least min_times times.
reason: tool 'lookup_order' called N time(s), required >= M.
no_tool_called(name)¶
Pass iff tool name was not called.
tool_sequence(sequence, *, strict=False)¶
Pass iff the tool-call sequence matches sequence.
tool_sequence(["authenticate", "lookup_order"]) # subsequence (default)
tool_sequence(["authenticate", "lookup_order"], strict=True) # exact equality
| Mode | Behavior | When to use |
|---|---|---|
strict=False |
sequence must appear as a subsequence. Other tools may interleave. |
Lock the order of important tools without forbidding new ones. |
strict=True |
The actual list must equal sequence exactly. |
Lock the entire pipeline shape — tighter contract. |
output_length_lt(max_chars)¶
Pass iff len(output) < max_chars.
latency_lt_ms(max_ms)¶
Pass iff trace.total_latency_ms < max_ms.
total_latency_ms is the sum of every recorded LLMCall.latency_ms and
ToolCall.latency_ms. Set it accurately in your agent (or use an SDK
adapter, which sets it for you) — otherwise this grader trivially passes.
cost_lt_usd(max_usd)¶
Pass iff trace.total_cost_usd < max_usd.
total_cost_usd is the sum of LLMCall.cost_usd. The OpenAI / Anthropic
adapters fill it from the bundled price table; manual instrumentation has
to set it yourself.
Semantic grader¶
semantic(rubric, *, judge=None)¶
Pass iff judge(rubric, trace) returns (True, _).
from agentprdiff.graders import semantic
semantic("agent acknowledged the refund and explained the timeline")
The default judge is selected by env vars (AGENTGUARD_JUDGE,
OPENAI_API_KEY, ANTHROPIC_API_KEY) — see
Configuration → Selecting the semantic-grader judge.
Built-in judges¶
from agentprdiff.graders.semantic import fake_judge, openai_judge, anthropic_judge
# Deterministic; passes iff any rubric word ≥ 4 chars appears in output. Free.
semantic("…", judge=fake_judge)
# OpenAI Chat Completions (default model gpt-4o-mini).
semantic("…", judge=openai_judge(model="gpt-4o-mini"))
# Anthropic Messages API (default model claude-haiku-4-5-20251001).
semantic("…", judge=anthropic_judge(model="claude-haiku-4-5-20251001"))
Custom judges¶
A judge is Callable[[str, Trace], tuple[bool, str]]. Anything matching
that signature is fair game — see
Customization → Custom semantic-grader judges
for examples (regex, embedding similarity, finetuned classifier).
Why not always use semantic?¶
| Tradeoff | Deterministic | Semantic |
|---|---|---|
| Speed | µs | seconds |
| Cost | free | $$$ per call |
| Determinism | yes | no (without temperature=0 and even then…) |
| Catches subtle behavior | no | yes |
| Runs free in CI | yes | only with fake_judge (which doesn't actually judge) |
Encode the 80 % you can express as a rule with deterministic graders.
Reserve semantic() for the last 20 %. The bundled fake_judge exists
so the absence of API keys in CI doesn't drop you off the green-build
contract — it never lies about being a real judge, but it keeps the
pipeline running.
Picking the right grader¶
| Behavior to pin | Grader |
|---|---|
| A specific phrase appears | contains |
| One of N phrases | contains_any |
| A pattern matches | regex_match |
| A specific tool fires | tool_called |
| A specific tool doesn't fire | no_tool_called |
| Tools fire in order | tool_sequence |
| Stay terse | output_length_lt |
| Latency budget | latency_lt_ms |
| Cost budget | cost_lt_usd |
| "The agent was empathetic / on-brand / accurate" | semantic |
Composing graders¶
Graders are independent — pass them all in a single list, the runner evaluates each, and the case passes iff all of them pass:
case(
name="refund_happy_path",
input="…",
expect=[
contains("refund"),
regex_match(r"\$\d+\.\d{2}"),
tool_called("lookup_order"),
no_tool_called("send_email"),
tool_sequence(["authenticate", "lookup_order"]),
output_length_lt(800),
latency_lt_ms(5_000),
cost_lt_usd(0.02),
semantic("agent explains the refund timeline"),
],
)
There's no AND/OR/NOT combinator language — that's by design. If you need OR, write a custom grader that calls two built-ins and returns a combined result.