Graders Reference

A grader is Callable[[Trace], GradeResult]. agentprdiff ships ten — nine deterministic, one semantic.

from agentprdiff.graders import (
    contains, contains_any, regex_match,
    tool_called, no_tool_called, tool_sequence,
    output_length_lt, latency_lt_ms, cost_lt_usd,
    semantic, fake_judge,
)

Deterministic graders

These are fast, free, and reproducible. Reach for them first.

`contains(substring, *, case_sensitive=False)`

Pass iff the agent's final output contains substring.

contains("refund")                 # case-insensitive (default)
contains("REFUND", case_sensitive=True)

grader_name: contains('refund'). reason: output contains 'refund' or output does not contain 'refund'.

`contains_any(substrings, *, case_sensitive=False)`

Pass iff the output contains at least one of the listed substrings.

contains_any(["refund", "credit", "compensation"])

Useful when several phrasings would all satisfy the contract.

`regex_match(pattern, *, flags=0)`

Pass iff pattern matches the output (re.search semantics).

import re
regex_match(r"\$\d+(\.\d{2})?")                       # any dollar amount
regex_match(r"^thank you", flags=re.MULTILINE | re.I) # some line starts with "thank you"

reason: matched 'foo' or no match for '<pattern>'.
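
Because the grader follows re.search semantics, the pattern may match anywhere in the output, and re.MULTILINE changes what ^ anchors to. The sketch below uses only the standard-library re module rather than regex_match itself, and the sample output string is made up for illustration:

```python
import re

output = "Your order shipped.\nThank you for your patience. Refund: $12.50"

# re.search scans the whole string, so a pattern can match mid-output.
amount = re.search(r"\$\d+(\.\d{2})?", output)

# re.match only tries position 0, so the same pattern finds nothing here.
anchored = re.match(r"\$\d+(\.\d{2})?", output)

# With re.MULTILINE, ^ matches at the start of any line, not just the first.
polite_line = re.search(r"^thank you", output, flags=re.MULTILINE | re.I)

# Without re.MULTILINE, ^ only matches at the very start of the output.
polite_opening = re.search(r"^thank you", output, flags=re.I)
```

If the contract is "the reply opens with a thank-you", dropping re.MULTILINE is probably the tighter check.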

`tool_called(name, *, min_times=1)`

Pass iff tool name was called at least min_times times.

tool_called("lookup_order")
tool_called("retry", min_times=2)

reason: tool 'lookup_order' called N time(s), required >= M.

`no_tool_called(name)`

Pass iff tool name was not called.

no_tool_called("send_email")           # don't email people from a status query

`tool_sequence(sequence, *, strict=False)`

Pass iff the tool-call sequence matches sequence.

tool_sequence(["authenticate", "lookup_order"])              # subsequence (default)
tool_sequence(["authenticate", "lookup_order"], strict=True) # exact equality

| Mode | Behavior | When to use |
| --- | --- | --- |
| `strict=False` | `sequence` must appear as a subsequence. Other tools may interleave. | Lock the order of important tools without forbidding new ones. |
| `strict=True` | The actual list must equal `sequence` exactly. | Lock the entire pipeline shape (a tighter contract). |
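
The two modes can be sketched in plain Python. This is an illustration of the comparison described above, not the library's implementation; is_subsequence and matches are hypothetical helper names:

```python
from typing import Sequence


def is_subsequence(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """True if `expected` appears in `actual` in order, gaps allowed."""
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the first hit,
    # so each expected tool must be found after the previous one.
    return all(name in remaining for name in expected)


def matches(expected: Sequence[str], actual: Sequence[str], strict: bool = False) -> bool:
    """Hypothetical stand-in for the comparison tool_sequence performs."""
    if strict:
        return list(actual) == list(expected)
    return is_subsequence(expected, actual)


calls = ["authenticate", "check_inventory", "lookup_order"]
expected = ["authenticate", "lookup_order"]

loose = matches(expected, calls)                 # extra tool in between is tolerated
exact = matches(expected, calls, strict=True)    # extra tool breaks exact equality
```

Here loose is True and exact is False: the interleaved check_inventory call is fine in subsequence mode but fails the strict comparison.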

`output_length_lt(max_chars)`

Pass iff len(output) < max_chars.

output_length_lt(500)                   # keep replies terse

`latency_lt_ms(max_ms)`

Pass iff trace.total_latency_ms < max_ms.

latency_lt_ms(5_000)                    # under 5 s

total_latency_ms is the sum of every recorded LLMCall.latency_ms and ToolCall.latency_ms. Set it accurately in your agent (or use an SDK adapter, which sets it for you) — otherwise this grader trivially passes.
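
A rough sketch of why unrecorded latencies make the check meaningless. The dataclasses are simplified stand-ins, and the llm_calls / tool_calls field names are assumptions, not the library's actual Trace layout:

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins: the real LLMCall, ToolCall, and Trace classes
# carry more fields than are shown here.
@dataclass
class LLMCall:
    latency_ms: float = 0.0


@dataclass
class ToolCall:
    latency_ms: float = 0.0


@dataclass
class Trace:
    llm_calls: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)

    @property
    def total_latency_ms(self) -> float:
        # Sum of every recorded LLM-call and tool-call latency.
        return (sum(c.latency_ms for c in self.llm_calls)
                + sum(c.latency_ms for c in self.tool_calls))


def under_budget(trace: Trace, max_ms: float) -> bool:
    return trace.total_latency_ms < max_ms


timed = Trace([LLMCall(1200.0), LLMCall(800.0)], [ToolCall(350.0)])
untimed = Trace([LLMCall(), LLMCall()], [ToolCall()])  # latencies never recorded
```

The timed trace totals 2350 ms and is checked against the budget as intended. The untimed trace totals 0 ms, so it passes any positive budget no matter how slow the agent really was.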

`cost_lt_usd(max_usd)`

Pass iff trace.total_cost_usd < max_usd.

cost_lt_usd(0.02)                       # under 2 cents per case

total_cost_usd is the sum of LLMCall.cost_usd. The OpenAI / Anthropic adapters fill it in from the bundled price table; with manual instrumentation, you have to set it yourself.
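
For manual instrumentation, the per-call cost could be computed along these lines. This is a sketch only: the PRICES table, the example-model name, and the per-million-token pricing scheme are placeholders, not the contents of the bundled price table:

```python
# Hypothetical per-million-token prices in USD; these are placeholder
# numbers, not the values in the bundled price table.
PRICES = {"example-model": {"input": 1.00, "output": 2.00}}


def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call under the placeholder price table."""
    price = PRICES[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000


# Two calls in one case; total cost is the sum of the per-call costs.
call_costs = [
    call_cost_usd("example-model", 2_000, 500),
    call_cost_usd("example-model", 4_000, 1_000),
]
total_cost_usd = sum(call_costs)
within_budget = total_cost_usd < 0.02
```

Each value in call_costs is what you would store as that call's cost_usd, so that the trace total reflects real spend.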

Semantic grader

`semantic(rubric, *, judge=None)`

Pass iff judge(rubric, trace) returns (True, _).

from agentprdiff.graders import semantic
semantic("agent acknowledged the refund and explained the timeline")

The default judge is selected by env vars (AGENTGUARD_JUDGE, OPENAI_API_KEY, ANTHROPIC_API_KEY) — see Configuration → Selecting the semantic-grader judge.

Built-in judges

from agentprdiff.graders.semantic import fake_judge, openai_judge, anthropic_judge

# Deterministic; passes iff any rubric word ≥ 4 chars appears in output. Free.
semantic("…", judge=fake_judge)

# OpenAI Chat Completions (default model gpt-4o-mini).
semantic("…", judge=openai_judge(model="gpt-4o-mini"))

# Anthropic Messages API (default model claude-haiku-4-5-20251001).
semantic("…", judge=anthropic_judge(model="claude-haiku-4-5-20251001"))
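
To see why a keyword heuristic like fake_judge's is not real judging, here is a sketch of the rule described in the comment above. It assumes case-insensitive substring matching and takes the output string directly instead of a Trace; the bundled fake_judge may tokenize or compare differently, and keyword_judge is a hypothetical name:

```python
import re


def keyword_judge(rubric: str, output: str) -> tuple[bool, str]:
    """Pass if any rubric word of 4+ characters appears in the output.

    Assumption: matching is case-insensitive and substring-based; the
    bundled fake_judge may tokenize or compare differently.
    """
    keywords = [w.lower() for w in re.findall(r"\w+", rubric) if len(w) >= 4]
    lowered = output.lower()
    hits = [w for w in keywords if w in lowered]
    if hits:
        return True, f"matched rubric keyword {hits[0]!r}"
    return False, "no rubric keyword found in output"


ok = keyword_judge("agent explains the refund timeline", "Your refund is on its way")
miss = keyword_judge("agent explains the refund timeline", "Goodbye")
# The heuristic cannot read intent: a rubric that forbids a word still
# passes when that word shows up in the output.
fooled = keyword_judge("agent must not mention a refund", "Here is your refund")
```

The last call passes even though the output violates the rubric, which is the practical limit of keyword matching.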

Custom judges

A judge is Callable[[str, Trace], tuple[bool, str]]. Anything matching that signature is fair game — see Customization → Custom semantic-grader judges for examples (regex, embedding similarity, finetuned classifier).
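
As a minimal sketch of that signature, here is a regex-based judge. The Trace dataclass is a stand-in that assumes the real Trace exposes the final output as output, and timeline_judge is a hypothetical name:

```python
import re
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical stand-in; assumes the real Trace exposes the final output."""
    output: str


def timeline_judge(rubric: str, trace: Trace) -> tuple[bool, str]:
    """Pass when the output states a duration such as '3 days' or '2-5 business days'."""
    match = re.search(r"\b\d+(\s*-\s*\d+)?\s+(business\s+)?(day|week)s?\b",
                      trace.output, flags=re.I)
    if match:
        return True, f"found timeline {match.group(0)!r}"
    return False, f"no timeline found for rubric {rubric!r}"


verdict = timeline_judge(
    "agent explains the refund timeline",
    Trace(output="Your refund will arrive in 3-5 business days."),
)
```

A judge like this would be handed to the grader the same way as the built-ins, via semantic(rubric, judge=timeline_judge).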

Why not always use semantic?

| Tradeoff | Deterministic | Semantic |
| --- | --- | --- |
| Speed | µs | seconds |
| Cost | free | $$$ per call |
| Determinism | yes | no (without `temperature=0` and even then…) |
| Catches subtle behavior | no | yes |
| Runs free in CI | yes | only with `fake_judge` (which doesn't actually judge) |

Encode the 80% you can express as a rule with deterministic graders, and reserve semantic() for the last 20%. The bundled fake_judge exists so that missing API keys in CI don't break the build: it is not a real judge and doesn't pretend to be one, but it keeps the pipeline running.

Picking the right grader

| Behavior to pin | Grader |
| --- | --- |
| A specific phrase appears | `contains` |
| One of N phrases | `contains_any` |
| A pattern matches | `regex_match` |
| A specific tool fires | `tool_called` |
| A specific tool doesn't fire | `no_tool_called` |
| Tools fire in order | `tool_sequence` |
| Stay terse | `output_length_lt` |
| Latency budget | `latency_lt_ms` |
| Cost budget | `cost_lt_usd` |
| "The agent was empathetic / on-brand / accurate" | `semantic` |

Composing graders

Graders are independent: pass them all in a single list. The runner evaluates each one, and the case passes iff all of them pass:

case(
    name="refund_happy_path",
    input="…",
    expect=[
        contains("refund"),
        regex_match(r"\$\d+\.\d{2}"),
        tool_called("lookup_order"),
        no_tool_called("send_email"),
        tool_sequence(["authenticate", "lookup_order"]),
        output_length_lt(800),
        latency_lt_ms(5_000),
        cost_lt_usd(0.02),
        semantic("agent explains the refund timeline"),
    ],
)

There's no AND/OR/NOT combinator language — that's by design. If you need OR, write a custom grader that calls two built-ins and returns a combined result.
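
One way such an OR grader might look. The Trace and GradeResult dataclasses and the simplified contains are stand-ins so the sketch runs on its own; the passed field and the any_of helper are assumptions, and the real GradeResult constructor may differ:

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical stand-ins so the sketch runs on its own; the real Trace and
# GradeResult have their own fields and constructors.
@dataclass
class Trace:
    output: str


@dataclass
class GradeResult:
    passed: bool
    grader_name: str
    reason: str


Grader = Callable[[Trace], GradeResult]


def contains(substring: str) -> Grader:
    """Simplified stand-in for the built-in contains grader."""
    def grade(trace: Trace) -> GradeResult:
        ok = substring.lower() in trace.output.lower()
        verb = "contains" if ok else "does not contain"
        return GradeResult(ok, f"contains({substring!r})", f"output {verb} {substring!r}")
    return grade


def any_of(*graders: Grader) -> Grader:
    """Custom grader that passes when at least one wrapped grader passes."""
    def grade(trace: Trace) -> GradeResult:
        results = [g(trace) for g in graders]
        return GradeResult(
            passed=any(r.passed for r in results),
            grader_name="any_of(" + ", ".join(r.grader_name for r in results) + ")",
            reason="; ".join(r.reason for r in results),
        )
    return grade


refund_or_credit = any_of(contains("refund"), contains("store credit"))
result = refund_or_credit(Trace(output="We have issued a store credit."))
```

Joining the wrapped graders' reasons keeps a failure readable: the combined result shows which alternatives were tried and why each one did not match.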
