Overview¶
agentprdiff is a Python library and CLI for snapshot testing of LLM
agents. It records what your agent did on a known-good run, commits that
record to git, and on every subsequent run computes a structured trace
delta — pass/fail flips, cost change, latency change, tool-call sequence
changes, and a unified text diff of the final output. If anything regressed,
the CLI exits non-zero and your CI build fails.
Despite the name, `agentprdiff` does not parse GitHub pull-request diffs. It produces the diff a PR reviewer needs in order to reason about an agent change: what did the agent do before, what does it do now, and which assertions about its behavior just flipped?
What problem does it solve?¶
Three problems collapse into one:
- You can't unit-test a stochastic system. Asserting `output == expected` on an LLM call is wrong by construction.
- LLM-as-judge evals are too slow and too expensive to run on every PR. They're great for offline benchmarking, not for the inner-loop CI gate.
- Hidden behavioral drift ships to production. A model bump from `gpt-4o` to `gpt-4o-mini`, a rewritten system prompt, or a vendor swap silently changes which tool fires, what gets quoted to the user, or how much a query costs — and you find out from the support queue.
agentprdiff sits in the middle. The 80% of agent behavior you can encode
as deterministic rules ("the `lookup_order` tool was called", "the word
`refund` appeared", "cost stayed under $0.02") becomes a fast, free,
deterministic CI check. The 20% that needs judgment becomes a `semantic()`
grader with a pluggable LLM judge — and a `fake_judge` fallback so CI stays
green without API keys.
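To make that split concrete, here is a minimal sketch of what a suite file might look like. Everything in it is illustrative: the import paths, the `Suite`/`Case` keyword arguments, and the `judge=` parameter are assumptions for this sketch, not the library's confirmed signatures; see the API reference for the real ones.

```python
# suite.py -- a hypothetical sketch; import paths and signatures are
# illustrative assumptions, not the library's confirmed API.
from agentprdiff import Suite, Case
from agentprdiff.graders import contains, tool_called, cost_lt_usd, semantic
from agentprdiff.judges import fake_judge

from my_app.agents import support_agent  # your agent, unmodified

suite = Suite(
    name="support",
    agent=support_agent,  # any callable: agent(case.input) -> output
    cases=[
        Case(
            name="refund-request",
            input="I want a refund for order #1234",
            graders=[
                # The deterministic 80%: fast, free, no API keys needed.
                tool_called("lookup_order"),
                contains("refund"),
                cost_lt_usd(0.02),
                # The judgment-needing 20%: falls back to a deterministic
                # fake judge so CI stays green without API keys.
                semantic("politely explains the refund policy",
                         judge=fake_judge),
            ],
        ),
    ],
)
```

Because suites are plain module-level objects, the loader can discover them by simply importing the file; no registration step, no YAML.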
Why it exists¶
You upgraded Claude. You tweaked a system prompt. You swapped `gpt-4o` for `gpt-4o-mini` in the cheap path. Which of your agent's behaviors just changed? `agentprdiff` tells you — before the PR merges.
It is not a framework. Your agent stays exactly the way it is.
agentprdiff records what it did, lets you assert what should be true about
what it did, and compares runs across time.
Key features¶
| Capability | What it gives you |
|---|---|
| AI-agent adoption playbook | An AGENTS.md at the repo root that Claude Code, Cursor, Aider, or any agentic IDE reads to add the suite for you — finds the production agent, proposes cases, generates files, records baselines. ~15 minutes end-to-end. |
| `agentprdiff scaffold` | One command stamps out the canonical layout with `TODO:` markers — same files the AI agent produces, just empty. |
| Tiny `Suite` / `Case` model | One Python file, no DSL, no YAML. |
| Ten batteries-included graders | `contains`, `contains_any`, `regex_match`, `tool_called`, `tool_sequence`, `no_tool_called`, `output_length_lt`, `latency_lt_ms`, `cost_lt_usd`, `semantic`. |
| Pluggable LLM judge | OpenAI, Anthropic, custom callable, or a deterministic `fake_judge`. |
| JSON baseline store | Committed under `.agentprdiff/baselines/`; reviewers see the diff in PRs. |
| Trace differ | Per-case `TraceDelta`: assertion changes, cost / latency / token deltas, tool-sequence diff, unified output diff. |
| Six-command CLI | `init`, `record`, `check`, `review`, `scaffold`, `diff`. |
| OpenAI + Anthropic SDK adapters | One `with` block auto-records every model and tool call (sync and async OpenAI). |
| OpenAI-compatible providers | Groq, Gemini, OpenRouter, Ollama, vLLM, Together, Fireworks, DeepInfra. |
| Case filters | `--case`, `--skip`, globs, negation, `--list` — like `pytest -k`. |
| Local iteration loop | `agentprdiff review` always exits 0 with a verbose per-case panel. |
| CI-friendly outputs | Rich terminal table, `--json-out` artifact, exit 1 on regression. |
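The SDK adapters in the table above are what make recording zero-touch. The sketch below shows the shape of that idea for OpenAI; the context manager name `record_openai` and its import path are hypothetical placeholders, since only the "one `with` block auto-records every model and tool call" contract comes from the feature table.

```python
# A hypothetical sketch of the adapter idea. `record_openai` is a
# placeholder name; see the API reference for the real entry point.
from openai import OpenAI
from agentprdiff.adapters import record_openai  # assumed import path

client = OpenAI()

def support_agent(user_message: str) -> str:
    # Inside the block, every completion and tool call made through
    # `client` is captured into the current case's Trace automatically.
    with record_openai(client):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_message}],
        )
    return response.choices[0].message.content
```

The agent body itself is unchanged; the block only observes the client it wraps, which is what keeps agentprdiff a recorder rather than a framework.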
High-level architecture¶
```mermaid
flowchart LR
    A[Your agent code] -->|"agent(input)"| R[Runner]
    S[suite.py] -->|loads suites| L[Loader]
    L --> R
    R -->|run_agent| T[Trace]
    R -->|grade| G[Graders]
    G --> GR[GradeResult]
    R --> ST[BaselineStore]
    ST -->|"record / check"| FS[(.agentprdiff/<br/>baselines + runs)]
    R --> D[Differ]
    D --> TD[TraceDelta]
    TD --> RP[Reporters]
    RP --> TERM[Terminal]
    RP --> JSON[JSON artifact]
```
The path through the system on a single CLI invocation is:
- Loader imports your suite file and harvests every module-level `Suite`.
- Runner invokes `agent(case.input)` for each case, building a `Trace`.
- Graders run against the resulting trace, producing `GradeResult`s.
- BaselineStore either saves (`record` mode) or loads (`check` mode) the baseline JSON for that suite/case.
- Differ computes a `TraceDelta` (cost / latency / tokens / tool sequence / output / per-grader pass-fail flips); see the sketch after this list.
- Reporters render to the terminal and optionally to a JSON artifact.
- The CLI exits `1` when any case has a regression (pass→fail flip, new exception, missing baseline + failing assertion).
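The two data shapes at the heart of that pipeline are the `Trace` (what one run did) and the `TraceDelta` (how the current run differs from the committed baseline). The field names in the sketch below are assumptions inferred from the deltas listed above; treat them as a mental model rather than the library's actual class definitions.

```python
# Illustrative only: field names are assumptions inferred from the
# walkthrough above, not agentprdiff's actual definitions.
from dataclasses import dataclass, field

@dataclass
class Trace:
    """What one agent run looked like."""
    output: str                                           # final text returned
    tool_calls: list[str] = field(default_factory=list)  # ordered tool names
    cost_usd: float = 0.0
    latency_ms: float = 0.0
    tokens: int = 0

@dataclass
class TraceDelta:
    """How the current run differs from the baseline."""
    flipped_graders: dict[str, bool]  # grader name -> passing now?
    cost_delta_usd: float
    latency_delta_ms: float
    token_delta: int
    tool_sequence_diff: list[str]     # changes in the ordered tool-call list
    output_diff: str                  # unified diff, baseline vs. current output

    @property
    def regressed(self) -> bool:
        # Mirrors the CLI's exit-1 rule: any pass -> fail flip is a regression.
        return any(not passing for passing in self.flipped_graders.values())
```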
The fastest adoption path¶
Open Claude Code, Cursor, Aider, or any agentic IDE in your project and
paste the recommended adoption prompt.
The assistant reads AGENTS.md, finds your production agent, asks you
to confirm 5-10 case contracts, and writes the entire suite + CI
workflow. You stay in the loop for two checkpoints (~3 minutes of
review). Total time: ~15-20 minutes from pip install to green CI.
If you'd rather hand-write the suite, the Quickstart also has a manual path that walks through every file.
Where to next¶
- New here? Jump to the Quickstart.
- Curious about the design? Core concepts.
- Want runnable patterns? Scenarios & examples.
- Looking for a function or flag? API or CLI reference.