Core Concepts¶
agentprdiff has a small vocabulary. Internalize these five nouns and the
rest of the docs read like glue.
The five nouns¶
classDiagram
direction LR
class Suite {
+str name
+AgentFn agent
+list~Case~ cases
}
class Case {
+str name
+Any input
+list~Grader~ expect
+list~str~ tags
}
class Trace {
+str suite_name
+str case_name
+Any input
+Any output
+list~LLMCall~ llm_calls
+list~ToolCall~ tool_calls
+float total_cost_usd
+float total_latency_ms
+Optional~str~ error
}
class Grader {
<<callable>>
Trace -> GradeResult
}
class GradeResult {
+bool passed
+str grader_name
+str reason
}
Suite "1" --> "*" Case
Case "*" --> "*" Grader : expect
Suite ..> Trace : produces per case
Grader ..> GradeResult
Suite¶
A named group of Cases sharing one agent under test. Built with the
suite(name=, agent=, cases=, description=) factory.
Case¶
One input, plus the assertions that must hold for the resulting trace.
Built with case(name=, input=, expect=, tags=).
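A minimal sketch of wiring the two factories together, assuming suite, case, and the cost_lt_usd built-in are importable from the top-level agentprdiff package (the import path and the threshold-argument form of cost_lt_usd are assumptions; the agent and all names are placeholders):

```python
from agentprdiff import suite, case, cost_lt_usd  # import path assumed

def answer_question(user_input: str) -> str:
    # Your real agent goes here; a stub keeps the sketch self-contained.
    return f"echo: {user_input}"

refund_suite = suite(
    name="refund-bot",
    agent=answer_question,
    description="Smoke cases for the refund flow.",
    cases=[
        case(
            name="simple-refund",
            input="I want a refund for order 1234",
            expect=[cost_lt_usd(0.05)],  # built-in grader; threshold form assumed
            tags=["smoke"],
        ),
    ],
)
```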
Trace¶
The complete record of one agent run. JSON-serializable, designed to be written to disk as a baseline and compared across versions of your agent. Tracks:
- the original input and final output,
- every LLMCall (provider, model, tokens, cost, latency, output text),
- every ToolCall (name, arguments, result, latency, error),
- aggregate metrics (total_cost_usd, total_latency_ms, etc.),
- a top-level error if the agent raised.
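Because a baseline is just the Trace serialized to JSON, you can poke at one directly. A sketch, reusing the hypothetical suite/case names from the factory example above and assuming the JSON keys mirror the attribute names listed here:

```python
import json
from pathlib import Path

# Path layout comes from "The four objects on disk" below; the suite/case names
# are the placeholder ones from the earlier sketch.
baseline = json.loads(
    Path(".agentprdiff/baselines/refund-bot/simple-refund.json").read_text()
)

print(baseline["output"])          # final agent output
print(baseline["total_cost_usd"])  # aggregate cost across llm_calls
for call in baseline["tool_calls"]:
    print(call)                    # name, arguments, result, latency, error
```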
Grader¶
Any callable (Trace) -> GradeResult. Ten built-ins ship with the package;
you can add your own by writing a function. See
Graders reference.
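Writing your own looks roughly like this; the sketch assumes Trace and GradeResult are importable from the top-level package and that GradeResult accepts keyword arguments:

```python
from agentprdiff import Trace, GradeResult  # import path assumed

def mentions_order_id(trace: Trace) -> GradeResult:
    """Pass when the final output echoes an order id back to the user."""
    ok = "order" in str(trace.output).lower()
    return GradeResult(
        passed=ok,
        grader_name="mentions_order_id",
        reason=(
            "output references the order"
            if ok
            else f"expected the reply to mention the order, got: {trace.output!r}"
        ),
    )
```

Attach it exactly like a built-in: expect=[mentions_order_id, ...].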
GradeResult¶
{passed: bool, grader_name: str, reason: str}. The reason is what shows
up in the terminal table when an assertion fails — write reasons that are
useful to a future on-call engineer.
How a run flows internally¶
sequenceDiagram
autonumber
participant CLI
participant Loader
participant Runner
participant Agent as User Agent
participant Store as BaselineStore
participant Differ
participant Reporter
CLI->>Loader: load_suites(suite.py)
Loader-->>CLI: [Suite, ...]
loop each Suite
loop each Case
Runner->>Agent: agent(case.input)
Agent-->>Runner: (output, Trace) | output
Runner->>Runner: graders(trace) → results
alt mode = record
Runner->>Store: save_baseline(trace)
else mode = check
Runner->>Store: save_run_trace(trace)
Runner->>Store: load_baseline(suite, case)
Store-->>Runner: baseline | None
Runner->>Differ: diff_traces(baseline, current, results)
Differ-->>Runner: TraceDelta
end
end
Runner-->>Reporter: RunReport
end
Reporter->>CLI: terminal output, JSON artifact
CLI-->>CLI: exit 1 if any case has_regression else 0
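The agent call in the diagram is the only interface your code has to satisfy: the runner invokes it with case.input and accepts either the final output alone or an (output, Trace) pair when you assembled a trace yourself. A minimal sketch of the simpler form (the name is a placeholder):

```python
def my_agent(user_input: str) -> str:
    # Plain code: call your model and tools however you like; echoed here for brevity.
    # Returning an (output, Trace) pair instead is also accepted, but building a
    # Trace by hand is not covered on this page.
    return f"echo: {user_input}"
```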
The four objects on disk¶
.agentprdiff/
├── baselines/ ← committed
│ └── <suite_name>/
│ └── <case_name>.json ← the canonical Trace
└── runs/ ← gitignored
└── 20260425T195727Z/ ← timestamped per check
└── <suite_name>/
└── <case_name>.json ← the run's Trace
- baselines/ is the contract. It's a JSON snapshot of a Trace, pretty-printed for readable git diffs.
- runs/ is the audit trail of recent local executions. It's gitignored by default, never reaches CI, and is safe to delete with rm -rf .agentprdiff/runs/.
A --json-out PATH flag on record and check writes a single report
file at PATH (overwriting on every run). That's the artifact you upload
in CI for downstream automation.
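A downstream-automation sketch that consumes that artifact, assuming it was written by agentprdiff check --json-out report.json. The report schema is not spelled out on this page, so the keys below (cases, name, has_regression) are assumptions; inspect one real report before relying on them:

```python
import json
import sys

with open("report.json") as fh:
    report = json.load(fh)

# Collect cases flagged as regressions (key names assumed).
regressed = [c["name"] for c in report.get("cases", []) if c.get("has_regression")]

if regressed:
    print("regressed cases:", ", ".join(regressed))
    sys.exit(1)  # mirrors the CLI's own exit-code convention
```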
What "regression" means¶
A case is flagged as a regression when any of the following are true:
| Cause | Detected by |
|---|---|
| A grader that previously passed now fails | AssertionChange.is_regression |
| The agent raised in the current run, but not in baseline | TraceDelta.has_regression |
| There is no baseline yet, and any current grader failed | CaseReport.has_regression (first-run-bad is still bad) |
CaseReport.has_regression is the per-case answer.
RunReport.has_regression is any(c.has_regression for c in cases).
The CLI maps this to a process exit code: 1 for any regression, 0
otherwise. agentprdiff review always exits 0 regardless — it's a
local-iteration tool, not a CI gate.
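Spelled out as code, the per-case rule looks roughly like this. It is only a sketch of the table above; the real logic lives on AssertionChange, TraceDelta, and CaseReport, and the dict shapes here are illustrative:

```python
def case_has_regression(baseline: dict | None, current: dict) -> bool:
    # baseline/current are illustrative: {"results": {grader_name: passed}, "error": ...}
    if baseline is None:
        # First run: nothing to compare against, but failing graders still count.
        return not all(current["results"].values())
    previously_passed_now_fails = any(
        baseline["results"].get(name, False) and not passed
        for name, passed in current["results"].items()
    )
    raised_now_but_not_before = (
        current["error"] is not None and baseline["error"] is None
    )
    return previously_passed_now_fails or raised_now_but_not_before
```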
Cost / latency / tokens are diffed too¶
Even if every assertion still passes, the differ records:
- cost_delta_usd — change in total cost since baseline.
- latency_delta_ms — change in total latency since baseline.
- prompt_tokens_delta / completion_tokens_delta.
- tool_sequence_changed — boolean; the sequences themselves are kept on the delta so reporters can show the before/after.
- output_changed + output_diff — a difflib-style unified diff of the final output text.
A delta with no failed graders does not fail CI. It still shows up in
the terminal table so reviewers see drift early. To gate on cost or latency
specifically, attach cost_lt_usd / latency_lt_ms graders to the case.
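A sketch of that gating pattern, assuming both built-ins take a numeric bound and share the same import path as the factories:

```python
from agentprdiff import case, cost_lt_usd, latency_lt_ms  # import path assumed

gated = case(
    name="simple-refund",
    input="I want a refund for order 1234",
    expect=[
        cost_lt_usd(0.05),     # fails the case unless total_cost_usd stays under $0.05
        latency_lt_ms(8_000),  # fails the case unless total_latency_ms stays under 8s
    ],
)
```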
What agentprdiff is not¶
- It does not parse GitHub PR diffs.
- It does not analyse source code (no AST work, no static analysis).
- It does not install or run an agent framework — your agent is plain code.
- It does not record screen video, browser sessions, or RAG retrievals unless you record them yourself onto the Trace.
- It does not auto-merge baseline updates. That's the whole point — the PR author re-records intentionally and reviewers see the diff.
Quick reading order¶
- Quickstart — the happy path.
- Usage / Basic — wiring patterns.
- Scenarios — runnable end-to-end examples.
- Architecture — internals when you need to extend it.