Troubleshooting¶
The fastest way to fix the errors you're most likely to hit.
"no module named agentprdiff"¶
You're running from a shell that doesn't see the install. Common causes:
- Wrong virtualenv. Activate the one where you ran `pip install agentprdiff`.
- Installed with `pipx`, but the `pipx` shim isn't on `PATH`.
- Editable install in a different repo. Run `pip install -e .` again here.
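A quick way to see which interpreter and install your shell is actually using:

```bash
which python
python -m pip show agentprdiff   # prints a Location: line if this interpreter can see the install
```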
"no such file: suite.py"¶
- The path is relative to the current working directory, not to `.agentprdiff/`. `cd` into the suite's parent directory or pass an absolute path.
- Glob expansion didn't run. Don't quote the pattern when you want the shell to expand it: `agentprdiff check 'suites/*.py'` ← wrong; `agentprdiff check suites/*.py` ← right (the shell expands it).
" defines no module-level Suite objects"¶
Only module-level `Suite` instances are picked up. Bind the result of
`suite(...)` to a variable at the top level:
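A minimal sketch of what the loader expects; the import path and the `suite(...)` arguments here are illustrative, not the exact API:

```python
# suite.py (sketch): the loader only sees module-level bindings
from agentprdiff import suite   # import path is an assumption

refunds = suite("refunds", cases=[...])      # picked up: bound at module level

def build_suite():
    return suite("hidden", cases=[...])      # NOT picked up: never bound at module level
```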
"no cases matched --case/--skip filters."¶
Exit code 2. The filter matched zero cases. Common causes:
- Typo in the case name (filtering is case-insensitive but exact about spelling).
- You qualified by suite (`billing:refund*`) but the suite name is different in your file.
- You added a `~` (negate) pattern that excluded everything.
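If you're not sure what a filter matches, loosen it and then tighten one step at a time. The flag and pattern forms below are the ones mentioned above; exact matching rules depend on your agentprdiff version:

```bash
agentprdiff check suite.py --case 'refund*'            # match by case name
agentprdiff check suite.py --case 'billing:refund*'    # qualified by suite name
```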
"Trace pydantic validation error" when loading a baseline¶
Someone hand-edited a baseline JSON, or a major schema change happened. Restore the file from git, or delete it and re-record:
```bash
git checkout HEAD -- .agentprdiff/baselines/<suite>/<case>.json
# or
rm .agentprdiff/baselines/<suite>/<case>.json
agentprdiff record suite.py --case <case>
```
Never edit baselines by hand — they exist to be regenerated.
"REGRESSION" but the diff looks identical¶
A grader's `passed` result flipped even though the trace shape looks the same.
Usually one of:
- A grader was added to the case since the baseline. Its `baseline_passed` is `None`, which is treated as previously-passing-or-absent; if it now fails, that counts as a regression. Re-record the baseline if the current behavior is correct.
- The grader's `name` changed (e.g. you switched from `contains("refund")` to `contains("Refund", case_sensitive=True)`). The differ keys per-assertion regression detection by `grader_name`, so a renamed grader looks new. Same fix: re-record.
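The re-record step is the same one from the baseline section above:

```bash
agentprdiff record suite.py --case <case>
```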
Adapter says "instrument_client expected an OpenAI-style client"¶
```
TypeError: instrument_client expected an OpenAI-style client with
client.chat.completions.create; got LegacyOpenAI.
```
You're on the legacy `openai` SDK (≤ 0.x) or a wrapper that hides the
`chat.completions.create` attribute. Either upgrade the SDK
(`pip install -U openai`) or switch to manual instrumentation.
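If you're not sure which SDK version your environment actually imports, a quick check (not agentprdiff-specific):

```bash
python -c "import openai; print(openai.__version__)"   # 0.x is the legacy SDK
pip install -U openai                                   # 1.x exposes client.chat.completions.create
```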
"no pricing entry for model 'foo'; cost_usd will be recorded as 0.0"¶
The adapter doesn't know the price of the model you used, so every call is recorded at $0.00 and cost-budget graders will trivially pass. Register a price yourself:
```python
from agentprdiff.adapters import register_prices
register_prices({"foo": (0.0009, 0.0018)})  # ($/1k input, $/1k output)
```
Or pass `prices=` to the specific `instrument_client` call.
semantic() quietly returns PASS even though the model is wrong¶
You probably hit the silent `fake_judge` fallback. Check the banner at
the top of the run output:
```
semantic judge: fake_judge (no AGENTGUARD_JUDGE, no OPENAI_API_KEY/ANTHROPIC_API_KEY — silent fallback)
```
`fake_judge` just matches keywords of ≥ 4 characters. To get a real judge:
```bash
export OPENAI_API_KEY=sk-...
# or
export ANTHROPIC_API_KEY=sk-ant-...
# or pin explicitly
export AGENTGUARD_JUDGE=anthropic
```
Passes locally but fails in CI¶
Almost always one of:
- Different `AGENTGUARD_JUDGE`. Local has `OPENAI_API_KEY` set, CI uses `fake_judge`. Set `AGENTGUARD_JUDGE=fake` (or `=anthropic`/`=openai` with the matching secret) explicitly on both sides so the judge is identical.
- Different pricing table. Local has `register_prices(...)` in a helper that CI doesn't import. Move the call into the suite file (see the sketch after this list).
- Different agent SDK version. Pin it in `requirements.txt` or `constraints.txt`.
- Different stub data. Stubs that read from a JSON file outside the repo don't ship to CI. Inline them.
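For the pricing point, the least surprising place for the call is the top of the suite file itself, so it runs wherever the suite is loaded. A sketch reusing the `register_prices` call from the pricing section above (the model name and prices are illustrative):

```python
# suite.py (sketch): pricing registered where CI will definitely import it
from agentprdiff.adapters import register_prices

register_prices({"foo": (0.0009, 0.0018)})  # ($/1k input, $/1k output)

# ... suite definitions follow ...
```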
"the agent does the right thing but a tool_called grader fails"¶
Your agent called the tool, but the trace doesn't have the call recorded. Two usual causes:
- The agent makes the call internally but doesn't return a `Trace`. The runner can only see what's in the returned trace. Wire up `trace.record_tool_call(ToolCall(...))` (or use an SDK adapter).
- You're using `instrument_tools` but dispatching through the original `TOOL_MAP` instead of the wrapped dict. Make sure your tool loop uses `tools[name](**args)`, not `TOOL_MAP[name](**args)`; see the sketch after this list.
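A minimal sketch of the second point. The exact `instrument_tools` call shape is an assumption; the part that matters is dispatching through the wrapped dict:

```python
from agentprdiff.adapters import instrument_tools   # import path is an assumption

def refund(order_id: str) -> str:
    return f"refunded {order_id}"

TOOL_MAP = {"refund": refund}
tools = instrument_tools(TOOL_MAP)   # wrapped callables that record each call
                                     # (how they attach to the trace depends on your version)

def run_tool(name, args):
    return tools[name](**args)         # recorded in the trace
    # return TOOL_MAP[name](**args)    # NOT recorded: bypasses the wrappers
```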
Asyncio: "RuntimeError: This event loop is already running"¶
You called `asyncio.run` from inside an already-running event loop
(common in Jupyter or some test runners). For agentprdiff, wrap your async agent so it works in both situations:
```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def my_agent(query):
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(my_agent_async(query))   # no loop running: the normal CLI case
    # Inside a running loop (Jupyter, some test runners): run the coroutine
    # on its own loop in a worker thread instead of run_until_complete.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, my_agent_async(query)).result()
```
Long-term, push the asyncio bridge into your eval wrapper, not into production code.
"test_runner.py::test_record_overwrites_baselines fails"¶
You're hacking on agentprdiff itself and a baseline test broke. The most
common cause is a stray `.agentprdiff/` directory left over from a previous
run: the tests use `tmp_path` fixtures, but a top-level `.agentprdiff/` in
your cwd can leak in. `rm -rf .agentprdiff/` and try again.
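The quick reset, assuming the test file lives under a top-level `tests/` directory (adjust the path to match the repo layout):

```bash
rm -rf .agentprdiff/
python -m pytest tests/test_runner.py::test_record_overwrites_baselines -q
```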
Where to file a bug¶
github.com/vnageshwaran-de/agentprdiff/issues
Include:
- Output of `agentprdiff --version`.
- A minimal reproducer (suite + agent stub).
- Full terminal output and the relevant baseline JSON, if applicable.