Roadmap

agentprdiff is alpha (0.2.x). The core model, CLI, and OpenAI / Anthropic SDK adapters are stable. The OpenAI adapter covers both sync OpenAI and async AsyncOpenAI clients via the same instrument_client context manager.
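
For orientation, a minimal sketch of instrumenting a sync client. Only the instrument_client context manager and the recorded LLMCall entries come from the description above; the import path, model name, and the surrounding agent function are placeholders.

```python
# Illustrative sketch -- the agentprdiff import path below is an assumption,
# not the documented module layout.
from openai import OpenAI
from agentprdiff.adapters.openai import instrument_client  # assumed path

client = OpenAI()

def answer(prompt: str) -> str:
    # Every completion made inside the block is recorded as an LLMCall
    # entry in the baseline; an AsyncOpenAI client goes through the same API.
    with instrument_client(client):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
    return resp.choices[0].message.content
```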

On the 0.3 roadmap

  • Async Anthropic adapter. Today the Anthropic adapter is sync only. Mirror the OpenAI sync/async detection so AsyncAnthropic clients work with the same instrument_client API.
  • LangChain / LangGraph adapter. An instrument_runnable(chain) context manager that records every chain / tool / LLM call as LLMCall / ToolCall entries; see the sketch after this list.
  • Vercel AI SDK companion. A small JS package that produces the same baseline JSON format from ai/sdk agents, so JS and Python shops can share one CI gate.
  • Tag-based filtering. Today --case / --skip match case names; add --tag smoke / --no-tag slow for cases marked with tags=[...] (sketched after this list).
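
A rough sketch of how the proposed LangChain adapter might read. instrument_runnable does not exist yet; its import path and the chain below are purely illustrative.

```python
# Speculative: instrument_runnable is a roadmap item, not a shipped API.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from agentprdiff.adapters.langchain import instrument_runnable  # proposed, not real yet

prompt = ChatPromptTemplate.from_messages([("human", "Summarize: {text}")])
chain = prompt | ChatOpenAI(model="gpt-4o-mini")

# Proposed behavior: each chain / tool / LLM invocation inside the block
# would be recorded as an LLMCall or ToolCall entry in the baseline.
with instrument_runnable(chain):
    chain.invoke({"text": "three paragraphs of support-ticket history ..."})
```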
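
And a sketch of what tagging could look like on the case side. Only the tags=[...] field and the --tag / --no-tag flags come from the roadmap item; the case names, inputs, and the exact case(...) signature shown here are invented.

```python
# Hypothetical case definitions -- the exact case(...) signature is assumed.
case(
    "refund_flow",
    input="Customer asks for a refund on order #1234",
    tags=["smoke"],
)
case(
    "bulk_export",
    input="Export all 50k orders as CSV",
    tags=["slow"],
)
```

On the CLI, --tag smoke would select only the first case and --no-tag slow would drop the second, independent of the existing --case / --skip name matching.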

Under consideration

  • Parallel case execution. Run cases within a suite concurrently with a thread or process pool. Useful for suites where each case is bottlenecked on network latency.
  • Streaming reporter. Print case results as they finish instead of buffering until the end. Important for long suites where the user wants early signal.
  • GitHub annotations reporter. Surface regressions as ::error file=...,line=... annotations on the PR diff so the failing case shows up next to the line that changed; see the example after this list.
  • JUnit XML reporter. First-class CI integration with test-result aggregators that already understand JUnit.
  • Bedrock / Vertex AI adapters. Both have their own response shapes; workable today via manual instrumentation, but a first-class adapter would be welcome.
  • Cost-budget delta graders. Today cost_lt_usd(0.02) is an absolute ceiling. Adding cost_increase_lt_pct(20) would flag a jump of more than 20% even if the absolute number is still under the ceiling; a worked example follows this list.
  • Pluggable input fixtures. A case(input=fixture("orders.csv:row=4")) shape so cases can refer to large structured inputs without inlining them.
  • Replay mode. Re-run a recorded baseline through the differ without invoking the agent — useful for regression tests of agentprdiff itself and for benchmark drift detection.
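
For the annotations reporter, the target output is GitHub's existing workflow-command syntax; the file, line, and message below are made up, only the ::error file=...,line=... shape is fixed by GitHub.

```
::error file=agents/support.py,line=88::agentprdiff: case "refund_flow" regressed (cost_lt_usd: 0.031 > 0.02)
```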
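
To make the delta-grader idea concrete: if the committed baseline ran a case at $0.010 and the PR run costs $0.014, that is a 40% increase, so cost_lt_usd(0.02) stays green while cost_increase_lt_pct(20) would fail. A sketch of the two side by side; the graders=[...] wiring is invented, only the two grader names come from the roadmap item.

```python
# Hypothetical wiring -- only cost_lt_usd and the proposed cost_increase_lt_pct
# are named in the roadmap; the graders=[...] parameter is assumed.
case(
    "refund_flow",
    input="Customer asks for a refund on order #1234",
    graders=[
        cost_lt_usd(0.02),         # absolute ceiling (exists today)
        cost_increase_lt_pct(20),  # proposed: fail on a jump of more than 20% vs. baseline
    ],
)
```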

Out of scope

  • A hosted SaaS. The point of committed baselines is that the diff lives next to the code; a hosted store undoes that.
  • A new agent framework. agentprdiff deliberately does not care how your agent is built.
  • Pairwise / ELO evaluation. Different problem.
  • Auto-merging baseline updates. The PR diff in .agentprdiff/baselines/ is the review surface — automating it away defeats the whole point.

How to weigh in

  • Open an issue with the proposal label.
  • Reference real adoption pain — the more concrete the better.
  • A working PR is the strongest argument.

The maintainer is one person; bandwidth is finite. Small, focused PRs that fit the Contributing scope merge fastest.