Baseline. Compare. Ship.

Know when your agents break before your users do.

Capture golden traces. Detect regressions automatically. Compare agent runs across models, prompts, and configs. CI/CD native.

$ agentprobe run --baseline v2.4 --candidate v2.5

Loading golden traces (47 scenarios)...
Running candidate against baselines...

task_success: 0.96 (baseline: 0.94) +2.1%
tool_accuracy: 0.98 (baseline: 0.97) +1.0%
faithfulness: 0.71 (baseline: 0.85) -16.5%
cost_per_task: $0.12 (baseline: $0.08) +50%

REGRESSION DETECTED — faithfulness dropped below threshold
Divergence at step 23: agent hallucinated tool response
Trace diff: agentprobe diff abc123 def456
60% fewer production incidents with structured eval
32% of teams cite quality as top agent deployment blocker
57% of organizations now run agents in production
How It Works

Golden Traces

Record successful agent runs as golden baselines. Every prompt, tool call, and result captured. Replay anytime to verify new versions match expected behavior.
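A minimal sketch of the record-and-replay idea, independent of any AgentProbe API (the function names, the JSON layout, and the `golden_traces` directory are all illustrative assumptions):

```python
import json
from pathlib import Path

def record_golden_trace(run_id, steps, path="golden_traces"):
    # Persist every step of a successful run (prompt, tool call, result)
    # as a JSON baseline. Names and layout here are hypothetical.
    Path(path).mkdir(exist_ok=True)
    with open(Path(path) / f"{run_id}.json", "w") as f:
        json.dump({"run_id": run_id, "steps": steps}, f, indent=2)

def replay_matches(run_id, candidate_steps, path="golden_traces"):
    # Replay check: does a new run reproduce the recorded behavior?
    with open(Path(path) / f"{run_id}.json") as f:
        golden = json.load(f)["steps"]
    return candidate_steps == golden
```

Storing traces as plain JSON keeps them diffable and versionable alongside the code they guard.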

Regression Detection

Automatically compare candidate runs against baselines. Surface divergences at the exact step they occur. Block deploys that break agent quality.
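Pinpointing the exact step of divergence (like "step 23" in the demo output above) reduces to a pairwise walk over the two traces. A generic sketch, not AgentProbe's implementation:

```python
def first_divergence(baseline_steps, candidate_steps):
    # Return the index of the first step where the candidate run
    # diverges from the baseline trace, or None if they match.
    for i, (b, c) in enumerate(zip(baseline_steps, candidate_steps)):
        if b != c:
            return i
    # One trace ending early is itself a divergence.
    if len(baseline_steps) != len(candidate_steps):
        return min(len(baseline_steps), len(candidate_steps))
    return None
```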

Cohort Comparison

Run the same task across model versions, prompt variations, or harness configs. Side-by-side trajectory diffs show you exactly what changed and why.
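The cohort pattern is simply "one task, many configurations, results keyed for side-by-side diffing." A conceptual sketch in which `run_agent` and the config shapes are assumptions, not AgentProbe's API:

```python
def compare_cohorts(task, run_agent, configs):
    # Run one task under several named configs (model, prompt, harness)
    # and return results keyed by config name for side-by-side comparison.
    # `run_agent` is a caller-supplied callable; its signature is assumed.
    return {name: run_agent(task, **cfg) for name, cfg in configs.items()}
```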

Multi-Dimensional Scoring

Task success, tool accuracy, faithfulness, cost per run. Set thresholds per metric. One regression in any dimension blocks the pipeline.
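Per-metric gating boils down to comparing each delta against its own tolerance, remembering that for cost-style metrics an *increase* is the regression. A sketch of that logic under assumed metric names and thresholds:

```python
def gate(candidate, baseline, thresholds, lower_is_better=("cost_per_task",)):
    # Compare candidate metrics to baseline; any single regression
    # beyond its tolerance blocks the pipeline. All names illustrative.
    failures = []
    for metric, tolerance in thresholds.items():
        delta = candidate[metric] - baseline[metric]
        regressed = (delta > tolerance) if metric in lower_is_better \
            else (delta < -tolerance)
        if regressed:
            failures.append((metric, round(delta, 4)))
    return failures  # empty list means the deploy may proceed
```

Fed the numbers from the demo run above (with an assumed 0.05 faithfulness tolerance), this flags exactly the faithfulness drop.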

Agents are shipping faster than teams can evaluate them.

AgentProbe makes quality the gate, not an afterthought. Baseline your agents today, catch regressions tomorrow, and ship with confidence.