Know when your agents break before your users do.
Capture golden traces. Detect regressions automatically. Compare agent runs across models, prompts, and configs. CI/CD native.
▸ Loading golden traces (47 scenarios)...
▸ Running candidate against baselines...
✓ task_success: 0.96 (baseline: 0.94) +2.1%
✓ tool_accuracy: 0.98 (baseline: 0.97) +1.0%
✗ faithfulness: 0.71 (baseline: 0.85) -16.5%
⚠ cost_per_task: $0.12 (baseline: $0.08) +50%
REGRESSION DETECTED — faithfulness dropped below threshold
→ Divergence at step 23: agent hallucinated tool response
→ Trace diff: agentprobe diff abc123 def456
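The deltas in a report like this are plain relative changes against the baseline. A minimal sketch of the computation, using the metric values from the sample run above (helper name `pct_change` is illustrative, not an AgentProbe API):

```python
def pct_change(candidate: float, baseline: float) -> float:
    """Relative change of candidate vs. baseline, in percent."""
    return (candidate - baseline) / baseline * 100

# Metric values from the sample report: (candidate, baseline).
report = {
    "task_success": (0.96, 0.94),
    "tool_accuracy": (0.98, 0.97),
    "faithfulness": (0.71, 0.85),
    "cost_per_task": (0.12, 0.08),
}
# Rounded to one decimal, matching the report's display.
deltas = {name: round(pct_change(c, b), 1) for name, (c, b) in report.items()}
```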
Golden Traces
Record successful agent runs as golden baselines. Every prompt, tool call, and result captured. Replay anytime to verify new versions match expected behavior.
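Conceptually, a golden trace is just a serialized record of a known-good run that later runs are checked against. A minimal sketch assuming a simple JSON schema (field names and the exact-match replay check are illustrative, not AgentProbe's actual format):

```python
import json
import tempfile
from pathlib import Path

def record_golden(trace_id: str, steps: list[dict], path: Path) -> None:
    """Persist a known-good run: every prompt, tool call, and result."""
    path.write_text(json.dumps({"trace_id": trace_id, "steps": steps}, indent=2))

def replay_matches(path: Path, candidate_steps: list[dict]) -> bool:
    """Replay check: does a new run reproduce the golden behavior exactly?"""
    return json.loads(path.read_text())["steps"] == candidate_steps

# One tool-calling step captured as a baseline (hypothetical content).
steps = [{"prompt": "look up order 42", "tool": "orders.get",
          "result": {"status": "shipped"}}]
golden_path = Path(tempfile.mkdtemp()) / "golden.json"
record_golden("abc123", steps, golden_path)
```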
Regression Detection
Automatically compare candidate runs against baselines. Surface divergences at the exact step they occur. Block deploys that break agent quality.
Cohort Comparison
Run the same task across model versions, prompt variations, or harness configs. Side-by-side trajectory diffs show you exactly what changed and why.
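Running one task across a cohort amounts to mapping the task over each configuration and collecting the results for side-by-side comparison. A sketch with a stubbed `run_agent` standing in for a real agent harness (model names and scores are entirely hypothetical):

```python
def run_agent(task: str, model: str) -> dict:
    """Stub for an actual agent run; returns per-run metrics.
    In practice this would invoke the agent under the given config."""
    scores = {"model-a": 0.94, "model-b": 0.71}  # hypothetical outcomes
    return {"model": model, "task": task, "faithfulness": scores[model]}

def compare_cohort(task: str, models: list[str]) -> list[dict]:
    """Same task across model versions, collected for trajectory diffing."""
    return [run_agent(task, m) for m in models]

cohort = compare_cohort("summarize the ticket", ["model-a", "model-b"])
```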
Multi-Dimensional Scoring
Task success, tool accuracy, faithfulness, cost per run. Set thresholds per metric. One regression in any dimension blocks the pipeline.
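The gating rule is direction-aware: quality metrics must stay above their floors while cost must stay below its ceiling, and one failure in any dimension blocks the pipeline. A minimal sketch (threshold values are illustrative; the run values come from the sample report above, where only faithfulness fails):

```python
# Illustrative per-metric thresholds: (limit, "min" = floor, "max" = ceiling).
THRESHOLDS = {
    "task_success": (0.90, "min"),
    "tool_accuracy": (0.95, "min"),
    "faithfulness": (0.80, "min"),
    "cost_per_task": (0.15, "max"),
}

def gate(metrics: dict[str, float]) -> list[str]:
    """Names of metrics violating their threshold; an empty list means pass."""
    failures = []
    for name, (limit, kind) in THRESHOLDS.items():
        value = metrics[name]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            failures.append(name)
    return failures

run = {"task_success": 0.96, "tool_accuracy": 0.98,
       "faithfulness": 0.71, "cost_per_task": 0.12}
```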
Agents are shipping faster than teams can evaluate them.
AgentProbe makes quality the gate, not the afterthought. Baseline your agents today, catch regressions tomorrow, and ship with confidence every time.