Web Dashboard

The waza dashboard provides an interactive web interface for exploring evaluation results, comparing models, and tracking metrics.

waza serve

This opens http://localhost:3000 in your browser automatically.

Summary of all evaluation runs:

  • Recent Runs — List of recent evaluations with pass rate and model
  • Pass Rate Trend — Historical pass rate over time
  • Model Comparison — Side-by-side performance metrics
  • Top Failing Tasks — Tasks with lowest pass rate

Detailed results for a single evaluation run:

  • Task List — All tasks with pass/fail status
  • Validator Results — Per-validator scores and messages
  • Execution Stats — Duration, token usage, tool calls
  • Trajectory Viewer — Two-mode trace inspection (see below)
  • Export — Download run as JSON

The trajectory viewer provides two ways to inspect agent execution:

Timeline view — An Aspire-inspired waterfall visualization. Each tool call renders as a horizontal bar spanning its event range, color-coded by status (green = pass, red = fail, yellow = pending). A summary header shows total spans and per-tool call counts (e.g., “bash × 4, edit × 2”). Click any span to open a detail sidebar with arguments, result, duration, and event range.

Events view — A linear transcript showing every event in order (turns, tool calls, errors, partial results). Each entry is expandable for full detail.

Toggle between views with the Timeline / Events buttons at the top of the panel.

The waterfall timeline arranges tool call spans on a horizontal axis proportional to event count. Features:

  • Status indicators — Green ✓, red ✗, yellow ⏳
  • Call indexing — Repeated tools show call number badges (bash #1, bash #2)
  • Span correlation — Start and complete events matched by toolCallId
  • Interleaved support — Handles concurrent or nested tool calls
  • Detail sidebar — Click a span to see arguments, result, and raw event data; press Escape to close
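The span-correlation idea above can be sketched in a few lines: pair each tool call's start event with its completion event via a shared `toolCallId`, which keeps spans correct even when calls interleave. The event field names used here (`type`, `toolCallId`, `tool`) are illustrative assumptions, not the actual waza event schema.

```python
# Sketch: correlate tool-call start/complete events by toolCallId.
# Field names are assumed for illustration, not waza's real schema.

def build_spans(events):
    open_spans = {}   # toolCallId -> span under construction
    spans = []
    for i, ev in enumerate(events):
        if ev["type"] == "tool_start":
            open_spans[ev["toolCallId"]] = {"tool": ev["tool"], "start": i}
        elif ev["type"] == "tool_complete":
            # Pop by id, so interleaved/nested calls resolve correctly.
            span = open_spans.pop(ev["toolCallId"])
            span["end"] = i
            spans.append(span)
    return spans

# Interleaved calls: bash starts first but edit finishes first.
events = [
    {"type": "tool_start", "toolCallId": "a", "tool": "bash"},
    {"type": "tool_start", "toolCallId": "b", "tool": "edit"},
    {"type": "tool_complete", "toolCallId": "b"},
    {"type": "tool_complete", "toolCallId": "a"},
]
print(build_spans(events))
# → [{'tool': 'edit', 'start': 1, 'end': 2}, {'tool': 'bash', 'start': 0, 'end': 3}]
```

Because completion is matched by id rather than by "most recent start", a span's bar covers exactly its own event range regardless of what ran in between.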

Side-by-side comparison of two or more runs:

  • Model Metrics — Pass rate, average duration, tool call efficiency
  • Task Comparison — Which model passed each task
  • Validator Performance — Per-validator scores across models
  • Statistical Analysis — Confidence intervals, effect sizes
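To make the statistics concrete, here is a hedged sketch of the kind of numbers the comparison view could report: a 95% Wilson confidence interval for a pass rate, and Cohen's h as an effect size between two models' pass rates. These are standard choices for proportions; the dashboard's actual formulas are not documented here.

```python
# Standard statistics for comparing pass rates (proportions).
# These formulas are common choices, not necessarily waza's exact ones.
import math

def wilson_interval(passes, n, z=1.96):
    """95% Wilson score interval for a pass rate of passes/n."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

def cohens_h(p1, p2):
    """Effect size between two pass rates (arcsine-transformed difference)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

lo, hi = wilson_interval(8, 10)        # 8 of 10 tasks passed
print(round(lo, 3), round(hi, 3))      # wide interval: only 10 samples
print(round(cohens_h(0.8, 0.6), 3))    # 0.8 vs 0.6 pass rate across models
```

Note how wide the interval is at n = 10: small eval suites need many tasks (or repeated runs) before a pass-rate difference between models is meaningful.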

Historical metrics over time:

  • Pass Rate Trend — Model performance trending
  • Duration Trend — Execution speed over time
  • Task Coverage — Which tasks are consistently tested
  • Model Adoption — Usage patterns across models

The dashboard auto-refreshes when new results are saved:

# Terminal 1: Serve dashboard
waza serve
# Terminal 2: Run evaluations
waza run eval.yaml -o results.json
# Dashboard refreshes automatically

Filter results by:

  • Status — Passed / Failed
  • Tags — Task tags (e.g., “basic”, “edge-case”)
  • Date Range — Last 7 days, month, all-time
  • Model — Specific model only
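The same filters can be applied offline to an exported results file. This sketch assumes each task record carries `passed` and `tags` fields; `tags` is an assumption based on the filter list above, not a documented field of the result structure.

```python
# Offline equivalent of the dashboard's status/tag filters.
# The "tags" field on tasks is assumed for illustration.

def filter_tasks(tasks, status=None, tag=None):
    out = tasks
    if status is not None:            # True = passed, False = failed
        out = [t for t in out if t["passed"] == status]
    if tag is not None:
        out = [t for t in out if tag in t.get("tags", [])]
    return out

tasks = [
    {"id": "basic-001", "passed": True,  "tags": ["basic"]},
    {"id": "edge-002",  "passed": False, "tags": ["edge-case"]},
]
print([t["id"] for t in filter_tasks(tasks, status=False)])  # → ['edge-002']
print([t["id"] for t in filter_tasks(tasks, tag="basic")])   # → ['basic-001']
```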

Full-text search across:

  • Task names and descriptions
  • Validator messages
  • Error messages
  • Transcripts

Export data in multiple formats:

  • JSON — Complete result structure
  • CSV — Task results for spreadsheet analysis
  • PDF — Formatted report for sharing
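For reference, a minimal sketch of what a CSV export could look like: one row per task, with columns mirroring the task fields in the result JSON. The exact columns waza emits are not specified here.

```python
# One-row-per-task CSV, using column names from the result structure.
# Assumed shape for illustration -- waza's actual CSV columns may differ.
import csv, io

def tasks_to_csv(tasks):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "name", "passed", "duration_ms"])
    writer.writeheader()
    for t in tasks:
        writer.writerow({k: t[k] for k in writer.fieldnames})
    return buf.getvalue()

tasks = [{"id": "basic-001", "name": "Basic Usage",
          "passed": True, "duration_ms": 5000}]
print(tasks_to_csv(tasks))
```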

Change the default port (3000):

waza serve --port 8080

Run as a JSON-RPC TCP server instead of HTTP:

waza serve --tcp :9000

Connect from other applications using the JSON-RPC 2.0 protocol.
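A client mostly needs to frame JSON-RPC 2.0 messages correctly. The sketch below shows the request/response framing only; the newline-delimited wire format and the method name `listRuns` are assumptions for illustration, not documented parts of the waza protocol.

```python
# JSON-RPC 2.0 framing for a TCP client. The "listRuns" method name and
# newline-delimited framing are illustrative assumptions.
import json

def make_request(method, params, req_id):
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params}) + "\n"

def parse_response(line):
    msg = json.loads(line)
    if "error" in msg:                     # JSON-RPC error object
        raise RuntimeError(msg["error"])
    return msg["result"]

req = make_request("listRuns", {}, 1)
# A real client would write `req` to the socket (e.g. localhost:9000)
# and read one line back. Simulated response for illustration:
print(parse_response('{"jsonrpc": "2.0", "id": 1, "result": ["run-001"]}'))
# → ['run-001']
```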

Use stdio for piping:

waza serve --stdio

Dashboard loads JSON results with this structure:

{
  "name": "code-explainer-eval",
  "model": "claude-sonnet-4.6",
  "timestamp": "2025-02-20T10:30:00Z",
  "pass_rate": 0.8,
  "duration_ms": 30000,
  "tasks": [
    {
      "id": "basic-001",
      "name": "Basic Usage",
      "passed": true,
      "duration_ms": 5000,
      "graders": [
        {
          "name": "checks_logic",
          "passed": true,
          "score": 1.0,
          "message": "All patterns matched"
        }
      ]
    }
  ]
}
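A quick sanity check on a results file is to recompute the pass rate from the task list and compare it to the stored `pass_rate`; a mismatch usually means the file was truncated or hand-edited. The field names below match the structure shown above.

```python
# Recompute pass_rate from the tasks array and compare to the stored value.
import json

results = json.loads("""
{"name": "code-explainer-eval", "pass_rate": 0.5,
 "tasks": [{"id": "a", "passed": true}, {"id": "b", "passed": false}]}
""")

computed = sum(t["passed"] for t in results["tasks"]) / len(results["tasks"])
print(computed == results["pass_rate"])  # → True
```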

Local iteration with dashboard:

# Terminal 1: Start dashboard
cd my-eval-suite
waza serve
# Terminal 2: Run evaluations
waza run code-explainer -o results.json
# Terminal 3 (optional): Monitor results
# Dashboard auto-refreshes, or manually refresh in browser

Comparison workflow:

# Run with multiple models
waza run eval.yaml --model gpt-4o -o gpt4.json
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json
# View in dashboard
# Select both results for side-by-side comparison

The dashboard works with GitHub Actions:

  1. Evaluation runs in CI generate results.json
  2. Results are uploaded as a workflow artifact
  3. Download artifact and open in dashboard:
# Download from GitHub
gh run download <run-id> -n results
# View in dashboard
waza serve

If the browser cannot connect, the dashboard is likely not running. Start it with waza serve.

If the default port is already taken, use a different one:

waza serve --port 8080

If results fail to load, ensure the JSON is valid:

jq . results.json