Web Dashboard

The waza dashboard provides an interactive web interface for exploring evaluation results, comparing models, and tracking metrics.

waza serve

This opens http://localhost:3000 in your browser automatically.

Summary of all evaluation runs:

  • Recent Runs — List of recent evaluations with pass rate and model
  • Pass Rate Trend — Historical pass rate over time
  • Model Comparison — Side-by-side performance metrics
  • Top Failing Tasks — Tasks with lowest pass rate

Detailed results for a single evaluation run:

  • Task List — All tasks with pass/fail status
  • Validator Results — Per-validator scores and messages
  • Execution Stats — Duration, token usage, tool calls
  • Trajectory Viewer — Two-mode trace inspection (see below)
  • Export — Download run as JSON

The trajectory viewer provides two ways to inspect agent execution:

Timeline view — An Aspire-inspired waterfall visualization. Each tool call renders as a horizontal bar spanning its event range, color-coded by status (green = pass, red = fail, yellow = pending). A summary header shows total spans and per-tool call counts (e.g., “bash × 4, edit × 2”). Click any span to open a detail sidebar with arguments, result, duration, and event range.

Events view — A linear transcript showing every event in order (turns, tool calls, errors, partial results). Each entry is expandable for full detail.

Toggle between views with the Timeline / Events buttons at the top of the panel.

The waterfall timeline arranges tool call spans on a horizontal axis proportional to event count. Features:

  • Status indicators — Green ✓, red ✗, yellow ⏳
  • Call indexing — Repeated tools show call number badges (bash #1, bash #2)
  • Span correlation — Start and complete events matched by toolCallId
  • Interleaved support — Handles concurrent or nested tool calls
  • Detail sidebar — Click a span to see arguments, result, and raw event data; press Escape to close
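The span-correlation idea above can be sketched in a few lines: pair each tool call's start event with its completion event via a shared `toolCallId`, which keeps spans correct even when calls interleave. The event field names used here (`type`, `toolCallId`, `tool`) are illustrative assumptions, not the actual waza event schema.

```python
# Sketch: correlate tool-call start/complete events by toolCallId.
# Field names are assumed for illustration, not waza's real schema.

def build_spans(events):
    open_spans = {}   # toolCallId -> span under construction
    spans = []
    for i, ev in enumerate(events):
        if ev["type"] == "tool_start":
            open_spans[ev["toolCallId"]] = {"tool": ev["tool"], "start": i}
        elif ev["type"] == "tool_complete":
            # Pop by id, so interleaved/nested calls resolve correctly.
            span = open_spans.pop(ev["toolCallId"])
            span["end"] = i
            spans.append(span)
    return spans

# Interleaved calls: bash starts first but edit finishes first.
events = [
    {"type": "tool_start", "toolCallId": "a", "tool": "bash"},
    {"type": "tool_start", "toolCallId": "b", "tool": "edit"},
    {"type": "tool_complete", "toolCallId": "b"},
    {"type": "tool_complete", "toolCallId": "a"},
]
print(build_spans(events))
# → [{'tool': 'edit', 'start': 1, 'end': 2}, {'tool': 'bash', 'start': 0, 'end': 3}]
```

Because completion is matched by id rather than by "most recent start", a span's bar covers exactly its own event range regardless of what ran in between.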

Side-by-side comparison of two or more runs:

  • Model Metrics — Pass rate, average duration, tool call efficiency
  • Task Comparison — Which model passed each task
  • Validator Performance — Per-validator scores across models
  • Statistical Analysis — Confidence intervals, effect sizes
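To make the statistics concrete, here is a hedged sketch of the kind of numbers the comparison view could report: a 95% Wilson confidence interval for a pass rate, and Cohen's h as an effect size between two models' pass rates. These are standard choices for proportions; the dashboard's actual formulas are not documented here.

```python
# Standard statistics for comparing pass rates (proportions).
# These formulas are common choices, not necessarily waza's exact ones.
import math

def wilson_interval(passes, n, z=1.96):
    """95% Wilson score interval for a pass rate of passes/n."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

def cohens_h(p1, p2):
    """Effect size between two pass rates (arcsine-transformed difference)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

lo, hi = wilson_interval(8, 10)        # 8 of 10 tasks passed
print(round(lo, 3), round(hi, 3))      # wide interval: only 10 samples
print(round(cohens_h(0.8, 0.6), 3))    # 0.8 vs 0.6 pass rate across models
```

Note how wide the interval is at n = 10: small eval suites need many tasks (or repeated runs) before a pass-rate difference between models is meaningful.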

Historical metrics over time:

  • Pass Rate Trend — Model performance trending
  • Duration Trend — Execution speed over time
  • Task Coverage — Which tasks are consistently tested
  • Model Adoption — Usage patterns across models

The dashboard auto-refreshes when new results are saved:

# Terminal 1: Serve dashboard
waza serve
# Terminal 2: Run evaluations
waza run eval.yaml -o results.json
# Dashboard refreshes automatically

Filter results by:

  • Status — Passed / Failed
  • Tags — Task tags (e.g., “basic”, “edge-case”)
  • Date Range — Last 7 days, month, all-time
  • Model — Specific model only
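The same filters can be applied offline to an exported results file. This sketch assumes each task record carries `passed` and `tags` fields; `tags` is an assumption based on the filter list above, not a documented field of the result structure.

```python
# Offline equivalent of the dashboard's status/tag filters.
# The "tags" field on tasks is assumed for illustration.

def filter_tasks(tasks, status=None, tag=None):
    out = tasks
    if status is not None:            # True = passed, False = failed
        out = [t for t in out if t["passed"] == status]
    if tag is not None:
        out = [t for t in out if tag in t.get("tags", [])]
    return out

tasks = [
    {"id": "basic-001", "passed": True,  "tags": ["basic"]},
    {"id": "edge-002",  "passed": False, "tags": ["edge-case"]},
]
print([t["id"] for t in filter_tasks(tasks, status=False)])  # → ['edge-002']
print([t["id"] for t in filter_tasks(tasks, tag="basic")])   # → ['basic-001']
```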

Full-text search across:

  • Task names and descriptions
  • Validator messages
  • Error messages
  • Transcripts

Export data in multiple formats:

  • JSON — Complete result structure
  • CSV — Task results for spreadsheet analysis
  • PDF — Formatted report for sharing
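For reference, a minimal sketch of what a CSV export could look like: one row per task, with columns mirroring the task fields in the result JSON. The exact columns waza emits are not specified here.

```python
# One-row-per-task CSV, using column names from the result structure.
# Assumed shape for illustration -- waza's actual CSV columns may differ.
import csv, io

def tasks_to_csv(tasks):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "name", "passed", "duration_ms"])
    writer.writeheader()
    for t in tasks:
        writer.writerow({k: t[k] for k in writer.fieldnames})
    return buf.getvalue()

tasks = [{"id": "basic-001", "name": "Basic Usage",
          "passed": True, "duration_ms": 5000}]
print(tasks_to_csv(tasks))
```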

Change the default port (3000):

waza serve --port 8080

Run as a JSON-RPC TCP server instead of HTTP:

waza serve --tcp :9000

Connect from other applications using the JSON-RPC 2.0 protocol.
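A client mostly needs to frame JSON-RPC 2.0 messages correctly. The sketch below shows the request/response framing only; the newline-delimited wire format and the method name `listRuns` are assumptions for illustration, not documented parts of the waza protocol.

```python
# JSON-RPC 2.0 framing for a TCP client. The "listRuns" method name and
# newline-delimited framing are illustrative assumptions.
import json

def make_request(method, params, req_id):
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params}) + "\n"

def parse_response(line):
    msg = json.loads(line)
    if "error" in msg:                     # JSON-RPC error object
        raise RuntimeError(msg["error"])
    return msg["result"]

req = make_request("listRuns", {}, 1)
# A real client would write `req` to the socket (e.g. localhost:9000)
# and read one line back. Simulated response for illustration:
print(parse_response('{"jsonrpc": "2.0", "id": 1, "result": ["run-001"]}'))
# → ['run-001']
```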

Use stdio for piping:

waza serve --stdio

Dashboard loads JSON results with this structure:

{
  "name": "code-explainer-eval",
  "model": "claude-sonnet-4.6",
  "timestamp": "2025-02-20T10:30:00Z",
  "pass_rate": 0.8,
  "duration_ms": 30000,
  "tasks": [
    {
      "id": "basic-001",
      "name": "Basic Usage",
      "passed": true,
      "duration_ms": 5000,
      "graders": [
        {
          "name": "checks_logic",
          "passed": true,
          "score": 1.0,
          "message": "All patterns matched"
        }
      ]
    }
  ]
}
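A quick sanity check on a results file is to recompute the pass rate from the task list and compare it to the stored `pass_rate`; a mismatch usually means the file was truncated or hand-edited. The field names below match the structure shown above.

```python
# Recompute pass_rate from the tasks array and compare to the stored value.
import json

results = json.loads("""
{"name": "code-explainer-eval", "pass_rate": 0.5,
 "tasks": [{"id": "a", "passed": true}, {"id": "b", "passed": false}]}
""")

computed = sum(t["passed"] for t in results["tasks"]) / len(results["tasks"])
print(computed == results["pass_rate"])  # → True
```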

Local iteration with dashboard:

# Terminal 1: Start dashboard
cd my-eval-suite
waza serve
# Terminal 2: Run evaluations
waza run code-explainer -o results.json
# Terminal 3 (optional): Monitor results
# Dashboard auto-refreshes, or manually refresh in browser

Comparison workflow:

# Run with multiple models
waza run eval.yaml --model gpt-4o -o gpt4.json
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json
# View in dashboard
# Select both results for side-by-side comparison

The dashboard works with GitHub Actions:

  1. Evaluation runs in CI generate results.json
  2. Results are uploaded as a workflow artifact
  3. Download artifact and open in dashboard:
# Download from GitHub
gh run download <run-id> -n results
# View in dashboard
waza serve

If the browser cannot connect, the dashboard is likely not running. Start it with waza serve.

If the default port is already taken, use a different one:

waza serve --port 8080

If results fail to load, ensure the JSON is valid:

jq . results.json