# Web Dashboard
The waza dashboard provides an interactive web interface for exploring evaluation results, comparing models, and tracking metrics.
## Starting the Dashboard

```bash
waza serve
```

Opens http://localhost:3000 in your browser automatically.
## Dashboard Overview

Summary of all evaluation runs:
- Recent Runs — List of recent evaluations with pass rate and model
- Pass Rate Trend — Historical pass rate over time
- Model Comparison — Side-by-side performance metrics
- Top Failing Tasks — Tasks with lowest pass rate
## Run Details

Detailed results for a single evaluation run:
- Task List — All tasks with pass/fail status
- Validator Results — Per-validator scores and messages
- Execution Stats — Duration, token usage, tool calls
- Trajectory Viewer — Two-mode trace inspection (see below)
- Export — Download run as JSON
## Trajectory Viewer

The trajectory viewer provides two ways to inspect agent execution:
Timeline view — An Aspire-inspired waterfall visualization. Each tool call renders as a horizontal bar spanning its event range, color-coded by status (green = pass, red = fail, yellow = pending). A summary header shows total spans and per-tool call counts (e.g., “bash × 4, edit × 2”). Click any span to open a detail sidebar with arguments, result, duration, and event range.
Events view — A linear transcript showing every event in order (turns, tool calls, errors, partial results). Each entry is expandable for full detail.
Toggle between views with the Timeline / Events buttons at the top of the panel.
## Waterfall Timeline

The waterfall timeline arranges tool call spans on a horizontal axis proportional to event count. Features:
- Status indicators — Green ✓, red ✗, yellow ⏳
- Call indexing — Repeated tools show call number badges (`bash #1`, `bash #2`)
- Span correlation — Start and complete events matched by `toolCallId`
- Interleaved support — Handles concurrent or nested tool calls
- Detail sidebar — Click a span to see arguments, result, and raw event data; press Escape to close
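The span-correlation step can be sketched as follows. This is a minimal illustration, not waza's implementation; the event field names (`type`, `toolCallId`, `tool`, `ok`) are assumptions about the trace shape, not waza's exact schema.

```python
def build_spans(events):
    """Pair tool-call start/complete events into spans by toolCallId,
    tolerating interleaved (concurrent) calls.
    Event field names here are illustrative assumptions."""
    open_spans = {}  # toolCallId -> partially built span
    spans = []
    for i, ev in enumerate(events):
        if ev["type"] == "tool_call_start":
            open_spans[ev["toolCallId"]] = {
                "tool": ev["tool"], "start": i, "end": None, "status": "pending",
            }
        elif ev["type"] == "tool_call_complete":
            span = open_spans.pop(ev["toolCallId"], None)
            if span is not None:
                span["end"] = i
                span["status"] = "pass" if ev.get("ok") else "fail"
                spans.append(span)
    # Anything never completed stays pending (rendered yellow).
    spans.extend(open_spans.values())
    return spans

events = [
    {"type": "tool_call_start", "toolCallId": "a", "tool": "bash"},
    {"type": "tool_call_start", "toolCallId": "b", "tool": "edit"},  # interleaved
    {"type": "tool_call_complete", "toolCallId": "a", "ok": True},
    {"type": "tool_call_complete", "toolCallId": "b", "ok": False},
]
print(build_spans(events))
```

Because matching is keyed by `toolCallId` rather than by position, the `bash` and `edit` calls above correlate correctly even though their events interleave.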
## Compare View

Side-by-side comparison of two or more runs:
- Model Metrics — Pass rate, average duration, tool call efficiency
- Task Comparison — Which model passed each task
- Validator Performance — Per-validator scores across models
- Statistical Analysis — Confidence intervals, effect sizes
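One standard way to put a confidence interval on a pass rate over a finite task set is the Wilson score interval — shown here as a sketch of the idea, not necessarily the method waza uses internally:

```python
import math

def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - half, center + half)

low, high = wilson_interval(16, 20)  # 80% pass rate over 20 tasks
print(f"pass rate 0.80, 95% CI [{low:.2f}, {high:.2f}]")
```

The interval is wide for small task counts, which is exactly why a 5-point pass-rate gap between two models on 20 tasks may not be meaningful.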
## Trends

Historical metrics over time:
- Pass Rate Trend — Model performance trending
- Duration Trend — Execution speed over time
- Task Coverage — Which tasks are consistently tested
- Model Adoption — Usage patterns across models
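The same trend numbers can be derived outside the dashboard from saved result files. A minimal sketch, assuming the top-level fields shown in the Result Format documented on this page and a hypothetical `results/` directory of run outputs:

```python
import glob
import json

def load_trend(pattern="results/*.json"):
    """Collect (timestamp, pass_rate, duration_ms) points from result files.
    Assumes the top-level result fields documented for waza runs."""
    points = []
    for path in glob.glob(pattern):
        with open(path) as f:
            run = json.load(f)
        points.append((run["timestamp"], run["pass_rate"], run["duration_ms"]))
    return sorted(points)  # ISO-8601 timestamps sort chronologically

for ts, rate, ms in load_trend():
    print(f"{ts}  pass_rate={rate:.2f}  duration={ms / 1000:.1f}s")
```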
## Features

### Live Updates

Dashboard auto-refreshes when new results are saved:
```bash
# Terminal 1: Serve dashboard
waza serve

# Terminal 2: Run evaluations
waza run eval.yaml -o results.json
# Dashboard refreshes automatically
```

### Filtering

Filter results by:
- Status — Passed / Failed
- Tags — Task tags (e.g., “basic”, “edge-case”)
- Date Range — Last 7 days, month, all-time
- Model — Specific model only
### Search

Full-text search across:
- Task names and descriptions
- Validator messages
- Error messages
- Transcripts
### Export

Export data in multiple formats:
- JSON — Complete result structure
- CSV — Task results for spreadsheet analysis
- PDF — Formatted report for sharing
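The CSV flattening can also be reproduced offline from a saved run. A sketch, assuming the JSON result structure documented on this page; the column names are illustrative, not waza's exact export schema:

```python
import csv
import json

def results_to_csv(results_path, csv_path):
    """Flatten each task in a waza result file to one CSV row.
    Column names here are illustrative, not waza's exact export schema."""
    with open(results_path) as f:
        run = json.load(f)
    with open(csv_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["run", "model", "task_id", "task_name", "passed", "duration_ms"])
        for task in run["tasks"]:
            w.writerow([run["name"], run["model"], task["id"],
                        task["name"], task["passed"], task["duration_ms"]])
```

Usage: `results_to_csv("results.json", "results.csv")`, then open the CSV in any spreadsheet tool.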
## Configuration

### Dashboard Port

Change the default port (3000):

```bash
waza serve --port 8080
```

### JSON-RPC Server

Run as a JSON-RPC TCP server instead of HTTP:
```bash
waza serve --tcp :9000
```

Connect from other applications using the JSON-RPC 2.0 protocol.
### Stdin/Stdout

Use stdio for piping:

```bash
waza serve --stdio
```

## Result Format

The dashboard loads JSON results with this structure:
```json
{
  "name": "code-explainer-eval",
  "model": "claude-sonnet-4.6",
  "timestamp": "2025-02-20T10:30:00Z",
  "pass_rate": 0.8,
  "duration_ms": 30000,
  "tasks": [
    {
      "id": "basic-001",
      "name": "Basic Usage",
      "passed": true,
      "duration_ms": 5000,
      "graders": [
        {
          "name": "checks_logic",
          "passed": true,
          "score": 1.0,
          "message": "All patterns matched"
        }
      ]
    }
  ]
}
```

## Workflow
Local iteration with the dashboard:
```bash
# Terminal 1: Start dashboard
cd my-eval-suite
waza serve

# Terminal 2: Run evaluations
waza run code-explainer -o results.json

# Terminal 3 (optional): Monitor results
# Dashboard auto-refreshes, or manually refresh in browser
```

Comparison workflow:
```bash
# Run with multiple models
waza run eval.yaml --model gpt-4o -o gpt4.json
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json

# View in dashboard
# Select both results for side-by-side comparison
```

## Integration with CI/CD
The dashboard works with GitHub Actions:

- Evaluation runs in CI generate `results.json`
- Results are uploaded as a workflow artifact
- Download the artifact and open it in the dashboard:
```bash
# Download from GitHub
gh run download <run-id> -n results

# View in dashboard
waza serve
```

## Troubleshooting
### "Connection refused"

The dashboard is not running. Start it with `waza serve`.
### Port already in use

Use a different port:

```bash
waza serve --port 8080
```

### Results not loading

Ensure the JSON is valid:

```bash
jq . results.json
```

## Next Steps

- CLI Reference — All commands
- YAML Schema — `eval.yaml` format
- GitHub Repository — Source and examples