Explore the Dashboard
The waza dashboard gives you a complete picture of your AI agent skill evaluations. Launch it with a single command and explore every view described below.
```sh
waza serve --results-dir ./results
```
Eval Runs Overview
The landing page shows all evaluation runs at a glance. Six KPI cards summarize your total runs, tasks, pass rate, token usage, cost, and duration. Below the cards, a sortable table lists every run with its spec, model, pass rate, weighted score, task count, tokens, cost, duration, and relative timestamp.
- KPI cards — Total Runs, Total Tasks, Pass Rate, Avg Tokens, Avg Cost, and Avg Duration update in real time as results load
- Sortable columns — click any column header (Spec, Model, Tasks, Tokens, Cost, Duration, When) to sort ascending or descending
- Judge model indicator — a ⚖ icon next to the model name indicates a separate judge model was used for grading
- Export CSV — download the full runs table as a CSV file for offline analysis
- Click any row to drill into that run’s detail view
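The exported CSV can be post-processed with standard tooling. Below is a minimal sketch using Python's standard library; the column names (`spec`, `model`, `pass_rate`) are assumptions for illustration and may differ from the dashboard's actual export headers:

```python
import csv
import io

# Sample rows standing in for an exported runs CSV; the column names
# here are assumptions, not the dashboard's documented schema.
sample = """spec,model,pass_rate,total_cost
code-review,gpt-4o,0.85,0.42
code-review,claude-sonnet,0.90,0.38
refactor,gpt-4o,0.70,0.55
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Average pass rate per model across all runs
by_model = {}
for row in rows:
    by_model.setdefault(row["model"], []).append(float(row["pass_rate"]))

averages = {m: sum(v) / len(v) for m, v in by_model.items()}
print(averages)
```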
Run Detail — Tasks
Clicking a run opens its detail page. The Tasks tab shows every task in the evaluation with its outcome, raw score, weighted score, and duration. Expand any row to see individual grader results.
- Outcome badges — green `pass` and red `fail` badges make it easy to spot problems
- Score vs. W. Score — the raw grader score and the weighted aggregate score are shown side by side
- Statistical confidence — tasks with weighted scores display significance badges (✓ significant / ⚠ not significant) and confidence interval ranges
- Export CSV — export task-level results for the current run
Run Detail — Trajectory
Switch to the Trajectory tab to see the agent’s execution path. The task selector lists every task with its pass/fail badge.
- Task selector — every task listed as a clickable button with its outcome badge
- Color-coded badges — green for pass, red for fail
- Quick navigation — click any task to load its full trajectory
Waterfall Timeline
After selecting a task, the waterfall timeline shows a session digest and execution spans. Each span represents a tool call the agent made.
- Session digest — Turns, Tool Calls, Tokens In/Out/Total, and Tools Used at a glance
- Trace header — total span count and tool call breakdown (e.g., `read_file × 2`, `create_file × 4`, `bash × 1`)
- Span bars — teal/emerald bars sized proportionally to duration, labeled with tool names
- Timeline axis — seconds scale showing relative timing of each call
Span Detail Panel
Click any span bar to open the detail panel with full tool call information.
- Tool name — which tool was called (bash, create_file, read_file, etc.)
- Status badge — Passed/Failed indicator
- Attributes — Duration, event count, event range, call ID
- Arguments — the exact arguments passed to the tool (JSON formatted)
- Result — expandable section with the tool’s return value
Events View
Toggle from Timeline to Events to see a chronological list of every event in the agent session.
- Assistant turns — the agent’s reasoning text before and after tool calls
- Tool start/complete pairs — each tool call shown as start → complete with tool name and call ID
- Show details — expand any event to see its full content
- Expanded detail — full text of assistant reasoning or tool arguments/results
- Collapsible — click “Hide details” to collapse back
Task Grader Expansion
Back on the Tasks tab, click any task row to expand and see individual grader results.
- Per-grader rows — each grader shows name, type, pass/fail, score, and weight
- Weight display — grader weights shown as `×N` multiplier badges
- Weighted score — composite W. Score computed from individual grader weights
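The composite W. Score behaves like a weighted average of the individual grader scores. The exact aggregation waza applies isn't specified on this page, so treat the following as an illustrative sketch rather than the implementation:

```python
# Hypothetical per-grader results: (name, score in [0, 1], weight)
graders = [
    ("tests_pass", 1.0, 3),   # ×3 weight
    ("style_check", 0.5, 1),  # ×1 weight
    ("llm_judge", 0.8, 2),    # ×2 weight
]

# Weighted average: sum of (score × weight) over the total weight
total_weight = sum(w for _, _, w in graders)
weighted_score = sum(s * w for _, s, w in graders) / total_weight
print(f"W. Score: {weighted_score:.3f}")
```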
Compare Runs
The Compare view lets you select any two runs and see their differences side by side. Select runs from the dropdowns and the comparison appears instantly.
- Run cards — each selected run shows its spec, model, and timestamp
- Metrics comparison — Pass Rate, Tokens, Cost, and Duration are compared with delta indicators (↑ increase in red, ↓ decrease in green)
- Pass rate bars — horizontal bar chart visually compares pass rates
- Per-task comparison — a table shows each task’s outcome, score, and duration for both runs, with Δ Score and Δ Duration columns highlighting differences
- Click any row to view a trajectory diff between the two runs
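The Δ Score and Δ Duration columns amount to simple per-task differences between the two selected runs. A sketch with hypothetical task records (field names are assumptions for illustration):

```python
# Hypothetical task-level results for two runs, keyed by task id
run_a = {"fix-bug": {"score": 0.70, "duration_s": 42.0},
         "add-test": {"score": 0.90, "duration_s": 30.0}}
run_b = {"fix-bug": {"score": 0.85, "duration_s": 38.0},
         "add-test": {"score": 0.90, "duration_s": 35.0}}

# Delta = run B minus run A, for every task present in both runs
deltas = {
    task: {
        "d_score": run_b[task]["score"] - run_a[task]["score"],
        "d_duration_s": run_b[task]["duration_s"] - run_a[task]["duration_s"],
    }
    for task in run_a.keys() & run_b.keys()
}
print(deltas)
```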
Trends
The Trends page charts your evaluation metrics over time. Four line charts show Pass Rate, Tokens per Run, Cost per Run, and Duration per Run across all runs, ordered chronologically.
- Pass Rate — track whether your agent skills are improving or regressing
- Tokens per Run — monitor token consumption trends to catch runaway prompts
- Cost per Run — visualize spending patterns across evaluation runs
- Duration per Run — spot performance regressions in execution time
- Model filter — use the “Model” dropdown (top-right) to filter charts to a specific model (e.g., only gpt-4o runs)
Live Monitoring
The Live view connects to the waza server via WebSocket to show real-time evaluation progress. When no evaluation is running, it shows a disconnected state with instructions to start one.
- WebSocket status — a red “Disconnected” badge (top-right) shows the current connection state; it turns green when an evaluation is running
- Start prompt — when idle, the view shows a `waza run` command hint to start a new evaluation
- Real-time updates — during an active run, tasks and grader results stream in as they complete
```sh
# Start a live evaluation and watch it in the dashboard
waza serve --results-dir ./results &
waza run eval.yaml --context-dir ./fixtures --live
```
Weighted Scores
The W. Score column in the Tasks table shows the weighted aggregate score for each task. When graders have different weights configured in the eval YAML, the weighted score reflects their relative importance.
- W. Score column — appears in both the Runs overview and Run Detail views
- Per-grader weights — configure weights in your eval YAML under each grader’s `weight` field
- Dash (—) for runs without weighted scoring — the column shows a dash when weights are not defined
Statistical Confidence
Tasks with weighted scores display statistical significance indicators. These help you determine whether score differences between models or runs are meaningful or just noise.
- ✓ significant (green) — the score is statistically significant with tight confidence intervals (e.g., [82.0%, 85.0%])
- ⚠ not significant (yellow) — the score has wide confidence intervals (e.g., [45.0%, 90.0%]), meaning more data is needed
- Confidence intervals — shown as [lower%, upper%] ranges below the weighted score
- Actionable insight — significant scores validate your skill’s behavior; non-significant scores suggest you need more test cases or the grader criteria need refinement
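The interval method the dashboard uses isn't documented on this page; a Wilson score interval over n pass/fail trials is one standard way to produce such [lower%, upper%] ranges, and it illustrates why few trials yield a wide interval. A minimal sketch:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    Note: this is an illustrative stand-in, not waza's documented method.
    """
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)

# Few trials -> wide interval (low confidence in the score)
low, high = wilson_interval(8, 10)
# Many trials at the same pass rate -> much tighter interval
low2, high2 = wilson_interval(800, 1000)
print(f"[{low:.1%}, {high:.1%}] vs [{low2:.1%}, {high2:.1%}]")
```

More tasks (or repeated runs) shrink the interval, which is exactly the "more data is needed" signal the ⚠ badge conveys.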