# CLI Commands
Complete reference for all waza CLI commands and their options.
## Installation

```bash
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
waza --version
```

## waza run

Run an evaluation benchmark.
```bash
waza run [eval.yaml | skill-name]
```

### Arguments

| Argument | Description |
|---|---|
| `[eval.yaml]` | Path to evaluation spec file |
| `[skill-name]` | Skill name (auto-detects eval.yaml) |
| (none) | Auto-detect using workspace detection |
### Flags

| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| `--context-dir` | `-c` | string | ./fixtures | Fixtures directory path |
| `--output` | `-o` | string | | Save results JSON to file |
| `--output-dir` | `-d` | string | | Save output artifacts to directory |
| `--verbose` | `-v` | bool | false | Detailed progress output |
| `--parallel` | | bool | false | Run tasks concurrently |
| `--workers` | `-w` | int | 4 | Number of concurrent workers |
| `--task` | `-t` | string | | Filter tasks by name (repeatable) |
| `--tags` | | string | | Filter tasks by tags (repeatable) |
| `--model` | `-m` | string | | Override model (repeatable) |
| `--judge-model` | | string | | Model for LLM-as-judge graders (overrides execution model) |
| `--cache` | | bool | false | Enable result caching |
| `--cache-dir` | | string | .waza-cache | Cache directory path |
| `--format` | `-f` | string | default | Output format: default, github-comment |
| `--reporter` | | string[] | | Output reporters: json, `junit:<path>` (repeatable) |
| `--timeout` | | int | 300 | Task timeout in seconds |
| `--baseline` | | bool | false | A/B testing mode: runs each task twice (without skill = baseline, with skill = normal) and computes improvement scores |
| `--discover` | | string | | Auto skill discovery: walks the directory tree for SKILL.md + eval.yaml (root/tests/evals) |
| `--strict` | | bool | false | Fail if any SKILL.md lacks eval coverage (use with `--discover`) |
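Most invocations point at an eval spec file. As a rough, hypothetical sketch of what `waza run` consumes (only the `graders` type/name/config shape is taken from the Graders section of this page; the top-level and task-level key names here are assumptions, not the authoritative schema, which is documented in YAML Schema):

```yaml
# Hypothetical eval.yaml sketch. Only the graders type/name/config shape
# is taken from this page; every other key name is an assumption.
name: code-explainer-eval
tasks:
  - name: basic-explanation
    prompt: "Explain what fixtures/example.py does."
graders:
  - type: regex
    name: mentions_function
    config:
      pattern: "def\\s+\\w+"
```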
### Examples

```bash
# Run all tasks
waza run eval.yaml -v

# Run specific skill
waza run code-explainer

# Specify fixtures directory
waza run eval.yaml -c ./fixtures -v

# Save results
waza run eval.yaml -o results.json

# Filter to specific tasks
waza run eval.yaml --task "basic*" --task "edge*"

# Multiple models (parallel)
waza run eval.yaml --model gpt-4o --model claude-sonnet-4.6

# Use a different judge model for LLM-as-judge graders
waza run eval.yaml --model gpt-4o --judge-model claude-opus-4.6

# Parallel execution with 8 workers
waza run eval.yaml --parallel --workers 8

# With caching
waza run eval.yaml --cache --cache-dir .waza-cache

# Generate JUnit XML for CI test reporting
waza run eval.yaml --reporter junit:results.xml

# A/B testing: baseline vs skill performance
waza run eval.yaml --baseline -o results.json
# Output includes improvement breakdown (quality, tokens, turns, time, completion)

# Auto skill discovery
waza run --discover ./skills/

# Auto discovery with strict mode (fail if any SKILL.md lacks eval coverage)
waza run --discover --strict ./skills/
```

## waza init

Initialize a new waza project.
```bash
waza init [directory]
```

### Arguments

| Argument | Description |
|---|---|
| `[directory]` | Project directory (default: current dir) |
### Flags

| Flag | Description |
|---|---|
| `--no-skill` | Skip first skill creation prompt |
### Creates

```text
project-root/
├── skills/
├── evals/
├── .github/workflows/eval.yml
├── .gitignore
└── README.md
```

### Examples

```bash
waza init my-project
waza init my-project --no-skill
```

## waza new

Create a new skill.
```bash
waza new [skill-name]
```

Project mode (inside a skills/ directory):

```bash
cd my-skills-repo
waza new code-explainer
# Creates skills/code-explainer/SKILL.md + evals/code-explainer/
```

Standalone mode (no skills/ directory):

```bash
waza new my-skill
# Creates my-skill/ with all files
```

### Flags

| Flag | Description |
|---|---|
| `--template` | Template pack (coming soon) |
| `--interactive` | Interactive metadata collection |
### Examples

```bash
# Interactive wizard
waza new code-analyzer --interactive

# Non-interactive (CI/CD)
waza new code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF
```

## waza check

Validate skill compliance and readiness.
```bash
waza check [skill-name | skill-path]
```

### Arguments

| Argument | Description |
|---|---|
| `[skill-name]` | Skill name (e.g., code-explainer) |
| `[skill-path]` | Path to skill directory |
| (none) | Check all skills in workspace |
### Flags

| Flag | Description |
|---|---|
| `--verbose` | Detailed compliance report |
| `--format` | Output format: text (default), json |
### Output

```text
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: code-explainer

📋 Compliance Score: High
✅ Excellent! Your skill meets all requirements.

📊 Token Budget: 420 / 500 tokens
✅ Within budget.

🧪 Evaluation Suite: Found
✅ eval.yaml detected.

✅ Your skill is ready for submission!
```

### Examples

```bash
waza check code-explainer
waza check ./skills/code-explainer
waza check --verbose
```

## waza compare

Compare evaluation results across models.
```bash
waza compare [results-1.json] [results-2.json] ...
```

### Arguments

| Argument | Description |
|---|---|
| `[results-N.json]` | Result files to compare (2+ required) |
### Flags

| Flag | Description |
|---|---|
| `--format` | Output format: table (default), json |
### Examples

```bash
waza compare gpt4.json sonnet.json
waza compare gpt4.json sonnet.json opus.json
waza compare results-*.json --format json
```

## waza suggest

Generate suggested eval artifacts from a skill's SKILL.md using an LLM.
```bash
waza suggest <skill-path>
```

### Flags

| Flag | Description |
|---|---|
| `--model` | Model to use for suggestions (default: project default model) |
| `--dry-run` | Print suggestions to stdout (default) |
| `--apply` | Write suggested files to disk |
| `--output-dir` | Output directory (default: `<skill-path>/evals`) |
| `--format` | Output format: yaml (default), json |
### Examples

```bash
# Preview suggestions
waza suggest skills/code-explainer --dry-run

# Write eval/task/fixture files
waza suggest skills/code-explainer --apply

# JSON output
waza suggest skills/code-explainer --format json
```

## waza tokens

Token budget management.
### waza tokens count

Count tokens in skill files.

```bash
waza tokens count [path]
waza tokens count skills/code-explainer/SKILL.md
waza tokens count skills/
```

### waza tokens check
Check token usage against budget.

```bash
waza tokens check [skill-name]
waza tokens check code-explainer
# Output:
# code-explainer: 420 / 500 tokens (84%)
# ✅ Within budget
```

### waza tokens profile

Structural analysis of SKILL.md files: reports token count, section count, code block count, and workflow step detection.
```bash
waza tokens profile [skill-name | path]
```

Flags: `--format text|json`, `--tokenizer bpe|estimate`

Example:

```text
📊 my-skill: 1,722 tokens (detailed ✓), 8 sections, 4 code blocks
⚠️ no workflow steps detected
```

Warnings are emitted for: no workflow steps, >2,500 tokens, fewer than 3 sections.
### waza tokens suggest

Get optimization suggestions.

```bash
waza tokens suggest [skill-name]
```

Analyzes SKILL.md and suggests:
- Sections to shorten
- Removable content
- Restructuring opportunities
## waza results

Manage evaluation results from cloud or local storage.

### waza results list

List evaluation runs from configured cloud storage.
```bash
waza results list
waza results list --limit 20
waza results list --format json
```

### Flags

| Flag | Description |
|---|---|
| `--limit <n>` | Maximum results to display (default: 10) |
| `--format` | Output format: table or json (default: table) |
### waza results compare

Compare two evaluation runs side by side.

```bash
waza results compare run-id-1 run-id-2
waza results compare run-id-1 run-id-2 --format json
```

### Flags

| Flag | Description |
|---|---|
| `--format` | Output format: table or json (default: table) |
## waza dev

Improve skill compliance iteratively.

```bash
waza dev [skill-name | skill-path]
```

### Flags

| Flag | Description |
|---|---|
| `--target` | Target level: low, medium, high |
| `--max-iterations` | Max improvement loops (default: 5) |
| `--auto` | Auto-apply without prompting |
| `--fast` | Skip integration tests |
### Workflow

```bash
waza dev code-explainer --target high --auto
```

Iteratively:
- Scores current compliance
- Identifies issues
- Suggests improvements
- Applies changes
- Re-scores
- Repeats until target reached
## waza serve

Start the interactive dashboard.

```bash
waza serve
```

### Flags

| Flag | Description |
|---|---|
| `--port` | Port (default: 3000) |
| `--tcp` | TCP address for JSON-RPC (e.g., :9000) |
| `--stdio` | Use stdin/stdout for piping |
### Examples

```bash
waza serve              # http://localhost:3000
waza serve --port 8080  # http://localhost:8080
waza serve --tcp :9000  # JSON-RPC TCP server
```

## Graders

Waza supports multiple grader types for comprehensive evaluation. See the complete Grader Reference for detailed documentation.
### Built-in Graders

| Grader | Purpose |
|---|---|
| `code` | Python/JavaScript assertion-based validation |
| `regex` | Pattern matching in output |
| `file` | File existence and content validation |
| `diff` | Workspace file comparison with snapshots |
| `behavior` | Agent behavior constraints (tool calls, tokens, duration) |
| `action_sequence` | Tool call sequence validation with F1 scoring |
| `skill_invocation` | Skill orchestration sequence validation |
| `prompt` | LLM-as-judge evaluation with rubrics |
| `tool_constraint` | Validate tool usage constraints (e.g., required/forbidden tools, argument patterns) |
| `trigger_tests` | Prompt trigger accuracy detection |
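All graders share the `type`/`name`/`config` shape shown in the sections that follow. As a hedged sketch of what a `regex` plus `behavior` combination might look like (the key names inside `config` are assumptions; consult the Grader Reference for the actual fields):

```yaml
# Hypothetical configs: only the type/name/config structure is taken from
# this page; the keys inside config are assumptions.
graders:
  - type: regex
    name: cites_line_numbers
    config:
      pattern: "line\\s+\\d+"   # hypothetical key: pattern to find in output
  - type: behavior
    name: stays_within_budget
    config:
      max_tool_calls: 10        # hypothetical keys: agent behavior limits
      max_tokens: 4000
```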
### tool_constraint Grader

Validate agent tool usage constraints during evaluation.

```yaml
graders:
  - type: tool_constraint
    name: check_tools
    config:
      expect_tools:
        - tool: "bash"                  # Required tool call
          command_pattern: "azd\\s+up"  # Optional regex on the command argument
        - tool: "skill"
          skill_pattern: "my-skill"     # Optional regex on the skill argument
        - tool: "edit"
          path_pattern: "\\.go$"        # Optional regex on the path argument
      reject_tools:
        - tool: "bash"                  # Prohibited when args match this pattern
          command_pattern: "rm\\s+-rf"  # Optional regex on the command argument
        - tool: "create_file"           # Always prohibited
```

All config fields are optional. Omitted fields skip that constraint.
### prompt Grader with Pairwise Mode

Use the prompt grader for LLM-as-judge evaluation. In pairwise mode, compare two approaches side-by-side to reduce position bias.

```yaml
graders:
  - type: prompt
    name: code_quality_judge
    config:
      mode: pairwise   # Enable pairwise comparison (requires --baseline flag)
      rubric: "Compare these solutions for code quality and correctness"
      max_tokens: 500
```

Requirements:

- Pairwise mode requires the `--baseline` flag on `waza run`
- Baseline execution must complete before pairwise comparison runs
- Each task is evaluated twice: once without the skill (baseline) and once with it (treatment)
Example:
```bash
waza run eval.yaml --baseline -o results.json
# Output includes pairwise judge scores comparing baseline vs treatment approaches
```

## Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | One or more tasks failed |
| 2 | Configuration or runtime error |
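Because the exit code distinguishes task failures from configuration errors, `waza run` can gate CI directly. A sketch of a GitHub Actions job under that assumption (this layout is illustrative, not the eval.yml that `waza init` scaffolds):

```yaml
# Illustrative CI job; not the exact workflow waza init generates.
name: evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install waza
        run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
      - name: Run evals
        # Any non-zero exit (1 = task failures, 2 = config/runtime error) fails the job
        run: waza run eval.yaml --reporter junit:results.xml
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```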
## Global Flags

| Flag | Description |
|---|---|
| `--help` | Show help |
| `--version` | Show version |
| `--verbose` | Enable debug output |
## Environment Variables

| Variable | Description |
|---|---|
| `GITHUB_TOKEN` | Token for Copilot SDK execution |
| `WAZA_HOME` | Config directory (default: ~/.waza) |
| `WAZA_CACHE` | Cache directory (default: .waza-cache) |
## Next Steps

- Writing Eval Specs: create benchmarks
- YAML Schema: eval.yaml format
- GitHub Repository: source code