
# CLI Commands

Complete reference for all waza CLI commands and their options.

## Installation

```sh
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
waza --version
```

## waza run

Run an evaluation benchmark.

```sh
waza run [eval.yaml | skill-name]
```
| Argument | Description |
| --- | --- |
| `[eval.yaml]` | Path to evaluation spec file |
| `[skill-name]` | Skill name (auto-detects `eval.yaml`) |
| (none) | Auto-detect using workspace detection |
| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--context-dir` | `-c` | string | `./fixtures` | Fixtures directory path |
| `--output` | `-o` | string | | Save results JSON to file |
| `--output-dir` | `-d` | string | | Save output artifacts to directory |
| `--verbose` | `-v` | bool | `false` | Detailed progress output |
| `--parallel` | | bool | `false` | Run tasks concurrently |
| `--workers` | `-w` | int | `4` | Number of concurrent workers |
| `--task` | `-t` | string | | Filter tasks by name (repeatable) |
| `--tags` | | string | | Filter tasks by tags (repeatable) |
| `--model` | `-m` | string | | Override model (repeatable) |
| `--judge-model` | | string | | Model for LLM-as-judge graders (overrides execution model) |
| `--cache` | | bool | `false` | Enable result caching |
| `--cache-dir` | | string | `.waza-cache` | Cache directory path |
| `--format` | `-f` | string | `default` | Output format: `default`, `github-comment` |
| `--reporter` | | string | `[]` | Output reporters: `json`, `junit:<path>` (repeatable) |
| `--timeout` | | int | `300` | Task timeout in seconds |
| `--baseline` | | bool | `false` | A/B testing mode: runs each task twice (without skill = baseline, with skill = normal) and computes improvement scores |
| `--discover` | | string | | Auto skill discovery: walks the directory tree for `SKILL.md` + `eval.yaml` (root/`tests`/`evals`) |
| `--strict` | | bool | `false` | Fail if any `SKILL.md` lacks eval coverage (use with `--discover`) |
```sh
# Run all tasks
waza run eval.yaml -v

# Run specific skill
waza run code-explainer

# Specify fixtures directory
waza run eval.yaml -c ./fixtures -v

# Save results
waza run eval.yaml -o results.json

# Filter to specific tasks
waza run eval.yaml --task "basic*" --task "edge*"

# Multiple models (parallel)
waza run eval.yaml --model gpt-4o --model claude-sonnet-4.6

# Use a different judge model for LLM-as-judge graders
waza run eval.yaml --model gpt-4o --judge-model claude-opus-4.6

# Parallel execution with 8 workers
waza run eval.yaml --parallel --workers 8

# With caching
waza run eval.yaml --cache --cache-dir .waza-cache

# Generate JUnit XML for CI test reporting
waza run eval.yaml --reporter junit:results.xml

# A/B testing: baseline vs skill performance
waza run eval.yaml --baseline -o results.json
# Output includes improvement breakdown (quality, tokens, turns, time, completion)

# Auto skill discovery
waza run --discover ./skills/

# Auto discovery with strict mode (fail if any SKILL.md lacks eval coverage)
waza run --discover --strict ./skills/
```
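The `--discover` walk can be pictured as follows. This is a minimal illustrative sketch, not waza's implementation: it assumes discovery means finding directories that contain a `SKILL.md` plus an `eval.yaml` in the directory root or its `tests/` or `evals/` subdirectory.

```python
import os

# Hypothetical sketch of skill discovery: a skill directory holds SKILL.md
# plus an eval spec in one of the documented locations (root/tests/evals).
EVAL_LOCATIONS = ("eval.yaml", "tests/eval.yaml", "evals/eval.yaml")

def discover_skills(root):
    """Yield (skill_dir, eval_path) pairs found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "SKILL.md" not in filenames:
            continue
        for rel in EVAL_LOCATIONS:
            candidate = os.path.join(dirpath, rel)
            if os.path.isfile(candidate):
                yield dirpath, candidate
                break  # first matching location wins
```

With `--strict`, a directory containing `SKILL.md` but no eval spec would instead be reported as a failure rather than skipped.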

## waza init

Initialize a new waza project.

```sh
waza init [directory]
```
| Argument | Description |
| --- | --- |
| `[directory]` | Project directory (default: current dir) |
| Flag | Description |
| --- | --- |
| `--no-skill` | Skip first skill creation prompt |
```
project-root/
├── skills/
├── evals/
├── .github/workflows/eval.yml
├── .gitignore
└── README.md
```
```sh
waza init my-project
waza init my-project --no-skill
```

## waza new

Create a new skill.

```sh
waza new [skill-name]
```

Project mode (inside a skills/ directory):

```sh
cd my-skills-repo
waza new code-explainer
# Creates skills/code-explainer/SKILL.md + evals/code-explainer/
```

Standalone mode (no skills/ directory):

```sh
waza new my-skill
# Creates my-skill/ with all files
```
| Flag | Description |
| --- | --- |
| `--template` | Template pack (coming soon) |
| `--interactive` | Interactive metadata collection |
```sh
# Interactive wizard
waza new code-analyzer --interactive

# Non-interactive (CI/CD)
waza new code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF
```

## waza check

Validate skill compliance and readiness.

```sh
waza check [skill-name | skill-path]
```
| Argument | Description |
| --- | --- |
| `[skill-name]` | Skill name (e.g., `code-explainer`) |
| `[skill-path]` | Path to skill directory |
| (none) | Check all skills in workspace |
| Flag | Description |
| --- | --- |
| `--verbose` | Detailed compliance report |
| `--format` | Output format: `text` (default), `json` |
Example output:

```
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: code-explainer

📋 Compliance Score: High
✅ Excellent! Your skill meets all requirements.

📊 Token Budget: 420 / 500 tokens
✅ Within budget.

🧪 Evaluation Suite: Found
✅ eval.yaml detected.

✅ Your skill is ready for submission!
```
```sh
waza check code-explainer
waza check ./skills/code-explainer
waza check --verbose
```

## waza compare

Compare evaluation results across models.

```sh
waza compare [results-1.json] [results-2.json] ...
```
| Argument | Description |
| --- | --- |
| `[results-N.json]` | Result files to compare (2+ required) |
| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` (default), `json` |
```sh
waza compare gpt4.json sonnet.json
waza compare gpt4.json sonnet.json opus.json
waza compare results-*.json --format json
```

## waza suggest

Generate suggested eval artifacts from a skill's SKILL.md using an LLM.

```sh
waza suggest <skill-path>
```
| Flag | Description |
| --- | --- |
| `--model` | Model to use for suggestions (default: project default model) |
| `--dry-run` | Print suggestions to stdout (default) |
| `--apply` | Write suggested files to disk |
| `--output-dir` | Output directory (default: `<skill-path>/evals`) |
| `--format` | Output format: `yaml` (default), `json` |
```sh
# Preview suggestions
waza suggest skills/code-explainer --dry-run

# Write eval/task/fixture files
waza suggest skills/code-explainer --apply

# JSON output
waza suggest skills/code-explainer --format json
```

## waza tokens

Token budget management.

### waza tokens count

Count tokens in skill files.

```sh
waza tokens count [path]
```

```sh
waza tokens count skills/code-explainer/SKILL.md
waza tokens count skills/
```

### waza tokens check

Check token usage against budget.

```sh
waza tokens check [skill-name]
```

```sh
waza tokens check code-explainer
# Output:
# code-explainer: 420 / 500 tokens (84%)
# ✅ Within budget
```

### waza tokens profile

Structural analysis of SKILL.md files: reports token count, section count, code block count, and workflow step detection.

```sh
waza tokens profile [skill-name | path]
```

Flags: `--format text|json`, `--tokenizer bpe|estimate`

Example:

```
📊 my-skill: 1,722 tokens (detailed ✓), 8 sections, 4 code blocks
⚠️ no workflow steps detected
```

Warnings: no workflow steps, >2,500 tokens, fewer than 3 sections.
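The warning rules above can be expressed as a small check. This is a hypothetical sketch (the function name and inputs are illustrative, not waza's API), directly encoding the three documented thresholds:

```python
def profile_warnings(tokens, sections, has_workflow_steps):
    """Return warning strings per the documented profile thresholds."""
    warnings = []
    if not has_workflow_steps:
        warnings.append("no workflow steps detected")
    if tokens > 2500:
        warnings.append("over 2,500 tokens")
    if sections < 3:
        warnings.append("fewer than 3 sections")
    return warnings
```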

### waza tokens suggest

Get optimization suggestions.

```sh
waza tokens suggest [skill-name]
```

Analyzes SKILL.md and suggests:

- Sections to shorten
- Removable content
- Restructuring opportunities

## waza results

Manage evaluation results from cloud storage or local storage.

### waza results list

List evaluation runs from configured cloud storage.

```sh
waza results list
waza results list --limit 20
waza results list --format json
```
| Flag | Description |
| --- | --- |
| `--limit <n>` | Maximum results to display (default: 10) |
| `--format` | Output format: `table` or `json` (default: `table`) |

### waza results compare

Compare two evaluation runs side by side.

```sh
waza results compare run-id-1 run-id-2
waza results compare run-id-1 run-id-2 --format json
```
| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` or `json` (default: `table`) |

## waza dev

Improve skill compliance iteratively.

```sh
waza dev [skill-name | skill-path]
```
| Flag | Description |
| --- | --- |
| `--target` | Target level: `low`, `medium`, `high` |
| `--max-iterations` | Max improvement loops (default: 5) |
| `--auto` | Auto-apply without prompting |
| `--fast` | Skip integration tests |
```sh
waza dev code-explainer --target high --auto
```

Iteratively:

1. Scores current compliance
2. Identifies issues
3. Suggests improvements
4. Applies changes
5. Re-scores
6. Repeats until target reached
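The steps above can be sketched as a loop. This is an illustrative sketch only; the `score`, `suggest`, and `apply_changes` callables stand in for waza's internals:

```python
# Compliance levels in increasing order, mirroring the --target flag.
LEVELS = {"low": 1, "medium": 2, "high": 3}

def dev_loop(skill, target, max_iterations, score, suggest, apply_changes):
    """Iterate score -> suggest -> apply until the target level is reached."""
    for _ in range(max_iterations):
        level = score(skill)                  # 1. score current compliance
        if LEVELS[level] >= LEVELS[target]:
            return level                      # target reached, stop early
        fixes = suggest(skill)                # 2-3. identify issues, suggest fixes
        skill = apply_changes(skill, fixes)   # 4. apply changes
    return score(skill)                       # 5. final re-score after the loop
```

With `--auto`, step 4 would run without prompting; without it, each change would be confirmed interactively.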

## waza serve

Start the interactive dashboard.

```sh
waza serve
```
| Flag | Description |
| --- | --- |
| `--port` | Port (default: 3000) |
| `--tcp` | TCP address for JSON-RPC (e.g., `:9000`) |
| `--stdio` | Use stdin/stdout for piping |
```sh
waza serve             # http://localhost:3000
waza serve --port 8080 # http://localhost:8080
waza serve --tcp :9000 # JSON-RPC TCP server
```

## Graders

Waza supports multiple grader types for comprehensive evaluation. See the complete Grader Reference for detailed documentation.

| Grader | Purpose |
| --- | --- |
| `code` | Python/JavaScript assertion-based validation |
| `regex` | Pattern matching in output |
| `file` | File existence and content validation |
| `diff` | Workspace file comparison with snapshots |
| `behavior` | Agent behavior constraints (tool calls, tokens, duration) |
| `action_sequence` | Tool call sequence validation with F1 scoring |
| `skill_invocation` | Skill orchestration sequence validation |
| `prompt` | LLM-as-judge evaluation with rubrics |
| `tool_constraint` | Validate tool usage constraints (e.g., required/forbidden tools, argument patterns) |
| `trigger_tests` | Prompt trigger accuracy detection |
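The F1 scoring used by the `action_sequence` grader can be illustrated with standard precision/recall arithmetic. This is a simplified sketch that treats the two tool-call lists as multisets; waza's actual matching may also account for ordering:

```python
from collections import Counter

def sequence_f1(expected, actual):
    """F1 between expected and observed tool-call lists (multiset overlap)."""
    if not expected or not actual:
        return 0.0
    overlap = sum((Counter(expected) & Counter(actual)).values())
    precision = overlap / len(actual)    # fraction of observed calls that were expected
    recall = overlap / len(expected)     # fraction of expected calls that were observed
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```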

### tool_constraint

Validate agent tool usage constraints during evaluation.

```yaml
graders:
  - type: tool_constraint
    name: check_tools
    config:
      expect_tools:
        - tool: "bash"                   # Required tool call
          command_pattern: "azd\\s+up"   # Optional regex on the command argument
        - tool: "skill"
          skill_pattern: "my-skill"      # Optional regex on the skill argument
        - tool: "edit"
          path_pattern: "\\.go$"         # Optional regex on the path argument
      reject_tools:
        - tool: "bash"                   # Prohibited when args match this pattern
          command_pattern: "rm\\s+-rf"   # Optional regex on the command argument
        - tool: "create_file"            # Always prohibited
```

All config fields are optional. Omitted fields skip that constraint.
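These semantics can be sketched as a small checker. A hypothetical illustration, assuming tool calls arrive as `{tool, args}` records and each `*_pattern` field is applied with `re.search` to the argument of the same name:

```python
import re

def _matches(rule, call):
    """True if a call matches a rule's tool name and all its argument patterns."""
    if call["tool"] != rule["tool"]:
        return False
    for key, pattern in rule.items():
        if key.endswith("_pattern"):
            arg = key[: -len("_pattern")]   # e.g. command_pattern -> command
            if not re.search(pattern, call["args"].get(arg, "")):
                return False
    return True

def check_tool_constraints(config, calls):
    """Pass iff every expect_tools rule matched some call and no reject_tools rule did."""
    expected_ok = all(
        any(_matches(rule, call) for call in calls)
        for rule in config.get("expect_tools", [])   # omitted field: trivially satisfied
    )
    rejected = any(
        _matches(rule, call)
        for rule in config.get("reject_tools", [])
        for call in calls
    )
    return expected_ok and not rejected
```

A rule with no `*_pattern` fields (like `create_file` above) matches every call to that tool, which is why it reads as "always prohibited".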

### prompt (pairwise mode)

Use the `prompt` grader for LLM-as-judge evaluation. In pairwise mode, compare two approaches side by side to reduce position bias.

```yaml
graders:
  - type: prompt
    name: code_quality_judge
    config:
      mode: pairwise   # Enable pairwise comparison (requires --baseline flag)
      rubric: "Compare these solutions for code quality and correctness"
      max_tokens: 500
```

Requirements:

- Pairwise mode requires the `--baseline` flag on `waza run`
- Baseline execution must complete before pairwise comparison runs
- Each task is evaluated twice: once without the skill (baseline) and once with it (treatment)

Example:

```sh
waza run eval.yaml --baseline -o results.json
# Output includes pairwise judge scores comparing baseline vs treatment approaches
```
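A common way pairwise judging reduces position bias is to judge both presentation orders and average the result. This is a generic sketch of that idea, not waza's documented algorithm; `judge` stands in for an LLM call returning a 0-1 score for the first-listed answer:

```python
def pairwise_score(baseline, treatment, judge):
    """Average the judge's preference for `treatment` across both orders."""
    # judge(a, b) returns a 0-1 score that answer `a` is better than `b`.
    forward = judge(treatment, baseline)           # treatment shown first
    backward = 1.0 - judge(baseline, treatment)    # treatment shown second
    return (forward + backward) / 2
```

If the judge systematically favors whichever answer is listed first, the two orderings bias the score in opposite directions and the average partially cancels the effect.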
## Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | One or more tasks failed |
| 2 | Configuration or runtime error |
## Global Flags

| Flag | Description |
| --- | --- |
| `--help` | Show help |
| `--version` | Show version |
| `--verbose` | Enable debug output |
## Environment Variables

| Variable | Description |
| --- | --- |
| `GITHUB_TOKEN` | Token for Copilot SDK execution |
| `WAZA_HOME` | Config directory (default: `~/.waza`) |
| `WAZA_CACHE` | Cache directory (default: `.waza-cache`) |