
# CLI Commands

Complete reference for all waza CLI commands and their options.

## Installation

```sh
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
waza --version
```

## waza run

Run an evaluation benchmark.

```sh
waza run [eval.yaml | skill-name]
```
| Argument | Description |
| --- | --- |
| `[eval.yaml]` | Path to evaluation spec file |
| `[skill-name]` | Skill name (auto-detects `eval.yaml`) |
| (none) | Auto-detect using workspace detection |
| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--context-dir` | `-c` | string | `./fixtures` | Fixtures directory path |
| `--output` | `-o` | string | | Save results JSON to file |
| `--output-dir` | `-d` | string | | Save output artifacts to directory |
| `--verbose` | `-v` | bool | `false` | Detailed progress output |
| `--parallel` | | bool | `false` | Run tasks concurrently |
| `--workers` | `-w` | int | `4` | Number of concurrent workers |
| `--task` | `-t` | string | | Filter tasks by name (repeatable) |
| `--tags` | | string | | Filter tasks by tags (repeatable) |
| `--model` | `-m` | string | | Override model (repeatable) |
| `--judge-model` | | string | | Model for LLM-as-judge graders (overrides execution model) |
| `--cache` | | bool | `false` | Enable result caching |
| `--cache-dir` | | string | `.waza-cache` | Cache directory path |
| `--format` | `-f` | string | `default` | Output format: `default`, `github-comment` |
| `--reporter` | | string | `[]` | Output reporters: `json`, `junit:<path>` (repeatable) |
| `--timeout` | | int | `300` | Task timeout in seconds |
| `--baseline` | | bool | `false` | A/B testing mode: runs each task twice (without skill = baseline, with skill = normal) and computes improvement scores |
| `--discover` | | string | | Auto skill discovery: walks the directory tree for `SKILL.md` + `eval.yaml` (root/`tests`/`evals`) |
| `--strict` | | bool | `false` | Fail if any `SKILL.md` lacks eval coverage (use with `--discover`) |
```sh
# Run all tasks
waza run eval.yaml -v

# Run specific skill
waza run code-explainer

# Specify fixtures directory
waza run eval.yaml -c ./fixtures -v

# Save results
waza run eval.yaml -o results.json

# Filter to specific tasks
waza run eval.yaml --task "basic*" --task "edge*"

# Multiple models (parallel)
waza run eval.yaml --model gpt-4o --model claude-sonnet-4.6

# Use a different judge model for LLM-as-judge graders
waza run eval.yaml --model gpt-4o --judge-model claude-opus-4.6

# Parallel execution with 8 workers
waza run eval.yaml --parallel --workers 8

# With caching
waza run eval.yaml --cache --cache-dir .waza-cache

# Generate JUnit XML for CI test reporting
waza run eval.yaml --reporter junit:results.xml

# A/B testing: baseline vs skill performance
waza run eval.yaml --baseline -o results.json
# Output includes improvement breakdown (quality, tokens, turns, time, completion)

# Auto skill discovery
waza run --discover ./skills/

# Auto discovery with strict mode (fail if any SKILL.md lacks eval coverage)
waza run --discover --strict ./skills/
```
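The `--discover` walk can be pictured as follows. This is a minimal illustrative sketch, not waza's implementation: it assumes discovery means finding directories that contain a `SKILL.md` plus an `eval.yaml` in the directory root or its `tests/` or `evals/` subdirectory.

```python
import os

# Hypothetical sketch of skill discovery: a skill directory holds SKILL.md
# plus an eval spec in one of the documented locations (root/tests/evals).
EVAL_LOCATIONS = ("eval.yaml", "tests/eval.yaml", "evals/eval.yaml")

def discover_skills(root):
    """Yield (skill_dir, eval_path) pairs found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "SKILL.md" not in filenames:
            continue
        for rel in EVAL_LOCATIONS:
            candidate = os.path.join(dirpath, rel)
            if os.path.isfile(candidate):
                yield dirpath, candidate
                break  # first matching location wins
```

With `--strict`, a directory containing `SKILL.md` but no eval spec would instead be reported as a failure rather than skipped.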

## waza init

Initialize a new waza project.

```sh
waza init [directory]
```
| Argument | Description |
| --- | --- |
| `[directory]` | Project directory (default: current dir) |
| Flag | Description |
| --- | --- |
| `--no-skill` | Skip first skill creation prompt |
```
project-root/
├── skills/
├── evals/
├── .github/workflows/eval.yml
├── .gitignore
└── README.md
```
```sh
waza init my-project
waza init my-project --no-skill
```

## waza new

Create a new skill.

```sh
waza new [skill-name]
```

Project mode (inside a skills/ directory):

```sh
cd my-skills-repo
waza new code-explainer
# Creates skills/code-explainer/SKILL.md + evals/code-explainer/
```

Standalone mode (no skills/ directory):

```sh
waza new my-skill
# Creates my-skill/ with all files
```
| Flag | Description |
| --- | --- |
| `--template` | Template pack (coming soon) |
| `--interactive` | Interactive metadata collection |
```sh
# Interactive wizard
waza new code-analyzer --interactive

# Non-interactive (CI/CD)
waza new code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF
```

## waza check

Validate skill compliance and readiness.

```sh
waza check [skill-name | skill-path]
```
| Argument | Description |
| --- | --- |
| `[skill-name]` | Skill name (e.g., `code-explainer`) |
| `[skill-path]` | Path to skill directory |
| (none) | Check all skills in workspace |
| Flag | Description |
| --- | --- |
| `--verbose` | Detailed compliance report |
| `--format` | Output format: `text` (default), `json` |
Example output:

```
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: code-explainer

📋 Compliance Score: High
✅ Excellent! Your skill meets all requirements.

📊 Token Budget: 420 / 500 tokens
✅ Within budget.

🧪 Evaluation Suite: Found
✅ eval.yaml detected.

✅ Your skill is ready for submission!
```
```sh
waza check code-explainer
waza check ./skills/code-explainer
waza check --verbose
```

## waza compare

Compare evaluation results across models.

```sh
waza compare [results-1.json] [results-2.json] ...
```
| Argument | Description |
| --- | --- |
| `[results-N.json]` | Result files to compare (2+ required) |
| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` (default), `json` |
```sh
waza compare gpt4.json sonnet.json
waza compare gpt4.json sonnet.json opus.json
waza compare results-*.json --format json
```

## waza suggest

Generate suggested eval artifacts from a skill's SKILL.md using an LLM.

```sh
waza suggest <skill-path>
```
| Flag | Description |
| --- | --- |
| `--model` | Model to use for suggestions (default: project default model) |
| `--dry-run` | Print suggestions to stdout (default) |
| `--apply` | Write suggested files to disk |
| `--output-dir` | Output directory (default: `<skill-path>/evals`) |
| `--format` | Output format: `yaml` (default), `json` |
```sh
# Preview suggestions
waza suggest skills/code-explainer --dry-run

# Write eval/task/fixture files
waza suggest skills/code-explainer --apply

# JSON output
waza suggest skills/code-explainer --format json
```

## waza tokens

Token budget management.

### waza tokens count

Count tokens in skill files.

```sh
waza tokens count [path]
```

```sh
waza tokens count skills/code-explainer/SKILL.md
waza tokens count skills/
```

### waza tokens check

Check token usage against budget.

```sh
waza tokens check [skill-name]
```

```sh
waza tokens check code-explainer
# Output:
# code-explainer: 420 / 500 tokens (84%)
# ✅ Within budget
```

### waza tokens profile

Structural analysis of SKILL.md files: reports token count, section count, code block count, and workflow step detection.

```sh
waza tokens profile [skill-name | path]
```

Flags: `--format text|json`, `--tokenizer bpe|estimate`

Example:

```
📊 my-skill: 1,722 tokens (detailed ✓), 8 sections, 4 code blocks
⚠️ no workflow steps detected
```

Warnings: no workflow steps, >2,500 tokens, fewer than 3 sections.
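The warning rules above can be expressed as a small check. This is a hypothetical sketch (the function name and inputs are illustrative, not waza's API), directly encoding the three documented thresholds:

```python
def profile_warnings(tokens, sections, has_workflow_steps):
    """Return warning strings per the documented profile thresholds."""
    warnings = []
    if not has_workflow_steps:
        warnings.append("no workflow steps detected")
    if tokens > 2500:
        warnings.append("over 2,500 tokens")
    if sections < 3:
        warnings.append("fewer than 3 sections")
    return warnings
```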

### waza tokens suggest

Get optimization suggestions.

```sh
waza tokens suggest [skill-name]
```

Analyzes SKILL.md and suggests:

- Sections to shorten
- Removable content
- Restructuring opportunities

## waza results

Manage evaluation results from cloud storage or local storage.

### waza results list

List evaluation runs from configured cloud storage.

```sh
waza results list
waza results list --limit 20
waza results list --format json
```
| Flag | Description |
| --- | --- |
| `--limit <n>` | Maximum results to display (default: 10) |
| `--format` | Output format: `table` or `json` (default: `table`) |

### waza results compare

Compare two evaluation runs side by side.

```sh
waza results compare run-id-1 run-id-2
waza results compare run-id-1 run-id-2 --format json
```
| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` or `json` (default: `table`) |

## waza dev

Improve skill compliance iteratively.

```sh
waza dev [skill-name | skill-path]
```
| Flag | Description |
| --- | --- |
| `--target` | Target level: `low`, `medium`, `high` |
| `--max-iterations` | Max improvement loops (default: 5) |
| `--auto` | Auto-apply without prompting |
| `--fast` | Skip integration tests |
```sh
waza dev code-explainer --target high --auto
```

Iteratively:

1. Scores current compliance
2. Identifies issues
3. Suggests improvements
4. Applies changes
5. Re-scores
6. Repeats until target reached
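The steps above can be sketched as a loop. This is an illustrative sketch only; the `score`, `suggest`, and `apply_changes` callables stand in for waza's internals:

```python
# Compliance levels in increasing order, mirroring the --target flag.
LEVELS = {"low": 1, "medium": 2, "high": 3}

def dev_loop(skill, target, max_iterations, score, suggest, apply_changes):
    """Iterate score -> suggest -> apply until the target level is reached."""
    for _ in range(max_iterations):
        level = score(skill)                  # 1. score current compliance
        if LEVELS[level] >= LEVELS[target]:
            return level                      # target reached, stop early
        fixes = suggest(skill)                # 2-3. identify issues, suggest fixes
        skill = apply_changes(skill, fixes)   # 4. apply changes
    return score(skill)                       # 5. final re-score after the loop
```

With `--auto`, step 4 would run without prompting; without it, each change would be confirmed interactively.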

## waza serve

Start the interactive dashboard.

```sh
waza serve
```
| Flag | Description |
| --- | --- |
| `--port` | Port (default: 3000) |
| `--tcp` | TCP address for JSON-RPC (e.g., `:9000`) |
| `--stdio` | Use stdin/stdout for piping |
```sh
waza serve             # http://localhost:3000
waza serve --port 8080 # http://localhost:8080
waza serve --tcp :9000 # JSON-RPC TCP server
```

## Graders

Waza supports multiple grader types for comprehensive evaluation. See the complete Grader Reference for detailed documentation.

| Grader | Purpose |
| --- | --- |
| `code` | Python/JavaScript assertion-based validation |
| `regex` | Pattern matching in output |
| `file` | File existence and content validation |
| `diff` | Workspace file comparison with snapshots |
| `behavior` | Agent behavior constraints (tool calls, tokens, duration) |
| `action_sequence` | Tool call sequence validation with F1 scoring |
| `skill_invocation` | Skill orchestration sequence validation |
| `prompt` | LLM-as-judge evaluation with rubrics |
| `tool_constraint` | Validate tool usage constraints (e.g., required/forbidden tools, argument patterns) |
| `trigger_tests` | Prompt trigger accuracy detection |
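The F1 scoring used by the `action_sequence` grader can be illustrated with standard precision/recall arithmetic. This is a simplified sketch that treats the two tool-call lists as multisets; waza's actual matching may also account for ordering:

```python
from collections import Counter

def sequence_f1(expected, actual):
    """F1 between expected and observed tool-call lists (multiset overlap)."""
    if not expected or not actual:
        return 0.0
    overlap = sum((Counter(expected) & Counter(actual)).values())
    precision = overlap / len(actual)    # fraction of observed calls that were expected
    recall = overlap / len(expected)     # fraction of expected calls that were observed
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```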

### tool_constraint

Validate agent tool usage constraints during evaluation.

```yaml
graders:
  - type: tool_constraint
    name: check_tools
    config:
      expect_tools:
        - tool: "bash"                   # Required tool call
          command_pattern: "azd\\s+up"   # Optional regex on the command argument
        - tool: "skill"
          skill_pattern: "my-skill"      # Optional regex on the skill argument
        - tool: "edit"
          path_pattern: "\\.go$"         # Optional regex on the path argument
      reject_tools:
        - tool: "bash"                   # Prohibited when args match this pattern
          command_pattern: "rm\\s+-rf"   # Optional regex on the command argument
        - tool: "create_file"            # Always prohibited
```

All config fields are optional. Omitted fields skip that constraint.
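These semantics can be sketched as a small checker. A hypothetical illustration, assuming tool calls arrive as `{tool, args}` records and each `*_pattern` field is applied with `re.search` to the argument of the same name:

```python
import re

def _matches(rule, call):
    """True if a call matches a rule's tool name and all its argument patterns."""
    if call["tool"] != rule["tool"]:
        return False
    for key, pattern in rule.items():
        if key.endswith("_pattern"):
            arg = key[: -len("_pattern")]   # e.g. command_pattern -> command
            if not re.search(pattern, call["args"].get(arg, "")):
                return False
    return True

def check_tool_constraints(config, calls):
    """Pass iff every expect_tools rule matched some call and no reject_tools rule did."""
    expected_ok = all(
        any(_matches(rule, call) for call in calls)
        for rule in config.get("expect_tools", [])   # omitted field: trivially satisfied
    )
    rejected = any(
        _matches(rule, call)
        for rule in config.get("reject_tools", [])
        for call in calls
    )
    return expected_ok and not rejected
```

A rule with no `*_pattern` fields (like `create_file` above) matches every call to that tool, which is why it reads as "always prohibited".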

### prompt (pairwise mode)

Use the `prompt` grader for LLM-as-judge evaluation. In pairwise mode, compare two approaches side by side to reduce position bias.

```yaml
graders:
  - type: prompt
    name: code_quality_judge
    config:
      mode: pairwise   # Enable pairwise comparison (requires --baseline flag)
      rubric: "Compare these solutions for code quality and correctness"
      max_tokens: 500
```

Requirements:

- Pairwise mode requires the `--baseline` flag on `waza run`
- Baseline execution must complete before pairwise comparison runs
- Each task is evaluated twice: once without the skill (baseline) and once with it (treatment)

Example:

```sh
waza run eval.yaml --baseline -o results.json
# Output includes pairwise judge scores comparing baseline vs treatment approaches
```
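A common way pairwise judging reduces position bias is to judge both presentation orders and average the result. This is a generic sketch of that idea, not waza's documented algorithm; `judge` stands in for an LLM call returning a 0-1 score for the first-listed answer:

```python
def pairwise_score(baseline, treatment, judge):
    """Average the judge's preference for `treatment` across both orders."""
    # judge(a, b) returns a 0-1 score that answer `a` is better than `b`.
    forward = judge(treatment, baseline)           # treatment shown first
    backward = 1.0 - judge(baseline, treatment)    # treatment shown second
    return (forward + backward) / 2
```

If the judge systematically favors whichever answer is listed first, the two orderings bias the score in opposite directions and the average partially cancels the effect.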
## Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | One or more tasks failed |
| 2 | Configuration or runtime error |
## Global Flags

| Flag | Description |
| --- | --- |
| `--help` | Show help |
| `--version` | Show version |
| `--verbose` | Enable debug output |
## Environment Variables

| Variable | Description |
| --- | --- |
| `GITHUB_TOKEN` | Token for Copilot SDK execution |
| `WAZA_HOME` | Config directory (default: `~/.waza`) |
| `WAZA_CACHE` | Cache directory (default: `.waza-cache`) |