
# Writing Eval Specs

A complete reference for writing eval.yaml specifications and task definitions.

The evaluation spec defines the benchmark configuration, graders, and task files:

```yaml
name: code-explainer-eval
description: Evaluation suite for code-explainer skill
skill: code-explainer
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  model: claude-sonnet-4.6

graders:
  - type: text
    name: explains_concepts
    config:
      pattern: "(?i)(function|logic|parameter)"
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 100"

tasks:
  - "tasks/*.yaml"
```
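Once parsed, a spec like this is just nested maps and lists. As a minimal illustration (not part of waza), a top-level validator might look like the sketch below; treating `name` and `skill` as mandatory is an assumption for the example:

```python
# Illustrative validator for a parsed eval spec (not waza's own code).
# Field names follow the reference tables in this document.

REQUIRED_FIELDS = {"name", "skill"}  # assumption: minimal mandatory set

def validate_spec(spec: dict) -> list[str]:
    """Return a list of human-readable problems (empty list = valid)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - spec.keys())]
    config = spec.get("config", {})
    if config.get("parallel") and config.get("workers", 4) < 1:
        problems.append("workers must be >= 1 when parallel is true")
    if not spec.get("tasks") and not spec.get("tasks_from"):
        problems.append("spec defines no tasks (need tasks or tasks_from)")
    return problems

spec = {
    "name": "code-explainer-eval",
    "skill": "code-explainer",
    "config": {"trials_per_task": 1, "model": "claude-sonnet-4.6"},
    "tasks": ["tasks/*.yaml"],
}
print(validate_spec(spec))  # []
```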
| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Eval suite name |
| `description` | string | What the eval tests |
| `skill` | string | Associated skill name |
| `version` | string | Version number (e.g., "1.0") |
| `inputs` | object | Key-value map of global template variables (see Template Variables) |
| `tasks_from` | string | Path to an external YAML file containing the task list |
| `hooks` | object | Lifecycle hooks that run shell commands at specific points (see Hooks) |
| `baseline` | bool | Mark this spec as a baseline for A/B comparison |

The config block controls execution behavior:

```yaml
config:
  trials_per_task: 1        # Run each task this many times
  timeout_seconds: 300      # Task timeout in seconds
  parallel: false           # Run tasks sequentially (true = concurrent)
  workers: 4                # Parallel workers if parallel: true
  model: claude-sonnet-4.6  # Default model (override with --model)
  judge_model: gpt-4o       # Model for LLM-as-judge graders (optional)
  executor: mock            # mock (local) or copilot-sdk (real API)
```
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `trials_per_task` | int | `1` | Number of times each task runs (for statistical analysis) |
| `timeout_seconds` | int | `300` | Task timeout in seconds |
| `parallel` | bool | `false` | Run tasks concurrently |
| `workers` | int | `4` | Number of parallel workers |
| `model` | string | required | Default model for tasks (override with `--model` flag) |
| `judge_model` | string | same as `model` | Model for prompt-type graders (LLM-as-judge) |
| `executor` | string | `copilot-sdk` | Executor: `mock` (local, fast) or `copilot-sdk` (real API) |
| `max_attempts` | int | `0` | Maximum retry attempts per task on failure (0 = no retries) |
| `group_by` | string | | Group results by a field (e.g., tags, task_id) |
| `fail_fast` | bool | `false` | Stop the entire run on first task failure |
| `skill_directories` | list[str] | `[]` | Additional directories to search for skills |
| `required_skills` | list[str] | `[]` | Skills that must be available before running |
| `mcp_servers` | object | | MCP server configurations for the evaluation |

Common Timeouts:

- `60` — Quick tasks (single-file review, validation)
- `300` — Standard tasks (code explanation, analysis)
- `600` — Complex tasks (multi-file refactoring, design)

Graders validate task outputs. Define once, reuse across tasks:

```yaml
graders:
  - type: text
    name: checks_logic
    weight: 2.0
    config:
      pattern: "(?i)(function|variable|parameter)"
  - type: code
    name: has_minimum_output
    config:
      assertions:
        - "len(output) > 100"
        - "'success' in output.lower()"
  - type: text
    name: mentions_key_concepts
    config:
      keywords: ["algorithm", "optimization"]
      must_include_all: true
```

Each grader accepts an optional weight (default 1.0) that controls its influence on the composite score. See Validators & Graders for details.
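The weight can be thought of as controlling a weighted mean over grader scores. The sketch below is illustrative only; waza's actual aggregation may differ:

```python
# Illustrative composite scoring over grader results.
# `weight` defaults to 1.0, matching the grader reference above.

def composite_score(results: list[dict]) -> float:
    """Weighted mean of grader scores."""
    total_weight = sum(r.get("weight", 1.0) for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r["score"] * r.get("weight", 1.0) for r in results) / total_weight

results = [
    {"name": "checks_logic", "score": 1.0, "weight": 2.0},
    {"name": "has_minimum_output", "score": 0.5},
]
print(composite_score(results))  # weighted mean = 2.5 / 3
```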

All graders return:

- `score`: 0.0 to 1.0
- `passed`: boolean
- `message`: human-readable result

See the Validators & Graders guide for all 12 types and examples.
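For intuition, a `code`-type grader can be imagined as evaluating each assertion string as a Python expression with the task output bound to `output`. This is an illustrative sketch, not waza's implementation (a real grader would need proper sandboxing and error handling):

```python
# Illustrative code-grader: evaluate assertion strings against `output`.
# Only `len` is exposed to the expressions in this sketch.

def run_assertions(output: str, assertions: list[str]) -> dict:
    failures = [a for a in assertions
                if not eval(a, {"__builtins__": {"len": len}}, {"output": output})]
    score = 1.0 - len(failures) / len(assertions) if assertions else 1.0
    return {"score": score, "passed": not failures,
            "message": "ok" if not failures else f"failed: {failures}"}

result = run_assertions("x" * 150 + " success",
                        ["len(output) > 100", "'success' in output.lower()"])
print(result["passed"], result["score"])  # True 1.0
```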

Tasks define individual test cases. Either inline or from files:

```yaml
tasks:
  - id: basic-001
    name: Basic Usage
    description: Test basic functionality
    inputs:
      prompt: "Explain this code"
      files:
        - path: sample.py
    expected:
      output_contains:
        - "function"
        - "variable"
      behavior:
        max_tool_calls: 5
```

Load tasks from YAML files in a directory:

```yaml
tasks:
  - "tasks/*.yaml"          # All YAML files in tasks/
  - "tasks/basic/*.yaml"    # Specific subdirectory
  - "tasks/advanced.yaml"   # Single file
```

Individual task files (e.g., tasks/basic-usage.yaml):

```yaml
id: basic-usage-001
name: Basic Usage - Python Function
description: Test that the skill explains a simple Python function correctly.
tags:
  - basic
  - happy-path
inputs:
  prompt: "Explain this function"
  files:
    - path: sample.py
expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_response_time_ms: 30000
```
| Field | Type | Description |
| --- | --- | --- |
| `id` | string | Unique task identifier |
| `name` | string | Human-readable task name |
| `description` | string | What the task tests |
| `tags` | array | Tags for filtering (e.g., `["basic", "edge-case"]`) |
| `inputs` | object | Test inputs (prompt, files) |
| `expected` | object | Validation rules and expected behavior |
The inputs block provides the prompt and any fixture files:

```yaml
inputs:
  prompt: "Your instruction to the agent"
  files:
    - path: sample.py   # Fixture file (relative to fixtures dir)
      content: |        # Or inline content
        def hello():
            print("Hello")
```

Prompt supports templating:

```yaml
inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}
```
The expected block combines string checks, regex matches, outcome checks, and behavioral constraints:

```yaml
expected:
  # Strings that must appear in output
  output_contains:
    - "function"
    - "parameter"
  # Output must NOT contain these
  output_excludes:
    - "error"
    - "failed"
  # Regex patterns to match
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("
  # Task outcomes
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer
  # Behavioral constraints
  behavior:
    max_tool_calls: 5
    max_response_time_ms: 30000
    max_tokens: 4096
```
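The string-based checks can be sketched in a few lines; outcome and behavior checks need run metadata and are omitted here. This is an illustration of the semantics, not waza's code:

```python
# Illustrative application of output_contains / output_excludes / matches.
import re

def check_expected(output: str, expected: dict) -> list[str]:
    """Return a list of failure messages (empty list = all checks pass)."""
    failures = []
    failures += [f"missing: {s}" for s in expected.get("output_contains", [])
                 if s not in output]
    failures += [f"forbidden: {s}" for s in expected.get("output_excludes", [])
                 if s in output]
    failures += [f"no match: {p}" for p in expected.get("matches", [])
                 if not re.search(p, output)]
    return failures

expected = {
    "output_contains": ["function"],
    "output_excludes": ["error"],
    "matches": [r"def\s+\w+\("],
}
print(check_expected("This function is defined as def hello():", expected))  # []
```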

Fixtures are test files (code, documents, data) that tasks reference.

Important: Each task gets a fresh temp workspace with fixtures copied in. Original fixtures are never modified.
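The fresh-workspace behavior can be sketched as copying the fixtures directory into a per-task temporary directory (illustrative only; the real runner's layout and lifecycle may differ):

```python
# Illustrative per-task workspace: copy fixtures into a temp dir so the
# originals are never modified.
import shutil
import tempfile
from pathlib import Path

def make_workspace(fixtures_dir: Path) -> Path:
    """Copy every fixture into a fresh temporary workspace."""
    workspace = Path(tempfile.mkdtemp(prefix="waza-task-"))
    for item in fixtures_dir.iterdir():
        if item.is_dir():
            shutil.copytree(item, workspace / item.name)
        else:
            shutil.copy2(item, workspace / item.name)
    return workspace

fixtures = Path(tempfile.mkdtemp(prefix="fixtures-"))
(fixtures / "sample.py").write_text("def hello():\n    print('Hello')\n")
ws = make_workspace(fixtures)
(ws / "sample.py").write_text("mutated")         # the task edits its copy...
print((fixtures / "sample.py").read_text()[:9])  # ...original stays intact
```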

Create a fixtures/ directory:

```
evals/code-explainer/
├── eval.yaml
├── tasks/
│   └── basic-usage.yaml
└── fixtures/
    ├── sample.py
    ├── complex.py
    └── README.md
```

Reference in tasks:

```yaml
inputs:
  prompt: "Analyze {{fixture:sample.py}}"
  files:
    - path: sample.py
```
A typical project layout:

```
# Project mode
evals/
└── code-explainer/
    ├── eval.yaml
    ├── tasks/
    │   ├── basic-usage.yaml
    │   ├── edge-case.yaml
    │   └── should-not-trigger.yaml
    └── fixtures/
        ├── sample.py
        ├── complex.py
        └── nested/
            └── module.py
```

Specify context directory when running:

```sh
waza run eval.yaml --context-dir evals/code-explainer/fixtures
```

Or use relative paths in eval.yaml if fixtures are adjacent.

Run the same eval against multiple models:

```sh
# Run with gpt-4o
waza run eval.yaml --model gpt-4o -o gpt4.json

# Run with Claude
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json

# Compare results
waza compare gpt4.json sonnet.json
```

Override the default model in eval.yaml:

```sh
waza run eval.yaml --model gpt-4o   # Overrides config.model
```
Run only tasks whose IDs match a pattern:

```sh
waza run eval.yaml --task "basic*" --task "edge*"
```
Filter tasks by tag:

```sh
waza run eval.yaml --tags "happy-path"
```
```sh
# Run tasks concurrently with 4 workers
waza run eval.yaml --parallel --workers 4
```

Save eval results for later analysis or comparison:

```sh
waza run eval.yaml -o results.json
```

Output format:

```json
{
  "name": "code-explainer-eval",
  "model": "claude-sonnet-4.6",
  "pass_rate": 0.8,
  "tasks": [
    {
      "id": "basic-001",
      "name": "Basic Usage",
      "passed": true,
      "graders": [
        {
          "name": "checks_logic",
          "passed": true,
          "score": 1.0
        }
      ]
    }
  ]
}
```

For iterative testing, cache results:

```sh
waza run eval.yaml --cache --cache-dir .waza-cache
```

Only tasks with changed inputs/config re-run.
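One plausible way such a cache works is hashing each task's inputs and config into a lookup key; the key scheme below is an assumption for illustration, not waza's actual implementation:

```python
# Illustrative cache key: hash the task definition plus relevant config,
# so any change to either produces a new key and forces a re-run.
import hashlib
import json

def cache_key(task: dict, config: dict) -> str:
    payload = json.dumps({"task": task, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

task = {"id": "basic-001", "inputs": {"prompt": "Explain this code"}}
config = {"model": "claude-sonnet-4.6", "timeout_seconds": 300}
k1 = cache_key(task, config)
task["inputs"]["prompt"] = "Explain this code in depth"
k2 = cache_key(task, config)
print(k1 != k2)  # True: changed inputs invalidate the cached result
```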

Validate output format with a regex grader:

```yaml
graders:
  - type: text
    name: format_check
    config:
      pattern: "^[A-Z].*\\.$"   # Sentence starting with a capital, ending with a period
tasks:
  - id: format-001
    inputs:
      prompt: "Write a single sentence"
    expected:
      matches:
        - "^[A-Z].*\\.$"
```
Check completeness with code assertions:

```yaml
graders:
  - type: code
    name: completeness
    config:
      assertions:
        - "len(output) > 500"
        - "'function' in output"
        - "'parameter' in output"
tasks:
  - id: complete-001
    inputs:
      prompt: "Explain this function"
    expected:
      # All 3 assertions must pass
```
Constrain agent behavior to test efficiency:

```yaml
tasks:
  - id: efficient-001
    inputs:
      prompt: "Refactor this code"
    expected:
      behavior:
        max_tool_calls: 3           # Efficient
        max_response_time_ms: 5000  # Quick
        max_tokens: 1000            # Concise
```

Lifecycle hooks run shell commands at specific points during an evaluation. Use them for setup, teardown, or validation.

```yaml
hooks:
  before_run:
    - command: "npm install"
      working_directory: "./fixtures"
      error_on_fail: true
  after_run:
    - command: "bash cleanup.sh"
  before_task:
    - command: "echo Starting task"
  after_task:
    - command: "bash collect-metrics.sh"
```
| Hook | When it runs |
| --- | --- |
| `before_run` | Once, before the entire evaluation starts |
| `after_run` | Once, after all tasks complete |
| `before_task` | Before each individual task |
| `after_task` | After each individual task |

Each hook entry:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `command` | string | (required) | Shell command to execute |
| `working_directory` | string | `.` | Working directory for the command |
| `exit_codes` | list[int] | `[0]` | Acceptable exit codes |
| `error_on_fail` | bool | `false` | Abort the run if this hook fails |
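Putting the fields together, a hook runner could behave roughly like this sketch (illustrative, not waza's implementation):

```python
# Illustrative hook runner: execute the command in its working directory
# and compare the exit code against the acceptable `exit_codes` list.
import subprocess

def run_hook(hook: dict) -> bool:
    proc = subprocess.run(
        hook["command"],
        shell=True,
        cwd=hook.get("working_directory", "."),
        capture_output=True,
    )
    ok = proc.returncode in hook.get("exit_codes", [0])
    if not ok and hook.get("error_on_fail", False):
        raise RuntimeError(f"hook failed: {hook['command']} -> {proc.returncode}")
    return ok

print(run_hook({"command": "echo Starting task"}))  # True
```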

Use the inputs field to define global template variables that are substituted into task prompts:

```yaml
inputs:
  language: python
  framework: fastapi

tasks:
  - id: scaffold-001
    inputs:
      prompt: "Create a {{language}} app using {{framework}}"
```

Prompt templating also supports fixture file injection:

```yaml
inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}
```

The {{fixture:filename}} syntax inlines the content of a file from the fixtures directory into the prompt.
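Both substitutions can be sketched with a single regex pass. This illustrates the behavior described above, not the real template engine (which may handle missing keys or escaping differently):

```python
# Illustrative prompt templating: {{fixture:name}} inlines file content,
# {{var}} pulls from the global inputs map.
import re

def render_prompt(template: str, variables: dict, fixtures: dict) -> str:
    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key.startswith("fixture:"):
            return fixtures[key.removeprefix("fixture:")]
        return str(variables[key])
    return re.sub(r"\{\{([^}]+)\}\}", replace, template)

prompt = render_prompt(
    "Create a {{language}} app.\nExplain:\n{{fixture:sample.py}}",
    {"language": "python"},
    {"sample.py": "def hello(): pass"},
)
print(prompt)
```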


Use tasks_from to load task definitions from a separate YAML file:

```yaml
name: shared-eval
tasks_from: shared-tasks.yaml
config:
  trials_per_task: 3
  model: claude-sonnet-4.6
```

This is useful when multiple eval specs share the same task set but differ in config or graders.


  1. Clear task descriptions — Future reviewers should understand what’s being tested
  2. Realistic validators — Don’t over-specify. A few key checks beat 20 strict rules
  3. Fixture diversity — Include basic, edge case, and negative test fixtures
  4. Tag your tasks — Makes filtering and analysis easier
  5. Use timeout appropriately — Too short = false failures, too long = slow tests
  6. Reuse graders — Define once, apply across multiple tasks
  7. Version your evals — Track improvements with version numbers