
# Writing Eval Specs

A complete reference for writing eval.yaml specifications and task definitions.

The evaluation spec defines the benchmark configuration, graders, and task files:

```yaml
name: code-explainer-eval
description: Evaluation suite for code-explainer skill
skill: code-explainer
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  model: claude-sonnet-4.6

graders:
  - type: text
    name: explains_concepts
    config:
      pattern: "(?i)(function|logic|parameter)"
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 100"

tasks:
  - "tasks/*.yaml"
```
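Once parsed, a spec like this is just nested maps and lists. As a minimal illustration (not part of waza), a top-level validator might look like the sketch below; treating `name` and `skill` as mandatory is an assumption for the example:

```python
# Illustrative validator for a parsed eval spec (not waza's own code).
# Field names follow the reference tables in this document.

REQUIRED_FIELDS = {"name", "skill"}  # assumption: minimal mandatory set

def validate_spec(spec: dict) -> list[str]:
    """Return a list of human-readable problems (empty list = valid)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - spec.keys())]
    config = spec.get("config", {})
    if config.get("parallel") and config.get("workers", 4) < 1:
        problems.append("workers must be >= 1 when parallel is true")
    if not spec.get("tasks") and not spec.get("tasks_from"):
        problems.append("spec defines no tasks (need tasks or tasks_from)")
    return problems

spec = {
    "name": "code-explainer-eval",
    "skill": "code-explainer",
    "config": {"trials_per_task": 1, "model": "claude-sonnet-4.6"},
    "tasks": ["tasks/*.yaml"],
}
print(validate_spec(spec))  # []
```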
| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Eval suite name |
| `description` | string | What the eval tests |
| `skill` | string | Associated skill name |
| `version` | string | Version number (e.g., "1.0") |
| `inputs` | object | Key-value map of global template variables (see Template Variables) |
| `tasks_from` | string | Path to an external YAML file containing the task list |
| `hooks` | object | Lifecycle hooks that run shell commands at specific points (see Hooks) |
| `baseline` | bool | Mark this spec as a baseline for A/B comparison |

The config block controls execution behavior:

```yaml
config:
  trials_per_task: 1        # Run each task this many times
  timeout_seconds: 300      # Task timeout in seconds
  parallel: false           # Run tasks sequentially (true = concurrent)
  workers: 4                # Parallel workers if parallel: true
  model: claude-sonnet-4.6  # Default model (override with --model)
  judge_model: gpt-4o       # Model for LLM-as-judge graders (optional)
  executor: mock            # mock (local) or copilot-sdk (real API)
```
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `trials_per_task` | int | `1` | Number of times each task runs (for statistical analysis) |
| `timeout_seconds` | int | `300` | Task timeout in seconds |
| `parallel` | bool | `false` | Run tasks concurrently |
| `workers` | int | `4` | Number of parallel workers |
| `model` | string | required | Default model for tasks (override with `--model` flag) |
| `judge_model` | string | same as `model` | Model for prompt-type graders (LLM-as-judge) |
| `executor` | string | `copilot-sdk` | Executor: `mock` (local, fast) or `copilot-sdk` (real API) |
| `max_attempts` | int | `0` | Maximum retry attempts per task on failure (0 = no retries) |
| `group_by` | string | | Group results by a field (e.g., tags, task_id) |
| `fail_fast` | bool | `false` | Stop the entire run on first task failure |
| `skill_directories` | list[str] | `[]` | Additional directories to search for skills |
| `required_skills` | list[str] | `[]` | Skills that must be available before running |
| `mcp_servers` | object | | MCP server configurations for the evaluation |

Common Timeouts:

- `60` — Quick tasks (single-file review, validation)
- `300` — Standard tasks (code explanation, analysis)
- `600` — Complex tasks (multi-file refactoring, design)

Graders validate task outputs. Define once, reuse across tasks:

```yaml
graders:
  - type: text
    name: checks_logic
    weight: 2.0
    config:
      pattern: "(?i)(function|variable|parameter)"
  - type: code
    name: has_minimum_output
    config:
      assertions:
        - "len(output) > 100"
        - "'success' in output.lower()"
  - type: text
    name: mentions_key_concepts
    config:
      keywords: ["algorithm", "optimization"]
      must_include_all: true
```

Each grader accepts an optional weight (default 1.0) that controls its influence on the composite score. See Validators & Graders for details.
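The weight can be thought of as controlling a weighted mean over grader scores. The sketch below is illustrative only; waza's actual aggregation may differ:

```python
# Illustrative composite scoring over grader results.
# `weight` defaults to 1.0, matching the grader reference above.

def composite_score(results: list[dict]) -> float:
    """Weighted mean of grader scores."""
    total_weight = sum(r.get("weight", 1.0) for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r["score"] * r.get("weight", 1.0) for r in results) / total_weight

results = [
    {"name": "checks_logic", "score": 1.0, "weight": 2.0},
    {"name": "has_minimum_output", "score": 0.5},
]
print(composite_score(results))  # weighted mean = 2.5 / 3
```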

All graders return:

- `score`: 0.0 to 1.0
- `passed`: boolean
- `message`: human-readable result

See the Validators & Graders guide for all 12 types and examples.
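For intuition, a `code`-type grader can be imagined as evaluating each assertion string as a Python expression with the task output bound to `output`. This is an illustrative sketch, not waza's implementation (a real grader would need proper sandboxing and error handling):

```python
# Illustrative code-grader: evaluate assertion strings against `output`.
# Only `len` is exposed to the expressions in this sketch.

def run_assertions(output: str, assertions: list[str]) -> dict:
    failures = [a for a in assertions
                if not eval(a, {"__builtins__": {"len": len}}, {"output": output})]
    score = 1.0 - len(failures) / len(assertions) if assertions else 1.0
    return {"score": score, "passed": not failures,
            "message": "ok" if not failures else f"failed: {failures}"}

result = run_assertions("x" * 150 + " success",
                        ["len(output) > 100", "'success' in output.lower()"])
print(result["passed"], result["score"])  # True 1.0
```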

Tasks define individual test cases. Either inline or from files:

```yaml
tasks:
  - id: basic-001
    name: Basic Usage
    description: Test basic functionality
    inputs:
      prompt: "Explain this code"
      files:
        - path: sample.py
    expected:
      output_contains:
        - "function"
        - "variable"
      behavior:
        max_tool_calls: 5
```

Load tasks from YAML files in a directory:

```yaml
tasks:
  - "tasks/*.yaml"          # All YAML files in tasks/
  - "tasks/basic/*.yaml"    # Specific subdirectory
  - "tasks/advanced.yaml"   # Single file
```

Individual task files (e.g., tasks/basic-usage.yaml):

```yaml
id: basic-usage-001
name: Basic Usage - Python Function
description: Test that the skill explains a simple Python function correctly.
tags:
  - basic
  - happy-path
inputs:
  prompt: "Explain this function"
  files:
    - path: sample.py
expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_response_time_ms: 30000
```
| Field | Type | Description |
| --- | --- | --- |
| `id` | string | Unique task identifier |
| `name` | string | Human-readable task name |
| `description` | string | What the task tests |
| `tags` | array | Tags for filtering (e.g., `["basic", "edge-case"]`) |
| `inputs` | object | Test inputs (prompt, files) |
| `expected` | object | Validation rules and expected behavior |
The inputs block provides the prompt and any fixture files:

```yaml
inputs:
  prompt: "Your instruction to the agent"
  files:
    - path: sample.py   # Fixture file (relative to fixtures dir)
      content: |        # Or inline content
        def hello():
            print("Hello")
```

Prompt supports templating:

```yaml
inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}
```
The expected block combines string checks, regex matches, outcome checks, and behavioral constraints:

```yaml
expected:
  # Strings that must appear in output
  output_contains:
    - "function"
    - "parameter"
  # Output must NOT contain these
  output_excludes:
    - "error"
    - "failed"
  # Regex patterns to match
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("
  # Task outcomes
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer
  # Behavioral constraints
  behavior:
    max_tool_calls: 5
    max_response_time_ms: 30000
    max_tokens: 4096
```
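The string-based checks can be sketched in a few lines; outcome and behavior checks need run metadata and are omitted here. This is an illustration of the semantics, not waza's code:

```python
# Illustrative application of output_contains / output_excludes / matches.
import re

def check_expected(output: str, expected: dict) -> list[str]:
    """Return a list of failure messages (empty list = all checks pass)."""
    failures = []
    failures += [f"missing: {s}" for s in expected.get("output_contains", [])
                 if s not in output]
    failures += [f"forbidden: {s}" for s in expected.get("output_excludes", [])
                 if s in output]
    failures += [f"no match: {p}" for p in expected.get("matches", [])
                 if not re.search(p, output)]
    return failures

expected = {
    "output_contains": ["function"],
    "output_excludes": ["error"],
    "matches": [r"def\s+\w+\("],
}
print(check_expected("This function is defined as def hello():", expected))  # []
```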

Fixtures are test files (code, documents, data) that tasks reference.

Important: Each task gets a fresh temp workspace with fixtures copied in. Original fixtures are never modified.
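The fresh-workspace behavior can be sketched as copying the fixtures directory into a per-task temporary directory (illustrative only; the real runner's layout and lifecycle may differ):

```python
# Illustrative per-task workspace: copy fixtures into a temp dir so the
# originals are never modified.
import shutil
import tempfile
from pathlib import Path

def make_workspace(fixtures_dir: Path) -> Path:
    """Copy every fixture into a fresh temporary workspace."""
    workspace = Path(tempfile.mkdtemp(prefix="waza-task-"))
    for item in fixtures_dir.iterdir():
        if item.is_dir():
            shutil.copytree(item, workspace / item.name)
        else:
            shutil.copy2(item, workspace / item.name)
    return workspace

fixtures = Path(tempfile.mkdtemp(prefix="fixtures-"))
(fixtures / "sample.py").write_text("def hello():\n    print('Hello')\n")
ws = make_workspace(fixtures)
(ws / "sample.py").write_text("mutated")         # the task edits its copy...
print((fixtures / "sample.py").read_text()[:9])  # ...original stays intact
```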

Create a fixtures/ directory:

```
evals/code-explainer/
├── eval.yaml
├── tasks/
│   └── basic-usage.yaml
└── fixtures/
    ├── sample.py
    ├── complex.py
    └── README.md
```

Reference in tasks:

```yaml
inputs:
  prompt: "Analyze {{fixture:sample.py}}"
  files:
    - path: sample.py
```
A typical project layout:

```
# Project mode
evals/
└── code-explainer/
    ├── eval.yaml
    ├── tasks/
    │   ├── basic-usage.yaml
    │   ├── edge-case.yaml
    │   └── should-not-trigger.yaml
    └── fixtures/
        ├── sample.py
        ├── complex.py
        └── nested/
            └── module.py
```

Specify context directory when running:

```sh
waza run eval.yaml --context-dir evals/code-explainer/fixtures
```

Or use relative paths in eval.yaml if fixtures are adjacent.

Run the same eval against multiple models:

```sh
# Run with gpt-4o
waza run eval.yaml --model gpt-4o -o gpt4.json

# Run with Claude
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json

# Compare results
waza compare gpt4.json sonnet.json
```

Override the default model in eval.yaml:

```sh
waza run eval.yaml --model gpt-4o   # Overrides config.model
```
Run only tasks whose IDs match a pattern:

```sh
waza run eval.yaml --task "basic*" --task "edge*"
```
Filter tasks by tag:

```sh
waza run eval.yaml --tags "happy-path"
```
```sh
# Run tasks concurrently with 4 workers
waza run eval.yaml --parallel --workers 4
```

Save eval results for later analysis or comparison:

```sh
waza run eval.yaml -o results.json
```

Output format:

```json
{
  "name": "code-explainer-eval",
  "model": "claude-sonnet-4.6",
  "pass_rate": 0.8,
  "tasks": [
    {
      "id": "basic-001",
      "name": "Basic Usage",
      "passed": true,
      "graders": [
        {
          "name": "checks_logic",
          "passed": true,
          "score": 1.0
        }
      ]
    }
  ]
}
```

For iterative testing, cache results:

```sh
waza run eval.yaml --cache --cache-dir .waza-cache
```

Only tasks with changed inputs/config re-run.
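One plausible way such a cache works is hashing each task's inputs and config into a lookup key; the key scheme below is an assumption for illustration, not waza's actual implementation:

```python
# Illustrative cache key: hash the task definition plus relevant config,
# so any change to either produces a new key and forces a re-run.
import hashlib
import json

def cache_key(task: dict, config: dict) -> str:
    payload = json.dumps({"task": task, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

task = {"id": "basic-001", "inputs": {"prompt": "Explain this code"}}
config = {"model": "claude-sonnet-4.6", "timeout_seconds": 300}
k1 = cache_key(task, config)
task["inputs"]["prompt"] = "Explain this code in depth"
k2 = cache_key(task, config)
print(k1 != k2)  # True: changed inputs invalidate the cached result
```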

Validate output format with a regex grader:

```yaml
graders:
  - type: text
    name: format_check
    config:
      pattern: "^[A-Z].*\\.$"   # Sentence starting with a capital, ending with a period
tasks:
  - id: format-001
    inputs:
      prompt: "Write a single sentence"
    expected:
      matches:
        - "^[A-Z].*\\.$"
```
Check completeness with code assertions:

```yaml
graders:
  - type: code
    name: completeness
    config:
      assertions:
        - "len(output) > 500"
        - "'function' in output"
        - "'parameter' in output"
tasks:
  - id: complete-001
    inputs:
      prompt: "Explain this function"
    expected:
      # All 3 assertions must pass
```
Constrain agent behavior to test efficiency:

```yaml
tasks:
  - id: efficient-001
    inputs:
      prompt: "Refactor this code"
    expected:
      behavior:
        max_tool_calls: 3           # Efficient
        max_response_time_ms: 5000  # Quick
        max_tokens: 1000            # Concise
```

Lifecycle hooks run shell commands at specific points during an evaluation. Use them for setup, teardown, or validation.

```yaml
hooks:
  before_run:
    - command: "npm install"
      working_directory: "./fixtures"
      error_on_fail: true
  after_run:
    - command: "bash cleanup.sh"
  before_task:
    - command: "echo Starting task"
  after_task:
    - command: "bash collect-metrics.sh"
```
| Hook | When it runs |
| --- | --- |
| `before_run` | Once, before the entire evaluation starts |
| `after_run` | Once, after all tasks complete |
| `before_task` | Before each individual task |
| `after_task` | After each individual task |

Each hook entry:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `command` | string | (required) | Shell command to execute |
| `working_directory` | string | `.` | Working directory for the command |
| `exit_codes` | list[int] | `[0]` | Acceptable exit codes |
| `error_on_fail` | bool | `false` | Abort the run if this hook fails |
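Putting the fields together, a hook runner could behave roughly like this sketch (illustrative, not waza's implementation):

```python
# Illustrative hook runner: execute the command in its working directory
# and compare the exit code against the acceptable `exit_codes` list.
import subprocess

def run_hook(hook: dict) -> bool:
    proc = subprocess.run(
        hook["command"],
        shell=True,
        cwd=hook.get("working_directory", "."),
        capture_output=True,
    )
    ok = proc.returncode in hook.get("exit_codes", [0])
    if not ok and hook.get("error_on_fail", False):
        raise RuntimeError(f"hook failed: {hook['command']} -> {proc.returncode}")
    return ok

print(run_hook({"command": "echo Starting task"}))  # True
```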

Use the inputs field to define global template variables that are substituted into task prompts:

```yaml
inputs:
  language: python
  framework: fastapi

tasks:
  - id: scaffold-001
    inputs:
      prompt: "Create a {{language}} app using {{framework}}"
```

Prompt templating also supports fixture file injection:

```yaml
inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}
```

The {{fixture:filename}} syntax inlines the content of a file from the fixtures directory into the prompt.
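Both substitutions can be sketched with a single regex pass. This illustrates the behavior described above, not the real template engine (which may handle missing keys or escaping differently):

```python
# Illustrative prompt templating: {{fixture:name}} inlines file content,
# {{var}} pulls from the global inputs map.
import re

def render_prompt(template: str, variables: dict, fixtures: dict) -> str:
    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key.startswith("fixture:"):
            return fixtures[key.removeprefix("fixture:")]
        return str(variables[key])
    return re.sub(r"\{\{([^}]+)\}\}", replace, template)

prompt = render_prompt(
    "Create a {{language}} app.\nExplain:\n{{fixture:sample.py}}",
    {"language": "python"},
    {"sample.py": "def hello(): pass"},
)
print(prompt)
```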


Use tasks_from to load task definitions from a separate YAML file:

```yaml
name: shared-eval
tasks_from: shared-tasks.yaml
config:
  trials_per_task: 3
  model: claude-sonnet-4.6
```

This is useful when multiple eval specs share the same task set but differ in config or graders.


  1. Clear task descriptions — Future reviewers should understand what’s being tested
  2. Realistic validators — Don’t over-specify. A few key checks beat 20 strict rules
  3. Fixture diversity — Include basic, edge case, and negative test fixtures
  4. Tag your tasks — Makes filtering and analysis easier
  5. Use timeout appropriately — Too short = false failures, too long = slow tests
  6. Reuse graders — Define once, apply across multiple tasks
  7. Version your evals — Track improvements with version numbers