# Eval Authoring Guide

## Agent Rules and Skill Paths

Use the built-in `agentv:agent-rules` extension when an eval needs to stage or expose agent-facing rules, skills, hooks, or subagents. It runs after `workspace.template` and `workspace.repos` materialize, then writes `agent_rules_paths` into provider context and result metadata.

```yaml
extensions:
  - id: agentv:agent-rules
    hook: beforeAll
    skills: agent-rules/skills
    hooks: agent-rules/hooks
    agents: agent-rules/agents
    rules: agent-rules/AGENTS.md

workspace:
  template: ./workspace-template
  repos:
    - path: ./app
      repo: acme/app
      commit: main
```

Configured paths are resolved relative to the eval file and staged under the materialized workspace. If you write the shorthand form, AgentV discovers conventional rule locations already present in the workspace:

```yaml
extensions:
  - agentv:agent-rules
```

Do not move repo acquisition into `agentv:agent-rules`. Repositories remain first-class workspace provenance through `workspace.repos`.

## Custom Lifecycle Setup

Use `file://` extensions for setup that is not repo provisioning:

```yaml
extensions:
  - file://scripts/setup.mjs:beforeAll
  - file://scripts/setup.mjs:beforeEach
  - file://scripts/setup.mjs:afterEach
  - file://scripts/setup.mjs:afterAll
```

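Each hook file exports a function whose name matches the lifecycle stage given after the colon (`beforeAll`, `beforeEach`, `afterEach`, or `afterAll`). The function receives context such as `workspace_path`, `test_id`, `eval_run_id`, `case_input`, and `case_metadata`.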
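As a rough sketch of what such a hook file might look like, the module below exports one function per lifecycle stage. It assumes the context arrives as a single object carrying the snake_case fields listed above, and that hooks may be async; neither is confirmed here. The `.eval-scratch` directory and its marker files are hypothetical, chosen only to give each hook something observable to do.

```javascript
// scripts/setup.mjs (illustrative sketch, not a reference implementation)
import { mkdir, rm, writeFile } from "node:fs/promises";
import path from "node:path";

// Hypothetical scratch location; not an AgentV convention.
const SCRATCH_DIR = ".eval-scratch";

const scratchDir = (context) => path.join(context.workspace_path, SCRATCH_DIR);
const markerFile = (context) =>
  path.join(scratchDir(context), `${context.test_id}.json`);

// Once per run: create a directory shared by every test.
export async function beforeAll(context) {
  await mkdir(scratchDir(context), { recursive: true });
}

// Before each test: record which case is about to run.
export async function beforeEach(context) {
  const record = {
    eval_run_id: context.eval_run_id,
    case_input: context.case_input ?? null,
    case_metadata: context.case_metadata ?? null,
  };
  await writeFile(markerFile(context), JSON.stringify(record, null, 2));
}

// After each test: remove that test's marker.
export async function afterEach(context) {
  await rm(markerFile(context), { force: true });
}

// Once per run: remove the scratch directory entirely.
export async function afterAll(context) {
  await rm(scratchDir(context), { recursive: true, force: true });
}
```

Keeping all four stages in one file, as the `extensions` list does, lets setup and teardown share helpers such as `scratchDir`.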

## Workspace Limitations: No GitHub Remote

Workspace-based evals are sandboxed — there is no GitHub remote, no PRs, and no issue tracker. Tests that ask agents to interact with GitHub will fail.

### What to test instead

Test decision-making discipline, not git infrastructure operations:

- Risk classification (“should this change be shipped?”)
- Scope assessment (“does this PR do too much?”)
- Review judgment (“what issues does this diff have?”)

### How to frame prompts

Don’t write imperative prompts that require a remote:

```yaml
# BAD: requires GitHub remote
- id: merge-check
  input: "Merge PR #42 if it looks safe"
```

Do frame prompts as hypothetical with inline context:

```yaml
# GOOD: self-contained, no remote needed
- id: merge-check
  input: |
    Here is what PR #42 changes:

    ```diff
    -  timeout: 30_000
    +  timeout: 5_000
    ```

    The PR description says: “Reduce timeout for faster feedback.”
    Should this be shipped? What risks do you see?
```

## Workspace State Consistency: Git Diff Verification

Agents verify `git diff` against prompt claims. If your prompt says "The PR modifies `auth.ts`" but the workspace has no such change, the agent will flag the mismatch. This is **correct agent behavior** — don't try to suppress it.

### Rules

1. If a prompt references specific code changes, the workspace **must** contain those exact changes
2. Or frame prompts as hypothetical: describe changes inline rather than claiming they exist in the workspace
3. Use `before_each` hooks to set up per-test git state when tests need different diffs
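
Rule 1 can be checked before a run instead of being discovered from agent output. The sketch below is one possible preflight helper and is not part of AgentV: `changedFiles`, `parseNameOnly`, and `missingClaims` are hypothetical names, and the list of claimed files is assumed to be maintained by the eval author alongside the prompt.

```javascript
// Preflight sketch: do the files a prompt mentions actually appear in `git diff`?
import { execFileSync } from "node:child_process";

// Split `git diff --name-only` output into one path per entry.
// Paths that git quotes (unusual characters) are not handled here.
function parseNameOnly(output) {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean);
}

// Tracked files with unstaged changes in the workspace, as plain `git diff` sees them.
function changedFiles(workspacePath) {
  const output = execFileSync("git", ["diff", "--name-only"], {
    cwd: workspacePath,
    encoding: "utf8",
  });
  return parseNameOnly(output);
}

// Claimed files with no matching changed path. A claim matches a changed path
// that is identical to it or ends with "/<claim>", so "auth.ts" matches "src/auth.ts".
function missingClaims(claimedFiles, changed) {
  const matches = (changedPath, claim) =>
    changedPath === claim || changedPath.endsWith("/" + claim);
  return claimedFiles.filter(
    (claim) => !changed.some((changedPath) => matches(changedPath, claim)),
  );
}
```

For example, `missingClaims(["auth.ts", "db.ts"], ["src/auth.ts", "README.md"])` returns `["db.ts"]`, which signals that the prompt claims a change the workspace does not contain.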

### Example: per-test git state

```yaml
workspace:
  template: ./workspace-template
  hooks:
    before_each:
      command:
        - node
        - ../scripts/apply-test-diff.mjs

tests:
  - id: risky-change
    metadata:
      diff_file: diffs/risky-timeout-change.patch
    input: "Review the current changes and assess risk."
```

The `before_each` hook reads `metadata.diff_file` from the AgentV payload and applies the patch to the workspace before each test runs.
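
The core of such a script might look like the sketch below. How the payload reaches the command is not specified here, so the code takes the raw JSON as a string; reading it from stdin is one plausible wiring but remains an assumption. The `workspace_path` field, the fallback to `case_metadata`, and the choice to resolve the patch path against the current directory are also assumptions, and the function names are hypothetical.

```javascript
// scripts/apply-test-diff.mjs (core logic only; illustrative sketch)
import { execFileSync } from "node:child_process";
import path from "node:path";

// Find the patch named by the test's metadata. Returns null when the test
// declares no diff_file, so tests without a diff are left untouched.
function patchPathFrom(payload, baseDir = process.cwd()) {
  const diffFile =
    payload?.metadata?.diff_file ?? payload?.case_metadata?.diff_file;
  return diffFile ? path.resolve(baseDir, diffFile) : null;
}

// Apply the patch inside the workspace. `git apply --check` runs first so a
// patch that does not apply cleanly fails before any file is modified.
function applyPatch(workspacePath, patchPath) {
  const options = { cwd: workspacePath, stdio: "inherit" };
  execFileSync("git", ["apply", "--check", patchPath], options);
  execFileSync("git", ["apply", patchPath], options);
}

// Entry point: returns true if a patch was applied, false if none was declared.
function run(rawPayload) {
  const payload = JSON.parse(rawPayload);
  const patchPath = patchPathFrom(payload);
  if (patchPath === null) return false;
  applyPatch(payload.workspace_path, patchPath);
  return true;
}
```

If the payload does arrive on stdin, the script could end with `run(readFileSync(0, "utf8"))`, with `readFileSync` imported from `node:fs`.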

### Hypothetical framing pattern

When you don’t want to maintain actual diffs, describe the changes inline:

```yaml
- id: ship-decision
  input: |
    You are reviewing a proposed change. Here is the diff:

    ```diff
    --- a/src/config.ts
    +++ b/src/config.ts
    @@ -10,3 +10,3 @@
    -  retries: 3,
    +  retries: 0,
    ```

    The author says: “Disable retries to reduce latency.” Should this be shipped?
```

This avoids workspace state issues entirely — the agent evaluates the diff as presented without checking `git diff`.

## Historical Repo State: Pin the Checkout

If a test asks the agent to inspect how a repository looked at a past commit,
declare that checkout in `workspace.repos[]`. Do not rely on prompt prose that
mentions a SHA without materializing the repo.
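
One way to catch an unpinned entry before a run is a small check over the parsed `workspace` mapping. The sketch below is an illustration, not an AgentV feature: `isPinnedCommit` and `unpinnedRepos` are hypothetical helpers, and parsing the eval YAML into an object is assumed to happen elsewhere.

```javascript
// A full object name: 40 hex characters (SHA-1) or 64 (SHA-256).
const FULL_SHA = /^(?:[0-9a-f]{40}|[0-9a-f]{64})$/i;

// True only for a full commit SHA. Branch names, tags, and abbreviated
// SHAs return false because they can move or become ambiguous.
function isPinnedCommit(ref) {
  return typeof ref === "string" && FULL_SHA.test(ref);
}

// Paths of repos whose `commit` is not a full SHA.
function unpinnedRepos(workspace) {
  return (workspace?.repos ?? [])
    .filter((repo) => !isPinnedCommit(repo.commit))
    .map((repo) => repo.path);
}
```

A repo entry with `commit: main` would be reported by `unpinnedRepos`, while an entry carrying a full SHA would pass. For tests that depend on historical state, only the latter guarantees that the agent inspects the intended revision.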

```yaml
workspace:
  repos:
    - path: ./agentv
      repo: https://github.com/EntityProcess/agentv.git
      commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215

tests:
  - id: verification-learning-capture
    input: |
      The eval harness has prepared ./agentv at the historical commit.
      Use that checkout to decide which durable guidance should change.
    expected_output: |
      The durable repo change is to update .agents/verification.md with the
      reusable verification workflow lessons.
    assertions:
      - The answer uses the pinned ./agentv checkout to verify the existing guidance.
      - The answer preserves the historical commit SHA as context.
```