
Eval Authoring Guide

Use the built-in `agentv:agent-rules` extension when an eval needs to stage or expose agent-facing rules, skills, hooks, or subagents. It runs after `workspace.template` and `workspace.repos` materialize, then writes `agent_rules_paths` into provider context and result metadata.

```yaml
extensions:
  - id: agentv:agent-rules
    hook: beforeAll
    skills: agent-rules/skills
    hooks: agent-rules/hooks
    agents: agent-rules/agents
    rules: agent-rules/AGENTS.md
workspace:
  template: ./workspace-template
  repos:
    - path: ./app
      repo: acme/app
      commit: main
```

Configured paths are resolved relative to the eval file and staged under the materialized workspace. If you write the shorthand form, AgentV discovers conventional rule locations already present in the workspace:

```yaml
extensions:
  - agentv:agent-rules
```

Do not move repo acquisition into `agentv:agent-rules`. Repositories remain first-class workspace provenance through `workspace.repos`.

Use `file://` extensions for setup that is not repo provisioning:

```yaml
extensions:
  - file://scripts/setup.mjs:beforeAll
  - file://scripts/setup.mjs:beforeEach
  - file://scripts/setup.mjs:afterEach
  - file://scripts/setup.mjs:afterAll
```

Each file hook exports a function whose name matches the hook named after the colon (for example, `beforeAll`). The function receives context such as `workspace_path`, `test_id`, `eval_run_id`, `case_input`, and `case_metadata`.
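
As a rough illustration, `scripts/setup.mjs` could look like the sketch below. The context field names come from the list above; the `describeStage` helper, the log format, and the returned strings are hypothetical, and this guide does not say whether AgentV uses a hook's return value or whether `test_id` is present in suite-level hooks.

```javascript
// Sketch of scripts/setup.mjs.
// Assumption: each hook is called with one context object carrying the
// fields named in this guide (workspace_path, test_id, eval_run_id, ...).

// Hypothetical helper: build a one-line summary of the context for logging.
function describeStage(stage, context) {
  // Assumption: test_id may be missing in beforeAll/afterAll.
  const testId = context.test_id ?? '(suite-level)';
  return `[${stage}] run=${context.eval_run_id} test=${testId} workspace=${context.workspace_path}`;
}

function beforeAll(context) {
  const line = describeStage('beforeAll', context);
  console.log(line);
  return line;
}

function beforeEach(context) {
  const line = describeStage('beforeEach', context);
  console.log(line);
  return line;
}

function afterEach(context) {
  const line = describeStage('afterEach', context);
  console.log(line);
  return line;
}

function afterAll(context) {
  const line = describeStage('afterAll', context);
  console.log(line);
  return line;
}

// When saved as setup.mjs, export the hooks by name so each
// file://scripts/setup.mjs:<hook> entry can resolve its function:
// export { beforeAll, beforeEach, afterEach, afterAll };
```

The export line is left commented so the snippet can also be pasted into a REPL; in the actual module it needs to be active.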

Workspace-based evals are sandboxed — there is no GitHub remote, no PRs, and no issue tracker. Tests that ask agents to interact with GitHub will fail.

Test decision-making discipline, not git infrastructure operations:

- Risk classification (“should this change be shipped?”)
- Scope assessment (“does this PR do too much?”)
- Review judgment (“what issues does this diff have?”)

Don’t write imperative prompts that require a remote:

```yaml
# BAD: requires GitHub remote
- id: merge-check
  input: "Merge PR #42 if it looks safe"
```

Do frame prompts as hypothetical with inline context:

````yaml
# GOOD: self-contained, no remote needed
- id: merge-check
  input: |
    Here is what PR #42 changes:
    ```diff
    - timeout: 30_000
    + timeout: 5_000
    ```

    The PR description says: "Reduce timeout for faster feedback."
    Should this be shipped? What risks do you see?
````

## Workspace State Consistency: Git Diff Verification
Agents verify `git diff` against prompt claims. If your prompt says "The PR modifies `auth.ts`" but the workspace has no such change, the agent will flag the mismatch. This is **correct agent behavior** — don't try to suppress it.
### Rules
1. If a prompt references specific code changes, the workspace **must** contain those exact changes
2. Or frame prompts as hypothetical: describe changes inline rather than claiming they exist in the workspace
3. Use `before_each` hooks to set up per-test git state when tests need different diffs
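
Rule 1 can be checked mechanically before a run. The sketch below is a hypothetical authoring-time lint, not part of AgentV: `findUnbackedClaims` is an invented name, and it assumes prompts name files in backticks and that `changedFiles` holds the output of `git diff --name-only` from the prepared workspace.

```javascript
// Hypothetical authoring-time check (not an AgentV API).
// Returns backticked file names from the prompt that no changed file matches.
function findUnbackedClaims(prompt, changedFiles) {
  // Match backticked tokens that end in a file extension, e.g. `auth.ts`.
  // Tokens containing spaces (such as `git diff`) are ignored.
  const pattern = /`([^`\s]+\.[A-Za-z0-9]+)`/g;
  const mentioned = new Set([...prompt.matchAll(pattern)].map((m) => m[1]));
  // A claim is backed if a changed path equals the name or ends with "/<name>".
  return [...mentioned].filter(
    (name) => !changedFiles.some((file) => file === name || file.endsWith(`/${name}`)),
  );
}

const prompt = 'The PR modifies `auth.ts` and `src/config.ts`.';
console.log(findUnbackedClaims(prompt, ['src/config.ts'])); // [ 'auth.ts' ]
```

A non-empty result means the prompt should be reworded as a hypothetical or the workspace should be given the missing change.
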
### Example: per-test git state
```yaml
workspace:
  template: ./workspace-template
  hooks:
    before_each:
      command:
        - node
        - ../scripts/apply-test-diff.mjs
tests:
  - id: risky-change
    metadata:
      diff_file: diffs/risky-timeout-change.patch
    input: "Review the current changes and assess risk."
```

The `before_each` hook reads `metadata.diff_file` from the AgentV payload and applies the patch to the workspace before each test runs.
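
A minimal sketch of what `apply-test-diff.mjs` might do is shown below. It assumes the payload is available as a parsed JSON object exposing `workspace_path` and the test's metadata (under `metadata` or `case_metadata`), and that `diff_file` can be resolved from the workspace root; this guide confirms none of these details. `planPatch` and `applyPatch` are hypothetical names.

```javascript
// Sketch of the core of scripts/apply-test-diff.mjs (hypothetical).
// Assumptions: payload shape and the location of diff_file, as noted above.

// Decide which git command to run for this test, or null if it has no diff.
function planPatch(payload) {
  const metadata = payload?.metadata ?? payload?.case_metadata ?? {};
  const diffFile = metadata.diff_file;
  if (!diffFile) return null; // this test does not declare a patch
  return {
    cwd: payload.workspace_path,
    command: 'git',
    args: ['apply', '--whitespace=nowarn', diffFile],
  };
}

// Run the planned command inside the workspace; throws if git exits non-zero.
async function applyPatch(plan) {
  if (!plan) return;
  const { execFileSync } = await import('node:child_process');
  execFileSync(plan.command, plan.args, { cwd: plan.cwd, stdio: 'inherit' });
}

// Example plan for the risky-change test (nothing is executed here).
const examplePlan = planPatch({
  workspace_path: '/tmp/workspace',
  metadata: { diff_file: 'diffs/risky-timeout-change.patch' },
});
console.log(examplePlan.args.join(' '));
```

Separating planning from execution keeps the metadata lookup easy to test. The script's entry point would read the payload however AgentV delivers it, then call `applyPatch(planPatch(payload))`.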

When you don’t want to maintain actual diffs, describe the changes inline:

````yaml
- id: ship-decision
  input: |
    You are reviewing a proposed change. Here is the diff:
    ```diff
    --- a/src/config.ts
    +++ b/src/config.ts
    @@ -10,3 +10,3 @@
    - retries: 3,
    + retries: 0,
    ```

    The author says: "Disable retries to reduce latency." Should this be shipped?
````

This avoids workspace state issues entirely — the agent evaluates the diff as presented without checking `git diff`.
## Historical Repo State: Pin the Checkout
If a test asks the agent to inspect how a repository looked at a past commit,
declare that checkout in `workspace.repos[]`. Do not rely on prompt prose that
mentions a SHA without materializing the repo.
```yaml
workspace:
  repos:
    - path: ./agentv
      repo: https://github.com/EntityProcess/agentv.git
      commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215
tests:
  - id: verification-learning-capture
    input: |
      The eval harness has prepared ./agentv at the historical commit.
      Use that checkout to decide which durable guidance should change.
    expected_output: |
      The durable repo change is to update .agents/verification.md with the
      reusable verification workflow lessons.
    assertions:
      - The answer uses the pinned ./agentv checkout to verify the existing guidance.
      - The answer preserves the historical commit SHA as context.
```