Eval Authoring Guide
## Agent Rules and Skill Paths

Use the built-in `agentv:agent-rules` extension when an eval needs to stage or
expose agent-facing rules, skills, hooks, or subagents. It runs after
`workspace.template` and `workspace.repos` materialize, then writes
`agent_rules_paths` into provider context and result metadata.
```yaml
extensions:
  - id: agentv:agent-rules
    hook: beforeAll
    skills: agent-rules/skills
    hooks: agent-rules/hooks
    agents: agent-rules/agents
    rules: agent-rules/AGENTS.md

workspace:
  template: ./workspace-template
  repos:
    - path: ./app
      repo: acme/app
      commit: main
```

Configured paths are resolved relative to the eval file and staged under the materialized workspace. If you write the shorthand form, AgentV discovers conventional rule locations already present in the workspace:

```yaml
extensions:
  - agentv:agent-rules
```

Do not move repo acquisition into `agentv:agent-rules`. Repositories remain
first-class workspace provenance through `workspace.repos`.
## Custom Lifecycle Setup

Use file extensions for setup that is not repo provisioning:

```yaml
extensions:
  - file://scripts/setup.mjs:beforeAll
  - file://scripts/setup.mjs:beforeEach
  - file://scripts/setup.mjs:afterEach
  - file://scripts/setup.mjs:afterAll
```

Each file hook exports a function with the matching name. The function receives
context such as `workspace_path`, `test_id`, `eval_run_id`, `case_input`, and
`case_metadata`.
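A minimal sketch of such a hook file follows. It assumes the context arrives as a single object argument and that return values are ignored; neither detail is confirmed here. The marker file `.eval-setup.json` and the `.scratch` directory are hypothetical names chosen for illustration.

```javascript
// scripts/setup.mjs (sketch; the exact hook signature is an assumption)
import { mkdir, rm, writeFile } from "node:fs/promises";
import path from "node:path";

// Hypothetical helper: one scratch directory per test inside the workspace.
function scratchDir(context) {
  return path.join(context.workspace_path, ".scratch", String(context.test_id));
}

// Runs once before the suite: record which eval run prepared the workspace.
export async function beforeAll(context) {
  const marker = path.join(context.workspace_path, ".eval-setup.json");
  await writeFile(marker, JSON.stringify({ eval_run_id: context.eval_run_id }, null, 2));
}

// Runs before each test: start from an empty scratch directory for that test.
export async function beforeEach(context) {
  const dir = scratchDir(context);
  await rm(dir, { recursive: true, force: true });
  await mkdir(dir, { recursive: true });
}

// Runs after each test: remove that test's scratch directory.
export async function afterEach(context) {
  await rm(scratchDir(context), { recursive: true, force: true });
}

// Runs once after the suite: remove the marker file written in beforeAll.
export async function afterAll(context) {
  await rm(path.join(context.workspace_path, ".eval-setup.json"), { force: true });
}
```

Keeping all four hooks in one file lets them share helpers such as `scratchDir`, which is why the configuration above can point every lifecycle stage at the same `setup.mjs`.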
## Workspace Limitations: No GitHub Remote

Workspace-based evals are sandboxed: there is no GitHub remote, no PRs, and no issue tracker. Tests that ask agents to interact with GitHub will fail.
### What to test instead

Test decision-making discipline, not git infrastructure operations:
- Risk classification (“should this change be shipped?”)
- Scope assessment (“does this PR do too much?”)
- Review judgment (“what issues does this diff have?”)
### How to frame prompts

Don't write imperative prompts that require a remote:

```yaml
# BAD: requires GitHub remote
- id: merge-check
  input: "Merge PR #42 if it looks safe"
```

Do frame prompts as hypothetical with inline context:

```yaml
# GOOD: self-contained, no remote needed
- id: merge-check
  input: |
    Here is what PR #42 changes:

    ```diff
    - timeout: 30_000
    + timeout: 5_000
    ```

    The PR description says: "Reduce timeout for faster feedback."
    Should this be shipped? What risks do you see?
```
## Workspace State Consistency: Git Diff Verification
Agents verify `git diff` against prompt claims. If your prompt says "The PR modifies `auth.ts`" but the workspace has no such change, the agent will flag the mismatch. This is **correct agent behavior** — don't try to suppress it.
### Rules
1. If a prompt references specific code changes, the workspace **must** contain those exact changes.
2. Otherwise, frame prompts as hypothetical: describe changes inline rather than claiming they exist in the workspace.
3. Use `before_each` hooks to set up per-test git state when tests need different diffs.
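Rule 1 can be checked before a run. The sketch below is a hypothetical helper, not part of AgentV: it compares the file paths a prompt claims were changed against what `git diff --name-only HEAD` reports for the workspace. The file name `check-claims.mjs` and both function names are assumptions made for illustration.

```javascript
// check-claims.mjs (hypothetical helper, not part of AgentV)
import { execFileSync } from "node:child_process";

// List tracked files in the workspace that differ from HEAD.
// Untracked files are not reported by `git diff`.
function changedFiles(workspacePath) {
  const out = execFileSync("git", ["diff", "--name-only", "HEAD"], {
    cwd: workspacePath,
    encoding: "utf8",
  });
  return out.split("\n").filter(Boolean);
}

// Return the claimed paths that do not appear among the changed files.
function missingClaims(claimedPaths, changedPaths) {
  const changed = new Set(changedPaths);
  return claimedPaths.filter((p) => !changed.has(p));
}

// Example: the prompt claims src/auth.ts changed, but only the config did.
console.log(missingClaims(["src/auth.ts"], ["src/config.ts"])); // prints [ 'src/auth.ts' ]
```

In a real check, pass `changedFiles(workspacePath)` as the second argument. A non-empty result means the prompt and the workspace disagree, which is exactly the mismatch an agent would flag.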
### Example: per-test git state
```yaml
workspace:
  template: ./workspace-template
  hooks:
    before_each:
      command:
        - node
        - ../scripts/apply-test-diff.mjs

tests:
  - id: risky-change
    metadata:
      diff_file: diffs/risky-timeout-change.patch
    input: "Review the current changes and assess risk."
```

The `before_each` hook reads `metadata.diff_file` from the AgentV payload and applies the patch to the workspace before each test runs.
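A minimal sketch of what `apply-test-diff.mjs` might contain follows. Several details are assumptions rather than documented behavior: that the payload arrives as JSON on stdin, that it exposes `workspace_path` and `metadata.diff_file` under those names, and that relative patch paths resolve against the current directory. The function names are hypothetical.

```javascript
// scripts/apply-test-diff.mjs (sketch; payload shape and delivery are assumptions)
import { execFileSync } from "node:child_process";
import { readFileSync } from "node:fs";
import path from "node:path";

// Pull the two values the hook needs out of the payload, failing loudly if absent.
function readPatchTarget(payload) {
  const workspacePath = payload?.workspace_path;
  const diffFile = payload?.metadata?.diff_file;
  if (!workspacePath) throw new Error("payload is missing workspace_path");
  if (!diffFile) throw new Error("payload is missing metadata.diff_file");
  // Assumption: relative patch paths are resolved against the current directory.
  return { workspacePath, patchPath: path.resolve(diffFile) };
}

// Apply the patch inside the workspace with `git apply`.
function applyPatch({ workspacePath, patchPath }) {
  execFileSync("git", ["apply", "--whitespace=nowarn", patchPath], {
    cwd: workspacePath,
    stdio: "inherit",
  });
}

// Run only when invoked as the hook command (checked by file name).
if (path.basename(process.argv[1] ?? "") === "apply-test-diff.mjs") {
  const payload = JSON.parse(readFileSync(0, "utf8")); // fd 0 is stdin
  applyPatch(readPatchTarget(payload));
}
```

Because `execFileSync` throws on a non-zero exit status, a patch that does not apply cleanly stops the hook with an error instead of letting the test run against the wrong workspace state.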
### Hypothetical framing pattern

When you don't want to maintain actual diffs, describe the changes inline:
```yaml
- id: ship-decision
  input: |
    You are reviewing a proposed change. Here is the diff:

    ```diff
    --- a/src/config.ts
    +++ b/src/config.ts
    @@ -10,3 +10,3 @@
    - retries: 3,
    + retries: 0,
    ```

    The author says: "Disable retries to reduce latency."
    Should this be shipped?
```
This avoids workspace state issues entirely — the agent evaluates the diff as presented without checking `git diff`.
## Historical Repo State: Pin the Checkout
If a test asks the agent to inspect how a repository looked at a past commit, declare that checkout in `workspace.repos[]`. Do not rely on prompt prose that mentions a SHA without materializing the repo.
```yaml
workspace:
  repos:
    - path: ./agentv
      repo: https://github.com/EntityProcess/agentv.git
      commit: 5e3c8f46d80fe66b1a75659e4fd94e38a7e09215

tests:
  - id: verification-learning-capture
    input: |
      The eval harness has prepared ./agentv at the historical commit.
      Use that checkout to decide which durable guidance should change.
    expected_output: |
      The durable repo change is to update .agents/verification.md with the
      reusable verification workflow lessons.
    assertions:
      - The answer uses the pinned ./agentv checkout to verify the existing guidance.
      - The answer preserves the historical commit SHA as context.
```