2026-03-30 19:05:50 -06:00

6.6 KiB


name: verify description: Verify that a code change actually does what it's supposed to by running the app and observing behavior. Use when asked to verify a PR, confirm a fix works, test a change manually, check that a feature works, or validate local changes before pushing.

Verification is runtime observation. You build the app, run it, drive it to where the changed code executes, and capture what you see. That capture is your evidence. Nothing else is.

Don't run tests. Don't typecheck. CI ran both before you got here — green checks on the PR mean they passed. Running them again proves you can run CI. Not as a warm-up, not "just to be sure," not as a regression sweep after. The time goes to running the app instead.

Don't import-and-call. import { foo } from './src/...' then console.log(foo(x)) is a unit test you wrote. The function did what the function does — you knew that from reading it. The app never ran. Whatever calls foo in the real codebase ends at a CLI, a socket, or a window. Go there.

Find the change

Establish the full range first — a branch may be many commits:

git log --oneline @{u}..              # count commits
git diff @{u}.. --stat                # full range, not HEAD~1
gh pr diff                            # if in a PR context

State the commit count in your report. Large diff truncating? Redirect: git diff @{u}.. > /tmp/d then Read it. No diff at all → say so, stop.

The diff is ground truth. The PR description is a claim about it. Read both. If they disagree, that's a finding.

Surface

The surface is where a user — human or programmatic — meets the change. That's where you observe.

Change reaches Surface You
CLI / TUI terminal type the command, capture the pane — example
Server / API socket send the request, capture the response — example
GUI pixels drive it under xvfb/Playwright, screenshot
Library package boundary sample code through the public export — import pkg, not import ./src/...
Prompt / agent config the agent run the agent, capture its behavior
CI workflow Actions dispatch it, read the run

Internal function? Not a surface. Something in the repo calls it and that caller ends at one of the rows above. Follow it there. A bash security gate's surface isn't the function's return value — it's the CLI prompting or auto-allowing when you type the command.

No runtime surface at all — docs-only, type declarations with no emit, build config that produces no behavioral diff — report BLOCKED — no runtime surface: (reason). Don't run tests to fill the space.

Get a handle

Check for existing knowledge before cold-starting:

  • .claude/skills/*verifier*/ — if one matches your surface (CLI verifier for a CLI change, etc.), route to it. It knows readiness signals and env gotchas you don't. Mismatched surface → skip that one, try the next. Stale verifier (fails on mechanics unrelated to the change) → ask the user whether to patch it; don't FAIL the change for verifier rot.
  • .claude/skills/run-*/ — knows how to build and launch. Use its primitives as your handle.
  • Neither — cold start from README/package.json/Makefile. Timebox ~15min. Stuck → BLOCKED with exactly where, plus a filled-in /run-skill-generator prompt. Got through → mention /init-verifiers in your report so next time is faster.

Drive it

Smallest path that makes the changed code execute:

  • Changed a flag? Run with it.
  • Changed a handler? Hit that route.
  • Changed error handling? Trigger the error.
  • Changed an internal function? Find the CLI command / request / render that reaches it. Run that.

Read your plan back before running. If every step is build / typecheck / run test file — you've planned a CI rerun, not a verification. Find a step that reaches the surface or report BLOCKED.

Once the claim checks out, keep going: break it (empty input, huge input, interrupt mid-op), combine it (new thing + old thing), wander (what's adjacent? what looked off?). The PR description is what the author intended. Your job includes what they didn't.

End-to-end, through the real interface. Pieces passing in isolation doesn't mean the flow works — seams are where bugs hide. If users click buttons, test by clicking buttons, not by curling the API underneath.

Capture

Stdout, response bodies, screenshots, pane dumps. Captured output is evidence; your memory isn't. Something unexpected? Don't route around it — capture, note, decide if it's the change or the environment. Unrelated breakage is a finding, not noise.

Shared process state (tmux, ports, lockfiles) — isolate. tmux -L name, bind :0, mktemp -d. You share a namespace with your host.

Report

Inline, final message:

## Verification: <one-line what changed>

**Verdict:** PASS | FAIL | BLOCKED

**Claim:** <what it's supposed to do — your read of the diff and/or
the stated claim; note any mismatch>

**Method:** <how you got a handle — which verifier/run-skill, or
cold start; what you launched>

### Steps

Each step is one thing you did to the **running app** and what it
showed. Build/install/checkout are setup, not steps. Test runs and
typecheck don't belong here — they're CI's output.

1. <what you did to the running app> → <what you observed> — ✅/❌
   <evidence: the app's own output — pane capture, response body,
   screenshot path>
2. ...

**Screenshot / sample:** <the one frame a reviewer looks at to see
the feature — image path for GUI/TUI, code block for library/API;
omit for build/types-only>

### Findings
<Claim mismatch, unrelated breakage, env notes, pre-existing bugs
near the change.>

Verdicts:

  • PASS — you ran the app, the change did what it should at its surface. Not: tests pass, builds clean, code looks right.
  • FAIL — you ran it and it doesn't. Or it breaks something else. Or claim and diff disagree materially.
  • BLOCKED — couldn't reach a state where the change is observable, or no runtime surface exists. Not a verdict on the change. Env blocker → say exactly where + /run-skill-generator prompt. No surface → one line why.

No partial pass. "3 of 4 passed" is FAIL until 4 passes or is explained away.

When in doubt, FAIL. False PASS ships broken code; false FAIL costs one more human look. Ambiguous output is FAIL with the raw capture attached — don't interpret.