2026-05-15 18:45:28 -06:00

10 KiB


name: verify description: Verify that a code change actually does what it's supposed to by running the app and observing behavior. Use when asked to verify a PR, confirm a fix works, test a change manually, check that a feature works, or validate local changes before pushing.

Verification is runtime observation. You build the app, run it, drive it to where the changed code executes, and capture what you see. That capture is your evidence. Nothing else is.

Don't run tests. Don't typecheck. Running them here proves you can run CI — not that the change works. Not as a warm-up, not "just to be sure," not as a regression sweep after. The time goes to running the app instead.

Don't import-and-call. import { foo } from './src/...' then console.log(foo(x)) is a unit test you wrote. The function did what the function does — you knew that from reading it. The app never ran. Whatever calls foo in the real codebase ends at a CLI, a socket, or a window. Go there.

Find the change

The scope is what you're verifying — usually a diff, sometimes just "does X work." In a git repo, establish the full range (a branch may be many commits, or the change may still be uncommitted):

git log --oneline @{u}..              # count commits (if upstream set)
git diff @{u}.. --stat                # full range, not HEAD~1
git diff origin/HEAD... --stat        # no upstream: committed vs base
git diff HEAD --stat                  # uncommitted: working tree vs HEAD
gh pr diff                            # if in a PR context

State the commit count. Large diff truncating? Redirect to a file then Read it. Repo but no diff from any of these → say so, stop. No repo → the scope is whatever the user named; ask if they didn't.

The diff is ground truth. Any description is a claim about it. Read both. If they disagree, that's a finding.

Surface

The surface is where a user — human or programmatic — meets the change. That's where you observe.

Change reaches Surface You
CLI / TUI terminal type the command, capture the pane — example
Server / API socket send the request, capture the response — example
GUI pixels drive it under xvfb/Playwright, screenshot
Library package boundary sample code through the public export — import pkg, not import ./src/...
Prompt / agent config the agent run the agent, capture its behavior
CI workflow Actions dispatch it, read the run

Internal function? Not a surface. Something in the repo calls it and that caller ends at one of the rows above. Follow it there. A bash security gate's surface isn't the function's return value — it's the CLI prompting or auto-allowing when you type the command.

No runtime surface at all — docs-only, type declarations with no emit, build config that produces no behavioral diff — report SKIP — no runtime surface: (reason). Don't run tests to fill the space.

Tests in the diff are the author's evidence, not a surface. CI runs them. You'd be re-running CI. Tests-only PR → SKIP, one line. Mixed src+tests → verify the src, ignore the test files. Reading a test to learn what to check is fine — it's a spec. But then go run the app. Checking that assertions match source is code review.

Get a handle

Check .claude/skills/ first — even if you already know how to build and run. A matching verifier-* skill is the repo's evidence-capture protocol: it wraps the session so a reviewer can replay what you saw (recording, screenshots). Drive the surface without it and you get a verdict with no replay.

ls .claude/skills/
  • verifier-* matching your surface (CLI verifier for a CLI change, etc.) → invoke it with the Skill tool and follow its setup. Mismatched surface → skip that one, try the next. Stale verifier (fails on mechanics unrelated to the change) → ask the user whether to patch it; don't FAIL the change for verifier rot.
  • run-* but no matching verifier → use its build/launch primitives as your handle.
  • Neither → cold start from README/package.json/Makefile. Timebox ~15min. Stuck → BLOCKED with exactly where, plus a filled-in /run-skill-generator prompt. Got through → note the working build/launch recipe so it can become a verifier-* skill.

Drive it

Smallest path that makes the changed code execute:

  • Changed a flag? Run with it.
  • Changed a handler? Hit that route.
  • Changed error handling? Trigger the error.
  • Changed an internal function? Find the CLI command / request / render that reaches it. Run that.

Read your plan back before running. If every step is build / typecheck / run test file — you've planned a CI rerun, not a verification. Find a step that reaches the surface or report BLOCKED.

The verdict is table stakes. Your observations are the signal. A PASS with three sharp "hey, I noticed…" lines is worth more than a bare PASS. You're the only reviewer who actually ran the thing — anything that made you pause, work around, or go "huh" is information the author doesn't have. Don't filter for "is this a bug." Filter for "would I mention this if they were sitting next to me."

End-to-end, through the real interface. Pieces passing in isolation doesn't mean the flow works — seams are where bugs hide. If users click buttons, test by clicking buttons, not by curling the API underneath.

Destructive path? If the change touches code that deletes, publishes, sends, or writes outside the workspace and there's no dry-run or safe target, don't drive it live. Verify what you can around it and say which path you didn't exercise and why.

Push on it

The claim checked out — that's the first half. Confirming is step one, not the job. The description is what the author intended; your value is what they didn't.

You know exactly what changed. Probe around it, at the same surface you just drove:

  • New flag / option → empty value, passed twice, combined with a conflicting flag, typo'd (does the error name it?)
  • New handler / route → wrong method, malformed body, missing required field, oversized payload
  • Changed error path → the adjacent errors it didn't touch — did the refactor catch them too, or only the one in the diff?
  • Interactive / TUI → Ctrl-C mid-op, resize the pane, paste garbage, rapid-fire the key, Esc at the wrong moment
  • State / persistence → do it twice, do it with stale state underneath, do it in two sessions at once
  • Wander → what's adjacent? What looked off while you were confirming? Go back to it.

These aren't a checklist — pick the ones the change points at. Stop when you've covered the obvious adjacents or hit something worth a ⚠️. A probe that finds nothing is still a step: "🔍 passed --from '' → clean error: --from requires a value, exit 2." That the author didn't test it is exactly why it's worth knowing it holds.

Still not a test run. You're at the surface, typing what a user would type wrong.

Capture

Stdout, response bodies, screenshots, pane dumps. Captured output is evidence; your memory isn't. Something unexpected? Don't route around it — capture, note, decide if it's the change or the environment. Unrelated breakage is a finding, not noise.

Shared process state (tmux, ports, lockfiles) — isolate. tmux -L name, bind :0, mktemp -d. You share a namespace with your host.

Report

Inline, final message:

## Verification: <one-line what changed>

**Verdict:** PASS | FAIL | BLOCKED | SKIP

**Claim:** <what it's supposed to do — your read of the diff and/or
the stated claim; note any mismatch>

**Method:** <how you got a handle — which verifier/run-skill, or
cold start; what you launched>

### Steps

Each step is one thing you did to the **running app** and what it
showed. Build/install/checkout are setup, not steps. Test runs and
typecheck don't belong here — they're CI's output.

1. ✅/❌/⚠️/🔍 <what you did to the running app> → <what you observed>
   <evidence: the app's own output — pane capture, response body,
   screenshot path>

🔍 marks a probe — a step off the claim's happy path, trying to
break it. At least one. A Steps list that's all ✅ and no 🔍 is a
happy-path replay: still PASS, but you stopped at the first half.

**Screenshot / sample:** <the one frame a reviewer looks at to see
the feature — image path for GUI/TUI, code block for library/API;
omit for build/types-only>

### Findings
<Things you noticed. Not just bugs — friction, surprises, anything
a first-time user would trip on. "Took three tries to find the right
flag." "Error message on typo was unhelpful." "Default seems odd for
the common case." "Works, but slower than I expected." Lower the bar:
if it made you pause, it goes here. But the pause has to be yours,
from running the app — not from reading the PR page. A red CI check,
a review comment, someone else's bot: visible to anyone already, and
you relaying it isn't an observation. Claim/diff mismatch, pre-existing
breakage, and env notes also belong.

Each probe gets a line here even when it held — "🔍 empty `--from`
→ clean error" tells the author what *was* covered, which they
can't see from a bare PASS.

Lead with ⚠️ for lines worth interrupting the reviewer for; plain
bullets are context. Empty is fine if nothing stuck out — but nothing
sticking out is itself rare.>

Verdicts:

  • PASS — you ran the app, the change did what it should at its surface. Not: tests pass, builds clean, code looks right.
  • FAIL — you ran it and it doesn't. Or it breaks something else. Or claim and diff disagree materially.
  • BLOCKED — couldn't reach a state where the change is observable. Build broke, env missing a dep, handle wouldn't come up. Not a verdict on the change. Say exactly where it stopped + /run-skill-generator prompt.
  • SKIP — no runtime surface exists. Docs-only, types-only, tests-only. Nothing went wrong; there's just nothing here to run. One line why.

No partial pass. "3 of 4 passed" is FAIL until 4 passes or is explained away.

When in doubt, FAIL. False PASS ships broken code; false FAIL costs one more human look. Ambiguous output is FAIL with the raw capture attached — don't interpret.