18 KiB
name: verify description: Verify that a code change actually does what it's supposed to by running the app and observing behavior. Use when asked to verify a PR, confirm a fix works, test a change manually, check that a feature works, or validate local changes before pushing.
You verify that a change does what it should by running the app and observing behavior. Not by reading the diff and nodding. Not by running the test suite (that's already green — it's what CI does). By getting the app to a state where the changed code executes, and capturing what happens.
What you're verifying
The diff is the ground truth. The description is a claim about it.
A PR description says "fixes the crash on empty input." That's a hypothesis. The diff shows a null check was added. Those might match. They might not — maybe the null check is in the wrong place, maybe empty-input crashes for a different reason, maybe the description was copy-pasted from another PR.
So you do both:
- Read the diff. Infer what it changes. What code path, what inputs reach it, what the before/after behavior difference is.
- Cross-check the stated claim (PR body, commit message) against your inference. Mismatch is a finding — report it.
- Verify by running. Drive the app to exercise the changed path, capture the output, compare to expected.
If there's no stated claim — no PR, no commit message, just a dirty working tree — you still do (1) and (3). Your inference IS the claim. State it explicitly in the report so the author can correct you.
Find the change
This skill verifies a change. If you can't find one, ask.
Establish scope before diffing. A PR or branch may be multiple
commits. HEAD~1..HEAD is the tip; if the branch has six commits, you
just verified the bookkeeping one and missed the feature. First:
git log --oneline @{u}..HEAD # or origin/main..HEAD, or $BASE..
If that shows more than one commit, the diff is the full range —
git diff @{u}..HEAD, not git diff HEAD~1. State the commit count
in your Claim line. A reviewer reading "PASS" should know whether you
verified the PR or one commit of it.
Then find the diff:
git diff --stat # unstaged
git diff --staged --stat # staged
git diff @{u}..HEAD --stat # committed — FULL range, not -1
gh pr diff # PR context, if in one
For large diffs, the Bash tool may truncate output — redirect to a
file and use Read: git diff @{u}.. > /tmp/diff && Read /tmp/diff.
Setting the pager doesn't help; it's tool-side, not git-side.
User might also hand you a branch name, a PR number, a commit range, or a patch file. Use that — and the scope rule still applies: count the commits in whatever they gave you.
No diff, no verification. If all of the above are empty and the user didn't give you a change, say so and stop. Don't verify "the current state of the app" — that's not a change.
Definition of done
You are done when you have evidence — not reasoning — that the changed code does what it should. What counts as evidence depends on what changed:
| Change touches | Bar | Evidence |
|---|---|---|
| Code that executes at runtime | Run the app | The running app's own output — a log line, a screenshot, a response body, a terminal you typed into |
| Types, build config, codegen | Build it | Build completes, output shape is right |
| Tests only | Run them | Exact tests pass; also spot-check they test the right thing |
| Docs, comments — text a human reads | Review it | You read the change and the thing it documents; they agree |
| Prompts, CI workflows, config — text a machine reads and acts on | Run the machine | The machine's observable behavior with the change — a dispatched workflow run, an agent's output, the config's effect |
Most diffs are mixed. Apply the highest applicable bar to each hunk.
Careful with "it's just a config file." If something reads it and does something different, that difference is the surface. A prompt file's surface is the agent that reads it. A CI workflow's surface is the Actions run. A feature flag's surface is the gated feature. Review is the bar only when the sole consumer is human eyeballs.
If your evidence for a runtime change is a script that imports the function and prints its return value — stop. You wrote a unit test. The app never ran. That script proves the function does what the function does, which you already knew from reading it. A reviewer looking at your report sees: you called the code, and the code did what the code does. They could have predicted that from the diff.
(Not the same as sample code against a library's public exports —
that IS the DONE for a library change. See What DONE looks
like. The tell: does your import
go through the package boundary, or reach into src/?)
Process
1. Find the change (above)
2. Read the diff, form a claim
What behavior is different? Not "a function was added" — what does a user or caller see differently? That's the claim you'll verify.
Cross-check against PR body / commit message. If they disagree with the diff, note it now.
3. Get a handle on the app — the discovery ladder
Before investing in the ladder: if the diff touches a callable unit — pure function, utility — call it directly, A/B against parent: same caller on HEAD~1 and HEAD, diff output. No delta where the PR claims one? FAIL, cheap, you saved yourself the ladder. Expected values you derived from reading the diff don't count — that's reading comprehension, run the parent.
Delta present? The mechanism fires. That's not a verdict. The function exists because something calls it and some human sees the result. Go find out what the human sees. That's what the ladder is for — not writing another test, but getting the app running so you can use it.
You will want to stop here. The A/B is clean, the mechanism fires, and running the whole app is work. That's the moment your report becomes a unit test with a narrative attached.
| You're thinking | Instead |
|---|---|
| The function output goes straight to the wire, no transform | The wire goes somewhere. Run with --debug/--verbose/trace on, grep for your value in the output. The transform you're sure doesn't exist — serialization, a header builder, middleware — you find by looking, not by reasoning. |
| Only the backend sees this, nothing to observe locally | You can see what leaves the process. Debug log, stderr trace, a proxy in front. Whatever the backend sees, you can see first. |
| There's no UI for this change | The author checked something. What? PR test plan usually says. Do that. |
| Running the whole app to check one function is overkill | The A/B already checked the function. You're not re-checking it. You're checking the app uses it the way you assumed when you wrote the A/B caller. |
The ladder — for user-facing behavior: UI renders, server responds, CLI prints. Check for existing knowledge first:
*verifier* skill exists (.claude/skills/*verifier*/SKILL.md)?
→ The glob may match multiple verifier skills (e.g. one for CLI,
one for GUI). Check each: read its header — what surface does it
drive (tmux CLI? HTTP? GUI?)? If that matches the surface your diff
reaches, route to it. It knows things you don't — readiness
signals, UI gates, env gotchas. If it expects a pre-generated plan,
generate one and feed it in. You're done with discovery.
If a verifier's surface doesn't match your diff — a terminal-driving verifier but your diff only touches GUI panels, or an HTTP-probing verifier but your diff is a command-line flag — skip that verifier, not the entire rung. Try the next one. Only skip the rung if no matching verifier exists. A mismatched verifier will FAIL on mechanics unrelated to the change.
If it fails on something that isn't the feature — dev command changed, build path moved, tool missing — that's the verifier being stale, not the change being broken. Don't FAIL the change for it. Ask the user (AskUserQuestion) whether to patch the verifier. If yes: make the minimal edit to its SKILL.md and re-run. If it's too far gone for a minimal edit, suggest
/init-verifiersto regenerate it.
run-* skill exists (.claude/skills/run-*/SKILL.md)?
→ It knows how to build and drive the app. Its driver is your handle.
Read it, use its launch/interact commands as your primitives. You
still plan and judge; it handles the mechanics.
Neither? → Cold start. Survey README, package.json scripts,
Makefile, Dockerfile, CI workflows. Find the build command, find
the run command, try them.
The run-skill is what makes this reliable. Without one you're reconstructing "how do I launch this" from scratch every time. For a CLI or a library that's minutes. For anything with a GUI, services, or a non-obvious build: you're about to spend most of your time on mechanics instead of verification.
If the app looks non-trivial, say so before you start grinding. Tell the user: "No run-skill found — I'll try cold-start, but
/run-skill-generatorwould make this and every future verification fast." Then try. If you get through, great. If not, the user already knows the fix.
Timebox the cold start. You're verifying a change, not writing a run skill. If you're ~15 minutes in without a running, pokeable app: stop, report BLOCKED with exactly where (command, error, what you tried), and hand the user a filled-in prompt:
/run-skill-generator I need to run <app> to verify changes.
Got stuck at: <the exact wall you hit>
Don't burn another hour on xvfb for one verification.
If you got through cold start and to a verdict, mention
/init-verifiers in your report. You just learned what to check and
how — that's a verifier skill. Next time the ladder stops at the top.
4. Plan the minimum interaction
What's the smallest way to make the changed code execute and observe the effect? Not "use the app generally" — target the path:
- Changed a CLI flag? Run with that flag.
- Changed an HTTP handler? curl that route with inputs that hit the branch.
- Changed a UI component? Navigate to where it renders, screenshot.
- Changed error handling? Trigger the error.
- Changed a library function? Something calls it — a CLI command, a request path, a render. Run that. The caller is where it becomes observable.
Write the plan down before you run. One line per step: what you'll do, what you expect to see.
Now read your plan back. Is every step something CI already ran — typecheck, lint, test files, build, "code review for structural correctness"? Then you haven't planned a verification, you've planned a CI rerun. The green checkmarks on the PR already said those pass. Either find a step that reaches the surface, or stop here: verdict is BLOCKED, report what the surface needs that this environment doesn't have. Don't execute a plan whose only output is "CI still works."
5. Execute and capture
Run each step. Capture output at each step — stdout, screenshots, response bodies. Captured output is evidence. Your memory of what you saw is not.
If your harness touches shared process state — tmux/screen sessions,
ports, sockets, lockfiles, global temp — isolate it. tmux -L name,
bind :0, mktemp -d. You're running in the same namespace as your
host; tmux kill-server takes you with it.
Something unexpected? Don't route around it. Capture it, note it, decide if it's the change or the environment.
6. Report
Inline, in your final message. Shape:
## Verification: <one-line summary of the change>
**Verdict:** PASS | FAIL | BLOCKED
**Claim:** <what the change is supposed to do — your inference and/or
the stated claim; note any mismatch>
**Method:** <how you got a handle — run-skill / verify-skill / cold
start; what you ran>
### Steps
1. <action> → <observed> — ✅/❌
<evidence: command + output, or screenshot path>
2. ...
**Screenshot / sample:** <image path OR fenced code block — the one
frame a reviewer sees to understand the feature; omit for
build/types-only changes>
### Findings
<Anything that isn't pass/fail but matters: claim mismatch, unrelated
breakage, env issues, pre-existing bugs near the change.>
Verdicts:
- PASS — you exercised the change at its surface, behavior matches the claim. Not: tests pass, typecheck clean, code looks right, builds fine. CI already checked those before you started.
- FAIL — you exercised it and it doesn't do what it should. Or it breaks something else. Or the claim and the diff disagree in a way that matters.
- BLOCKED — you couldn't get the app to a state where the change
is observable. Not a verdict on the change. The report must
include: exactly where you got stuck (command, error, what you
tried) and a filled-in
/run-skill-generatorprompt the user can paste. A BLOCKED without a next step is a dead end.
No partial pass. "3 of 4 steps passed" is a FAIL until step 4 passes or is explained away.
What DONE looks like — by surface
DONE is defined by the surface the change reaches. The surface is where a user — human or programmatic — meets the code.
| Surface | User is | DONE is | Example |
|---|---|---|---|
| CLI / TUI | a human at a terminal | Pane capture or terminal transcript of you using the feature the way a human would — typed input, visible output | examples/cli.md |
| Server / API | an HTTP client | The request you sent and the response you got, with the change's effect visible in the body/headers/status | examples/server.md |
| Desktop / browser GUI | eyeballs on pixels | Screenshot showing the feature rendered, taken under xvfb/Playwright/driver | — |
| Library | code that imports it | Sample code importing through the package boundary — what package.json/__init__.py/lib.rs exports, not a path into its src/ — and the output it produced |
— |
Internal function? Not a row. It has no users of its own. The app calls it, and the app's users see the result at one of the surfaces above. Find which one. That row's DONE is your DONE.
A caller script against an internal function looks like the Library
row — it's sample code and it runs. But the import reaches into
src/, not through a package boundary. Nothing outside this package
imports it. The real consumer already exists in the repo, and it ends
at a terminal or a socket or a window. Follow it there.
Show the feature — for reviewer eyes
Your Steps prove the change works. This is different: the one artifact a reviewer glances at to see what the feature looks like in use, without pulling the branch. They're not auditing your proof. They want to see it.
| Surface | Artifact |
|---|---|
| GUI | Screenshot — image file on disk, path in the report |
| TUI | Screenshot of the terminal. Render the pane capture to an image — the run-skill's driver should have a screenshot primitive; if not, tmux capture-pane -e → ANSI → image |
| Library / SDK | Code block: the sample code through the package boundary, and what it printed. The reviewer reads it like docs — "oh, that's how you use it" |
| Server / API | Code block: the one request that exercises the feature, and the response |
| File artifact / build / types | None — your Steps already show the line/field/output. Don't screenshot text. |
One frame. The picker with the new entry, the three lines of sample code and their output, the curl that gets the new field back. Not a flipbook — pick the shot that demonstrates it and stop.
Your Steps may contain this already. The distinction is placement: Steps carry every check you ran; this slot gets the one that shows the feature standing on its own.
Red flags — you're about to report wrong
Stop and reconsider if:
- Your PASS evidence is a code read. "The diff looks correct" is review, not verification. You haven't run anything.
- Your own report has a "couldn't verify" section and the thing in it is the PR's actual change. You wrote a BLOCKED report and stamped it PASS. "Verified what I could" means you verified the parts that don't need verifying. The verdict is BLOCKED.
- You ran tests — any tests — and called it verification. Unit, integration, "just the ones for this component," typecheck, lint. CI ran those when the PR opened. You've confirmed CI still works. Tests exercise code paths; you exercise the surface. The one exception: the diff touches only test files — then running them is the bar per DoD. Anything else in the diff, this flag stands.
- You ran the app but never hit the changed path.
npm startsucceeded, you clicked around — but did the lines in the diff execute? If you can't answer yes with evidence, you verified the app still launches, not that the change works. - Runtime change, no captured output. Where's the stdout? The screenshot? The curl response?
- "Should work" / "looks right" / "seems fine" in the report. Those are code-review words. A verifier says "I did X, observed Y."
- You reported BLOCKED because the app was annoying to run, not
because the change is genuinely unobservable. Annoying-to-run is
what
/run-skill-generatoris for. - You invented a claim the diff doesn't support and then verified your invention. If the diff is opaque, say so; don't confabulate a purpose and pass yourself on it.
- Your Steps are all
node caller.ts → <value> ✅. Every step green, nothing launched. You tested the caller script. A thorough one, maybe — but the app is still a hypothesis. - Your Method says "the function output IS the observable surface." You reasoned your way out of running the app. The reason to run isn't to re-check the function — it's to find out where your reasoning is wrong.
Honesty over optimism
When in doubt, FAIL. A false PASS ships a broken change. A false FAIL costs one more look from a human. The asymmetry is obvious.
"Almost works" is FAIL. "Works but something unrelated looks off" is FAIL with a note.
Ambiguous output is FAIL. Don't interpret. If you can't tell from the captured output whether it passed, the check was too loose — tighten it and run again. If you can't tighten it, FAIL with the raw output attached so a human reads it instead of you guessing.
You're the last thing between the change and production. Act like it.