enhance: enforce mandatory per-task QA scenarios and add Final Verification Wave

Strengthen TODO template to make QA scenarios non-optional with explicit
rejection warning. Add Final Verification Wave with 4 parallel review
agents: oracle (plan compliance audit), unspecified-high (code quality),
unspecified-high (real manual QA), deep (scope fidelity check) — each
with detailed verification steps and structured output format.
YeonGyu-Kim 2026-02-16 15:19:31 +09:00
parent 7e6982c8d8
commit 130aaaf910


@@ -216,7 +216,13 @@ Wave 4 (After Wave 3 — verification):
Task 23: E2E QA (depends: 21) [deep]
Task 24: Git cleanup + tagging (depends: 21) [git]
Critical Path: Task 1 → Task 5 → Task 8 → Task 11 → Task 15 → Task 21
Wave FINAL (After ALL tasks - independent review, 4 parallel):
Task F1: Plan compliance audit (oracle)
Task F2: Code quality review (unspecified-high)
Task F3: Real manual QA (unspecified-high)
Task F4: Scope fidelity check (deep)
Critical Path: Task 1 → Task 5 → Task 8 → Task 11 → Task 15 → Task 21 → F1-F4
Parallel Speedup: ~70% faster than sequential
Max Concurrent: 7 (Waves 1 & 2)
\`\`\`
@@ -242,13 +248,15 @@ Max Concurrent: 7 (Waves 1 & 2)
| 2 | **7** | T8 → \`deep\`, T9 → \`unspecified-high\`, T10 → \`unspecified-high\`, T11 → \`deep\`, T12 → \`visual-engineering\`, T13 → \`quick\`, T14 → \`unspecified-high\` |
| 3 | **6** | T15 → \`deep\`, T16 → \`visual-engineering\`, T17-T19 → \`quick\`, T20 → \`visual-engineering\` |
| 4 | **4** | T21 → \`deep\`, T22 → \`unspecified-high\`, T23 → \`deep\`, T24 → \`git\` |
| FINAL | **4** | F1 → \`oracle\`, F2 → \`unspecified-high\`, F3 → \`unspecified-high\`, F4 → \`deep\` |
---
## TODOs
> Implementation + Test = ONE Task. Never separate.
> EVERY task MUST have: Recommended Agent Profile + Parallelization info.
> EVERY task MUST have: Recommended Agent Profile + Parallelization info + QA Scenarios.
> **A task WITHOUT QA Scenarios is INCOMPLETE. No exceptions.**
- [ ] 1. [Task Title]
@@ -282,22 +290,15 @@ Max Concurrent: 7 (Waves 1 & 2)
**Pattern References** (existing code to follow):
- \`src/services/auth.ts:45-78\` - Authentication flow pattern (JWT creation, refresh token handling)
- \`src/hooks/useForm.ts:12-34\` - Form validation pattern (Zod schema + react-hook-form integration)
**API/Type References** (contracts to implement against):
- \`src/types/user.ts:UserDTO\` - Response shape for user endpoints
- \`src/api/schema.ts:createUserSchema\` - Request validation schema
**Test References** (testing patterns to follow):
- \`src/__tests__/auth.test.ts:describe("login")\` - Test structure and mocking patterns
**Documentation References** (specs and requirements):
- \`docs/api-spec.md#authentication\` - API contract details
- \`ARCHITECTURE.md:Database Layer\` - Database access patterns
**External References** (libraries and frameworks):
- Official docs: \`https://zod.dev/?id=basic-usage\` - Zod validation syntax
- Example repo: \`github.com/example/project/src/auth\` - Reference implementation
**WHY Each Reference Matters** (explain the relevance):
- Don't just list files - explain what pattern/information the executor should extract
@@ -308,113 +309,53 @@ Max Concurrent: 7 (Waves 1 & 2)
> **AGENT-EXECUTABLE VERIFICATION ONLY.** No human action permitted.
> Every criterion MUST be verifiable by running a command or using a tool.
> REPLACE all placeholders with actual values from task context.
**If TDD (tests enabled):**
- [ ] Test file created: src/auth/login.test.ts
- [ ] Test covers: successful login returns JWT token
- [ ] bun test src/auth/login.test.ts → PASS (3 tests, 0 failures)
**Agent-Executed QA Scenarios (MANDATORY, per-scenario, ultra-detailed):**
**QA Scenarios (MANDATORY - task is INCOMPLETE without these):**
> Write MULTIPLE named scenarios per task: happy path AND failure cases.
> Each scenario = exact tool + steps with real selectors/data + evidence path.
**Example Frontend/UI (Playwright):**
> **This is NOT optional. A task without QA scenarios WILL BE REJECTED.**
>
> Write scenario tests that verify the ACTUAL BEHAVIOR of what you built.
> Minimum: 1 happy path + 1 failure/edge case per task.
> Each scenario = exact tool + exact steps + exact assertions + evidence path.
>
> **The executing agent MUST run these scenarios after implementation.**
> **The orchestrator WILL verify evidence files exist before marking task complete.**
\\\`\\\`\\\`
Scenario: Successful login redirects to dashboard
Tool: Playwright (playwright skill)
Preconditions: Dev server running on localhost:3000, test user exists
Scenario: [Happy path - what SHOULD work]
Tool: [Playwright / interactive_bash / Bash (curl)]
Preconditions: [Exact setup state]
Steps:
1. Navigate to: http://localhost:3000/login
2. Wait for: input[name="email"] visible (timeout: 5s)
3. Fill: input[name="email"] "test@example.com"
4. Fill: input[name="password"] "ValidPass123!"
5. Click: button[type="submit"]
6. Wait for: navigation to /dashboard (timeout: 10s)
7. Assert: h1 text contains "Welcome back"
8. Assert: cookie "session_token" exists
9. Screenshot: .sisyphus/evidence/task-1-login-success.png
Expected Result: Dashboard loads with welcome message
Evidence: .sisyphus/evidence/task-1-login-success.png
1. [Exact action - specific command/selector/endpoint, no vagueness]
2. [Next action with expected intermediate state]
3. [Assertion - exact expected value, not "verify it works"]
Expected Result: [Concrete, observable, binary pass/fail]
Failure Indicators: [What specifically would mean this failed]
Evidence: .sisyphus/evidence/task-{N}-{scenario-slug}.{ext}
Scenario: Login fails with invalid credentials
Tool: Playwright (playwright skill)
Preconditions: Dev server running, no valid user with these credentials
Scenario: [Failure/edge case - what SHOULD fail gracefully]
Tool: [same format]
Preconditions: [Invalid input / missing dependency / error state]
Steps:
1. Navigate to: http://localhost:3000/login
2. Fill: input[name="email"] "wrong@example.com"
3. Fill: input[name="password"] "WrongPass"
4. Click: button[type="submit"]
5. Wait for: .error-message visible (timeout: 5s)
6. Assert: .error-message text contains "Invalid credentials"
7. Assert: URL is still /login (no redirect)
8. Screenshot: .sisyphus/evidence/task-1-login-failure.png
Expected Result: Error message shown, stays on login page
Evidence: .sisyphus/evidence/task-1-login-failure.png
1. [Trigger the error condition]
2. [Assert error is handled correctly]
Expected Result: [Graceful failure with correct error message/code]
Evidence: .sisyphus/evidence/task-{N}-{scenario-slug}-error.{ext}
\\\`\\\`\\\`
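The evidence gate above ("the orchestrator WILL verify evidence files exist") could be mechanized along these lines; `check_evidence` is a hypothetical helper, and a throwaway temp directory stands in for `.sisyphus/evidence/`:

```shell
#!/bin/sh
# Hypothetical gate: refuse to mark a task complete if any expected
# evidence file is missing or empty. File names follow the
# task-{N}-{scenario-slug}.{ext} convention from the template above.
check_evidence() {
  dir="$1"; shift
  missing=0
  for f in "$@"; do
    if [ ! -s "$dir/$f" ]; then   # -s: exists AND is non-empty
      echo "MISSING EVIDENCE: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demo against a throwaway directory.
dir=$(mktemp -d)
printf 'fake png bytes' > "$dir/task-1-login-success.png"

check_evidence "$dir" task-1-login-success.png && echo "task 1: evidence OK"
check_evidence "$dir" task-1-login-failure.png || echo "task 1: INCOMPLETE"
rm -rf "$dir"
```

The same gate generalizes to one call per task, driven by the evidence paths listed in each scenario.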
**Example API/Backend (curl):**
\\\`\\\`\\\`
Scenario: Create user returns 201 with UUID
Tool: Bash (curl)
Preconditions: Server running on localhost:8080
Steps:
1. curl -s -w "\\n%{http_code}" -X POST http://localhost:8080/api/users \\
-H "Content-Type: application/json" \\
-d '{"email":"new@test.com","name":"Test User"}'
2. Assert: HTTP status is 201
3. Assert: response.id matches UUID format
4. GET /api/users/{returned-id} → Assert name equals "Test User"
Expected Result: User created and retrievable
Evidence: Response bodies captured
Scenario: Duplicate email returns 409
Tool: Bash (curl)
Preconditions: User with email "new@test.com" already exists
Steps:
1. Repeat POST with same email
2. Assert: HTTP status is 409
3. Assert: response.error contains "already exists"
Expected Result: Conflict error returned
Evidence: Response body captured
\\\`\\\`\\\`
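To show how such a capture could be asserted on mechanically, here is a sketch that checks the status line and UUID shape of a `curl -s -w "\n%{http_code}"` response; the captured body below is a synthetic stand-in, not real server output:

```shell
#!/bin/sh
# The capture format matches the example above: response body first,
# then the HTTP status code alone on the last line.
captured='{"id":"3f2a9c1e-8b4d-4e6a-9c0f-1a2b3c4d5e6f","email":"new@test.com"}
201'

status=$(printf '%s\n' "$captured" | tail -n 1)
body=$(printf '%s\n' "$captured" | sed '$d')    # everything but the status line

[ "$status" = "201" ] || { echo "FAIL: expected 201, got $status"; exit 1; }

# UUID-shaped id check (8-4-4-4-12 hex groups); avoids a jq dependency.
printf '%s' "$body" |
  grep -Eq '"id":"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"' ||
  { echo "FAIL: id is not UUID-shaped"; exit 1; }

echo "PASS: 201 with UUID id"
```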
**Example TUI/CLI (interactive_bash):**
\\\`\\\`\\\`
Scenario: CLI loads config and displays menu
Tool: interactive_bash (tmux)
Preconditions: Binary built, test config at ./test.yaml
Steps:
1. tmux new-session: ./my-cli --config test.yaml
2. Wait for: "Configuration loaded" in output (timeout: 5s)
3. Assert: Menu items visible ("1. Create", "2. List", "3. Exit")
4. Send keys: "3" then Enter
5. Assert: "Goodbye" in output
6. Assert: Process exited with code 0
Expected Result: CLI starts, shows menu, exits cleanly
Evidence: Terminal output captured
Scenario: CLI handles missing config gracefully
Tool: interactive_bash (tmux)
Preconditions: No config file at ./nonexistent.yaml
Steps:
1. tmux new-session: ./my-cli --config nonexistent.yaml
2. Wait for: output (timeout: 3s)
3. Assert: stderr contains "Config file not found"
4. Assert: Process exited with code 1
Expected Result: Meaningful error, non-zero exit
Evidence: Error output captured
\\\`\\\`\\\`
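Outside tmux, the missing-config case can be smoke-tested by stubbing the binary; `my_cli` below is an inline stand-in for the real CLI (messages copied from the scenario), not its actual implementation:

```shell
#!/bin/sh
# Stub standing in for ./my-cli: errors out when the config is absent.
my_cli() {
  # $1 is expected to be --config, $2 the path
  if [ ! -f "$2" ]; then
    echo "Config file not found: $2" >&2
    return 1
  fi
  echo "Configuration loaded"
}

# Capture stderr only, discard stdout, keep the exit code.
err=$(my_cli --config nonexistent.yaml 2>&1 >/dev/null)
code=$?

[ "$code" -eq 1 ] || { echo "FAIL: expected exit 1, got $code"; exit 1; }
case "$err" in
  *"Config file not found"*) echo "PASS: meaningful error, non-zero exit" ;;
  *) echo "FAIL: wrong error message: $err"; exit 1 ;;
esac
```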
> **Anti-patterns (your scenario is INVALID if it looks like this):**
> - "Verify it works correctly" HOW? What does "correctly" mean?
> - "Check the API returns data" WHAT data? What fields? What values?
> - "Test the component renders" WHERE? What selector? What content?
> - Any scenario without an evidence path
**Evidence to Capture:**
- [ ] Screenshots in .sisyphus/evidence/ for UI scenarios
- [ ] Terminal output for CLI/TUI scenarios
- [ ] Response bodies for API scenarios
- [ ] Each evidence file named: task-{N}-{scenario-slug}.{ext}
- [ ] Screenshots for UI, terminal output for CLI, response bodies for API
**Commit**: YES | NO (groups with N)
- Message: \`type(scope): desc\`
@@ -423,6 +364,158 @@ Max Concurrent: 7 (Waves 1 & 2)
---
## Final Verification Wave (MANDATORY after ALL implementation tasks)
> **ALL 4 review agents run in PARALLEL after every implementation task is complete.**
> **ALL 4 must APPROVE before the plan is considered done.**
> **If ANY agent rejects, fix issues and re-run the rejecting agent(s).**
- [ ] F1. Plan Compliance Audit
**Agent**: oracle (read-only consultation)
**What this agent does**:
Read the original work plan (.sisyphus/plans/{name}.md) and verify EVERY requirement was fulfilled.
**Exact verification steps**:
1. Read the plan file end-to-end
2. For EACH item in "Must Have": verify the implementation exists and works
- Run the verification command listed in "Definition of Done"
- Check the file/endpoint/feature actually exists (read the file, curl the endpoint)
3. For EACH item in "Must NOT Have": verify it was NOT implemented
- Search codebase for forbidden patterns (grep, ast_grep_search)
- If found → REJECT with specific file:line reference
4. For EACH TODO task: verify acceptance criteria were met
- Check evidence files exist in .sisyphus/evidence/
- Verify test results match expected outcomes
5. Compare final deliverables against "Concrete Deliverables" list
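Step 3's search for forbidden patterns might be run roughly like this; the sample tree and the `as any` guardrail are illustrative only, not taken from a real plan:

```shell
#!/bin/sh
# Grep a (synthetic) source tree for a pattern the plan forbids and
# surface file:line references for the rejection report.
src=$(mktemp -d)
cat > "$src/user.ts" <<'EOF'
export function load(raw: unknown) {
  return raw as any; // forbidden by this hypothetical plan
}
EOF

if grep -rn --include='*.ts' 'as any' "$src"; then
  echo "VERDICT: REJECT (forbidden pattern found above)"
else
  echo "VERDICT: APPROVE"
fi
rm -rf "$src"
```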
**Output format**:
\\\`\\\`\\\`
## Plan Compliance Report
### Must Have: [N/N passed]
- [✓/✗] [requirement]: [evidence]
### Must NOT Have: [N/N clean]
- [✓/✗] [guardrail]: [evidence]
### Task Completion: [N/N verified]
- [✓/✗] Task N: [criteria status]
### VERDICT: APPROVE / REJECT
### Rejection Reasons (if any): [specific issues]
\\\`\\\`\\\`
- [ ] F2. Code Quality Review
**Agent**: unspecified-high
**What this agent does**:
Review ALL changed/created files for production readiness. This is NOT a rubber stamp.
**Exact verification steps**:
1. Run full type check: \`bunx tsc --noEmit\` (or project equivalent) → must exit 0
2. Run linter if configured: \`bunx biome check .\` / \`bunx eslint .\` → must pass
3. Run full test suite: \`bun test\` → all tests pass, zero failures
4. For EACH new/modified file, check:
- No \`as any\`, \`@ts-ignore\`, \`@ts-expect-error\`
- No empty catch blocks \`catch(e) {}\`
- No console.log left in production code (unless intentional logging)
- No commented-out code blocks
- No TODO/FIXME/HACK comments without linked issue
- Consistent naming with existing codebase conventions
- Imports are clean (no unused imports)
5. Check for AI slop patterns:
- Excessive inline comments explaining obvious code
- Over-abstraction (unnecessary wrapper functions)
- Generic variable names (data, result, item, temp)
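The per-file checks in step 4 could be mechanized as a first pass with greps like these; the patterns are rough approximations a reviewer would tune per project, and the sample file is synthetic:

```shell
#!/bin/sh
# Count slop/debt markers in one sample file; file:line hits are printed
# by grep -n so the reviewer can jump straight to them.
f=$(mktemp)
cat > "$f" <<'EOF'
// @ts-ignore
try { risky(); } catch (e) {}
console.log("done");
EOF

issues=0
grep -En '@ts-ignore|@ts-expect-error|as any' "$f" && issues=$((issues + 1))
grep -En 'catch ?\([a-z]*\) ?\{ ?\}'          "$f" && issues=$((issues + 1))
grep -En 'console\.log'                       "$f" && issues=$((issues + 1))

echo "issues found: $issues"
rm -f "$f"
```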
**Output format**:
\\\`\\\`\\\`
## Code Quality Report
### Build: [PASS/FAIL] - tsc exit code, error count
### Lint: [PASS/FAIL] - linter output summary
### Tests: [PASS/FAIL] - N passed, N failed, N skipped
### File Review: [N files reviewed]
- [file]: [issues found or "clean"]
### AI Slop Check: [N issues]
- [file:line]: [pattern detected]
### VERDICT: APPROVE / REJECT
\\\`\\\`\\\`
- [ ] F3. Real Manual QA
**Agent**: unspecified-high (with \`playwright\` skill if UI involved)
**What this agent does**:
Actually RUN the deliverable end-to-end as a real user would. No mocks, no shortcuts.
**Exact verification steps**:
1. Start the application/service from scratch (clean state)
2. Execute EVERY QA scenario from EVERY task in the plan sequentially:
- Follow the exact steps written in each task's QA Scenarios section
- Capture evidence (screenshots, terminal output, response bodies)
- Compare actual behavior against expected results
3. Test cross-task integration:
- Does feature A work correctly WITH feature B? (not just in isolation)
- Does the full user flow work end-to-end?
4. Test edge cases not covered by individual tasks:
- Empty state / first-time use
- Rapid repeated actions
- Invalid/malformed input
- Network interruption (if applicable)
5. Save ALL evidence to .sisyphus/evidence/final-qa/
**Output format**:
\\\`\\\`\\\`
## Manual QA Report
### Scenarios Executed: [N/N passed]
- [✓/✗] Task N - Scenario name: [result]
### Integration Tests: [N/N passed]
- [✓/✗] [flow name]: [result]
### Edge Cases: [N tested]
- [✓/✗] [case]: [result]
### Evidence: .sisyphus/evidence/final-qa/
### VERDICT: APPROVE / REJECT
\\\`\\\`\\\`
- [ ] F4. Scope Fidelity Check
**Agent**: deep
**What this agent does**:
Verify that EACH task implemented EXACTLY what was specified - no more, no less.
Catches scope creep, missing features, and unauthorized additions.
**Exact verification steps**:
1. For EACH completed task in the plan:
a. Read the task's "What to do" section
b. Read the actual diff/files created for that task (git log, git diff, file reads)
c. Verify 1:1 correspondence:
- Everything in "What to do" was implemented - no missing features
- Nothing BEYOND "What to do" was implemented - no scope creep
d. Read the task's "Must NOT do" section
e. Verify NONE of the forbidden items were implemented
2. Check for unauthorized cross-task contamination:
- Did Task 5 accidentally implement something that belongs to Task 8?
- Are there files modified that don't belong to any task?
3. Verify each task's boundaries are respected:
- No task touches files outside its stated scope
- No task implements functionality assigned to a different task
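The boundary checks in steps 2-3 could lean on `git diff --name-only`, comparing touched paths against a task's declared file list; the repo, task number, and paths below are synthetic:

```shell
#!/bin/sh
# Build a throwaway repo where "Task 5" touches one in-scope file and one
# file belonging to another task, then flag the out-of-scope change.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git -c user.email=qa@example.com -c user.name=qa commit -q --allow-empty -m base

mkdir -p src/auth src/billing
echo 'login' > src/auth/login.ts
echo 'invoice' > src/billing/invoice.ts   # belongs to a different task
git add -A
git -c user.email=qa@example.com -c user.name=qa commit -q -m "task 5: auth login"

declared="src/auth/login.ts"              # from Task 5's "What to do"
touched=$(git diff --name-only HEAD~1 HEAD)

for f in $touched; do
  case " $declared " in
    *" $f "*) echo "in scope:     $f" ;;
    *)        echo "OUT OF SCOPE: $f" ;;
  esac
done
cd / && rm -rf "$repo"
```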
**Output format**:
\\\`\\\`\\\`
## Scope Fidelity Report
### Task-by-Task Audit: [N/N compliant]
- [✓/✗] Task N: [compliance status]
- Implemented: [list of what was done]
- Missing: [anything from "What to do" not found]
- Excess: [anything done that wasn't in "What to do"]
- "Must NOT do" violations: [list or "none"]
### Cross-Task Contamination: [CLEAN / N issues]
### Unaccounted Changes: [CLEAN / N files]
### VERDICT: APPROVE / REJECT
\\\`\\\`\\\`
---
## Commit Strategy
| After Task | Message | Files | Verification |