enhance: enforce mandatory per-task QA scenarios and add Final Verification Wave

Strengthen TODO template to make QA scenarios non-optional with explicit
rejection warning. Add Final Verification Wave with 4 parallel review
agents: oracle (plan compliance audit), unspecified-high (code quality),
unspecified-high (real manual QA), deep (scope fidelity check) — each
with detailed verification steps and structured output format.
YeonGyu-Kim 2026-02-16 15:19:31 +09:00
parent 7e6982c8d8
commit 130aaaf910


@@ -216,7 +216,13 @@ Wave 4 (After Wave 3 — verification):
Task 23: E2E QA (depends: 21) [deep]
Task 24: Git cleanup + tagging (depends: 21) [git]
Critical Path: Task 1 → Task 5 → Task 8 → Task 11 → Task 15 → Task 21
Wave FINAL (After ALL tasks - independent review, 4 parallel):
Task F1: Plan compliance audit (oracle)
Task F2: Code quality review (unspecified-high)
Task F3: Real manual QA (unspecified-high)
Task F4: Scope fidelity check (deep)
Critical Path: Task 1 → Task 5 → Task 8 → Task 11 → Task 15 → Task 21 → F1-F4
Parallel Speedup: ~70% faster than sequential
Max Concurrent: 7 (Waves 1 & 2)
\`\`\`
@@ -242,13 +248,15 @@ Max Concurrent: 7 (Waves 1 & 2)
| 2 | **7** | T8 → \`deep\`, T9 → \`unspecified-high\`, T10 → \`unspecified-high\`, T11 → \`deep\`, T12 → \`visual-engineering\`, T13 → \`quick\`, T14 → \`unspecified-high\` |
| 3 | **6** | T15 → \`deep\`, T16 → \`visual-engineering\`, T17-T19 → \`quick\`, T20 → \`visual-engineering\` |
| 4 | **4** | T21 → \`deep\`, T22 → \`unspecified-high\`, T23 → \`deep\`, T24 → \`git\` |
| FINAL | **4** | F1 → \`oracle\`, F2 → \`unspecified-high\`, F3 → \`unspecified-high\`, F4 → \`deep\` |
---
## TODOs
> Implementation + Test = ONE Task. Never separate.
> EVERY task MUST have: Recommended Agent Profile + Parallelization info.
> EVERY task MUST have: Recommended Agent Profile + Parallelization info + QA Scenarios.
> **A task WITHOUT QA Scenarios is INCOMPLETE. No exceptions.**
- [ ] 1. [Task Title]
@@ -282,22 +290,15 @@ Max Concurrent: 7 (Waves 1 & 2)
**Pattern References** (existing code to follow):
- \`src/services/auth.ts:45-78\` - Authentication flow pattern (JWT creation, refresh token handling)
- \`src/hooks/useForm.ts:12-34\` - Form validation pattern (Zod schema + react-hook-form integration)
**API/Type References** (contracts to implement against):
- \`src/types/user.ts:UserDTO\` - Response shape for user endpoints
- \`src/api/schema.ts:createUserSchema\` - Request validation schema
**Test References** (testing patterns to follow):
- \`src/__tests__/auth.test.ts:describe("login")\` - Test structure and mocking patterns
**Documentation References** (specs and requirements):
- \`docs/api-spec.md#authentication\` - API contract details
- \`ARCHITECTURE.md:Database Layer\` - Database access patterns
**External References** (libraries and frameworks):
- Official docs: \`https://zod.dev/?id=basic-usage\` - Zod validation syntax
- Example repo: \`github.com/example/project/src/auth\` - Reference implementation
**WHY Each Reference Matters** (explain the relevance):
- Don't just list files - explain what pattern/information the executor should extract
@@ -308,113 +309,53 @@ Max Concurrent: 7 (Waves 1 & 2)
> **AGENT-EXECUTABLE VERIFICATION ONLY.** No human action permitted.
> Every criterion MUST be verifiable by running a command or using a tool.
> REPLACE all placeholders with actual values from task context.
**If TDD (tests enabled):**
- [ ] Test file created: src/auth/login.test.ts
- [ ] Test covers: successful login returns JWT token
- [ ] bun test src/auth/login.test.ts → PASS (3 tests, 0 failures)
**Agent-Executed QA Scenarios (MANDATORY, per-scenario, ultra-detailed):**
**QA Scenarios (MANDATORY - task is INCOMPLETE without these):**
> Write MULTIPLE named scenarios per task: happy path AND failure cases.
> Each scenario = exact tool + steps with real selectors/data + evidence path.
**Example Frontend/UI (Playwright):**
> **This is NOT optional. A task without QA scenarios WILL BE REJECTED.**
>
> Write scenario tests that verify the ACTUAL BEHAVIOR of what you built.
> Minimum: 1 happy path + 1 failure/edge case per task.
> Each scenario = exact tool + exact steps + exact assertions + evidence path.
>
> **The executing agent MUST run these scenarios after implementation.**
> **The orchestrator WILL verify evidence files exist before marking task complete.**
\\\`\\\`\\\`
Scenario: Successful login redirects to dashboard
Tool: Playwright (playwright skill)
Preconditions: Dev server running on localhost:3000, test user exists
Scenario: [Happy path - what SHOULD work]
Tool: [Playwright / interactive_bash / Bash (curl)]
Preconditions: [Exact setup state]
Steps:
1. Navigate to: http://localhost:3000/login
2. Wait for: input[name="email"] visible (timeout: 5s)
3. Fill: input[name="email"] "test@example.com"
4. Fill: input[name="password"] "ValidPass123!"
5. Click: button[type="submit"]
6. Wait for: navigation to /dashboard (timeout: 10s)
7. Assert: h1 text contains "Welcome back"
8. Assert: cookie "session_token" exists
9. Screenshot: .sisyphus/evidence/task-1-login-success.png
Expected Result: Dashboard loads with welcome message
Evidence: .sisyphus/evidence/task-1-login-success.png
1. [Exact action - specific command/selector/endpoint, no vagueness]
2. [Next action with expected intermediate state]
3. [Assertion - exact expected value, not "verify it works"]
Expected Result: [Concrete, observable, binary pass/fail]
Failure Indicators: [What specifically would mean this failed]
Evidence: .sisyphus/evidence/task-{N}-{scenario-slug}.{ext}
Scenario: Login fails with invalid credentials
Tool: Playwright (playwright skill)
Preconditions: Dev server running, no valid user with these credentials
Scenario: [Failure/edge case - what SHOULD fail gracefully]
Tool: [same format]
Preconditions: [Invalid input / missing dependency / error state]
Steps:
1. Navigate to: http://localhost:3000/login
2. Fill: input[name="email"] "wrong@example.com"
3. Fill: input[name="password"] "WrongPass"
4. Click: button[type="submit"]
5. Wait for: .error-message visible (timeout: 5s)
6. Assert: .error-message text contains "Invalid credentials"
7. Assert: URL is still /login (no redirect)
8. Screenshot: .sisyphus/evidence/task-1-login-failure.png
Expected Result: Error message shown, stays on login page
Evidence: .sisyphus/evidence/task-1-login-failure.png
1. [Trigger the error condition]
2. [Assert error is handled correctly]
Expected Result: [Graceful failure with correct error message/code]
Evidence: .sisyphus/evidence/task-{N}-{scenario-slug}-error.{ext}
\\\`\\\`\\\`
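The evidence gate above ("the orchestrator WILL verify evidence files exist") could be mechanized along these lines; `check_evidence` is a hypothetical helper, and a throwaway temp directory stands in for `.sisyphus/evidence/`:

```shell
#!/bin/sh
# Hypothetical gate: refuse to mark a task complete if any expected
# evidence file is missing or empty. File names follow the
# task-{N}-{scenario-slug}.{ext} convention from the template above.
check_evidence() {
  dir="$1"; shift
  missing=0
  for f in "$@"; do
    if [ ! -s "$dir/$f" ]; then   # -s: exists AND is non-empty
      echo "MISSING EVIDENCE: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demo against a throwaway directory.
dir=$(mktemp -d)
printf 'fake png bytes' > "$dir/task-1-login-success.png"

check_evidence "$dir" task-1-login-success.png && echo "task 1: evidence OK"
check_evidence "$dir" task-1-login-failure.png || echo "task 1: INCOMPLETE"
rm -rf "$dir"
```

The same gate generalizes to one call per task, driven by the evidence paths listed in each scenario.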
**Example API/Backend (curl):**
\\\`\\\`\\\`
Scenario: Create user returns 201 with UUID
Tool: Bash (curl)
Preconditions: Server running on localhost:8080
Steps:
1. curl -s -w "\\n%{http_code}" -X POST http://localhost:8080/api/users \\
-H "Content-Type: application/json" \\
-d '{"email":"new@test.com","name":"Test User"}'
2. Assert: HTTP status is 201
3. Assert: response.id matches UUID format
4. GET /api/users/{returned-id} → Assert name equals "Test User"
Expected Result: User created and retrievable
Evidence: Response bodies captured
Scenario: Duplicate email returns 409
Tool: Bash (curl)
Preconditions: User with email "new@test.com" already exists
Steps:
1. Repeat POST with same email
2. Assert: HTTP status is 409
3. Assert: response.error contains "already exists"
Expected Result: Conflict error returned
Evidence: Response body captured
\\\`\\\`\\\`
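To show how such a capture could be asserted on mechanically, here is a sketch that checks the status line and UUID shape of a `curl -s -w "\n%{http_code}"` response; the captured body below is a synthetic stand-in, not real server output:

```shell
#!/bin/sh
# The capture format matches the example above: response body first,
# then the HTTP status code alone on the last line.
captured='{"id":"3f2a9c1e-8b4d-4e6a-9c0f-1a2b3c4d5e6f","email":"new@test.com"}
201'

status=$(printf '%s\n' "$captured" | tail -n 1)
body=$(printf '%s\n' "$captured" | sed '$d')    # everything but the status line

[ "$status" = "201" ] || { echo "FAIL: expected 201, got $status"; exit 1; }

# UUID-shaped id check (8-4-4-4-12 hex groups); avoids a jq dependency.
printf '%s' "$body" |
  grep -Eq '"id":"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"' ||
  { echo "FAIL: id is not UUID-shaped"; exit 1; }

echo "PASS: 201 with UUID id"
```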
**Example TUI/CLI (interactive_bash):**
\\\`\\\`\\\`
Scenario: CLI loads config and displays menu
Tool: interactive_bash (tmux)
Preconditions: Binary built, test config at ./test.yaml
Steps:
1. tmux new-session: ./my-cli --config test.yaml
2. Wait for: "Configuration loaded" in output (timeout: 5s)
3. Assert: Menu items visible ("1. Create", "2. List", "3. Exit")
4. Send keys: "3" then Enter
5. Assert: "Goodbye" in output
6. Assert: Process exited with code 0
Expected Result: CLI starts, shows menu, exits cleanly
Evidence: Terminal output captured
Scenario: CLI handles missing config gracefully
Tool: interactive_bash (tmux)
Preconditions: No config file at ./nonexistent.yaml
Steps:
1. tmux new-session: ./my-cli --config nonexistent.yaml
2. Wait for: output (timeout: 3s)
3. Assert: stderr contains "Config file not found"
4. Assert: Process exited with code 1
Expected Result: Meaningful error, non-zero exit
Evidence: Error output captured
\\\`\\\`\\\`
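Outside tmux, the missing-config case can be smoke-tested by stubbing the binary; `my_cli` below is an inline stand-in for the real CLI (messages copied from the scenario), not its actual implementation:

```shell
#!/bin/sh
# Stub standing in for ./my-cli: errors out when the config is absent.
my_cli() {
  # $1 is expected to be --config, $2 the path
  if [ ! -f "$2" ]; then
    echo "Config file not found: $2" >&2
    return 1
  fi
  echo "Configuration loaded"
}

# Capture stderr only, discard stdout, keep the exit code.
err=$(my_cli --config nonexistent.yaml 2>&1 >/dev/null)
code=$?

[ "$code" -eq 1 ] || { echo "FAIL: expected exit 1, got $code"; exit 1; }
case "$err" in
  *"Config file not found"*) echo "PASS: meaningful error, non-zero exit" ;;
  *) echo "FAIL: wrong error message: $err"; exit 1 ;;
esac
```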
> **Anti-patterns (your scenario is INVALID if it looks like this):**
> - "Verify it works correctly" HOW? What does "correctly" mean?
> - "Check the API returns data" WHAT data? What fields? What values?
> - "Test the component renders" WHERE? What selector? What content?
> - Any scenario without an evidence path
**Evidence to Capture:**
- [ ] Screenshots in .sisyphus/evidence/ for UI scenarios
- [ ] Terminal output for CLI/TUI scenarios
- [ ] Response bodies for API scenarios
- [ ] Each evidence file named: task-{N}-{scenario-slug}.{ext}
- [ ] Screenshots for UI, terminal output for CLI, response bodies for API
**Commit**: YES | NO (groups with N)
- Message: \`type(scope): desc\`
@@ -423,6 +364,158 @@ Max Concurrent: 7 (Waves 1 & 2)
---
## Final Verification Wave (MANDATORY after ALL implementation tasks)
> **ALL 4 review agents run in PARALLEL after every implementation task is complete.**
> **ALL 4 must APPROVE before the plan is considered done.**
> **If ANY agent rejects, fix issues and re-run the rejecting agent(s).**
- [ ] F1. Plan Compliance Audit
**Agent**: oracle (read-only consultation)
**What this agent does**:
Read the original work plan (.sisyphus/plans/{name}.md) and verify EVERY requirement was fulfilled.
**Exact verification steps**:
1. Read the plan file end-to-end
2. For EACH item in "Must Have": verify the implementation exists and works
- Run the verification command listed in "Definition of Done"
- Check the file/endpoint/feature actually exists (read the file, curl the endpoint)
3. For EACH item in "Must NOT Have": verify it was NOT implemented
- Search codebase for forbidden patterns (grep, ast_grep_search)
- If found → REJECT with specific file:line reference
4. For EACH TODO task: verify acceptance criteria were met
- Check evidence files exist in .sisyphus/evidence/
- Verify test results match expected outcomes
5. Compare final deliverables against "Concrete Deliverables" list
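Step 3's search for forbidden patterns might be run roughly like this; the sample tree and the `as any` guardrail are illustrative only, not taken from a real plan:

```shell
#!/bin/sh
# Grep a (synthetic) source tree for a pattern the plan forbids and
# surface file:line references for the rejection report.
src=$(mktemp -d)
cat > "$src/user.ts" <<'EOF'
export function load(raw: unknown) {
  return raw as any; // forbidden by this hypothetical plan
}
EOF

if grep -rn --include='*.ts' 'as any' "$src"; then
  echo "VERDICT: REJECT (forbidden pattern found above)"
else
  echo "VERDICT: APPROVE"
fi
rm -rf "$src"
```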
**Output format**:
\\\`\\\`\\\`
## Plan Compliance Report
### Must Have: [N/N passed]
- [✓/✗] [requirement]: [evidence]
### Must NOT Have: [N/N clean]
- [✓/✗] [guardrail]: [evidence]
### Task Completion: [N/N verified]
- [✓/✗] Task N: [criteria status]
### VERDICT: APPROVE / REJECT
### Rejection Reasons (if any): [specific issues]
\\\`\\\`\\\`
- [ ] F2. Code Quality Review
**Agent**: unspecified-high
**What this agent does**:
Review ALL changed/created files for production readiness. This is NOT a rubber stamp.
**Exact verification steps**:
1. Run full type check: \`bunx tsc --noEmit\` (or project equivalent) → must exit 0
2. Run linter if configured: \`bunx biome check .\` / \`bunx eslint .\` → must pass
3. Run full test suite: \`bun test\` → all tests pass, zero failures
4. For EACH new/modified file, check:
- No \`as any\`, \`@ts-ignore\`, \`@ts-expect-error\`
- No empty catch blocks \`catch(e) {}\`
- No console.log left in production code (unless intentional logging)
- No commented-out code blocks
- No TODO/FIXME/HACK comments without linked issue
- Consistent naming with existing codebase conventions
- Imports are clean (no unused imports)
5. Check for AI slop patterns:
- Excessive inline comments explaining obvious code
- Over-abstraction (unnecessary wrapper functions)
- Generic variable names (data, result, item, temp)
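The per-file checks in step 4 could be mechanized as a first pass with greps like these; the patterns are rough approximations a reviewer would tune per project, and the sample file is synthetic:

```shell
#!/bin/sh
# Count slop/debt markers in one sample file; file:line hits are printed
# by grep -n so the reviewer can jump straight to them.
f=$(mktemp)
cat > "$f" <<'EOF'
// @ts-ignore
try { risky(); } catch (e) {}
console.log("done");
EOF

issues=0
grep -En '@ts-ignore|@ts-expect-error|as any' "$f" && issues=$((issues + 1))
grep -En 'catch ?\([a-z]*\) ?\{ ?\}'          "$f" && issues=$((issues + 1))
grep -En 'console\.log'                       "$f" && issues=$((issues + 1))

echo "issues found: $issues"
rm -f "$f"
```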
**Output format**:
\\\`\\\`\\\`
## Code Quality Report
### Build: [PASS/FAIL] - tsc exit code, error count
### Lint: [PASS/FAIL] - linter output summary
### Tests: [PASS/FAIL] - N passed, N failed, N skipped
### File Review: [N files reviewed]
- [file]: [issues found or "clean"]
### AI Slop Check: [N issues]
- [file:line]: [pattern detected]
### VERDICT: APPROVE / REJECT
\\\`\\\`\\\`
- [ ] F3. Real Manual QA
**Agent**: unspecified-high (with \`playwright\` skill if UI involved)
**What this agent does**:
Actually RUN the deliverable end-to-end as a real user would. No mocks, no shortcuts.
**Exact verification steps**:
1. Start the application/service from scratch (clean state)
2. Execute EVERY QA scenario from EVERY task in the plan sequentially:
- Follow the exact steps written in each task's QA Scenarios section
- Capture evidence (screenshots, terminal output, response bodies)
- Compare actual behavior against expected results
3. Test cross-task integration:
- Does feature A work correctly WITH feature B? (not just in isolation)
- Does the full user flow work end-to-end?
4. Test edge cases not covered by individual tasks:
- Empty state / first-time use
- Rapid repeated actions
- Invalid/malformed input
- Network interruption (if applicable)
5. Save ALL evidence to .sisyphus/evidence/final-qa/
**Output format**:
\\\`\\\`\\\`
## Manual QA Report
### Scenarios Executed: [N/N passed]
- [✓/✗] Task N - Scenario name: [result]
### Integration Tests: [N/N passed]
- [✓/✗] [flow name]: [result]
### Edge Cases: [N tested]
- [✓/✗] [case]: [result]
### Evidence: .sisyphus/evidence/final-qa/
### VERDICT: APPROVE / REJECT
\\\`\\\`\\\`
- [ ] F4. Scope Fidelity Check
**Agent**: deep
**What this agent does**:
Verify that EACH task implemented EXACTLY what was specified - no more, no less.
Catches scope creep, missing features, and unauthorized additions.
**Exact verification steps**:
1. For EACH completed task in the plan:
a. Read the task's "What to do" section
b. Read the actual diff/files created for that task (git log, git diff, file reads)
c. Verify 1:1 correspondence:
- Everything in "What to do" was implemented - no missing features
- Nothing BEYOND "What to do" was implemented - no scope creep
d. Read the task's "Must NOT do" section
e. Verify NONE of the forbidden items were implemented
2. Check for unauthorized cross-task contamination:
- Did Task 5 accidentally implement something that belongs to Task 8?
- Are there files modified that don't belong to any task?
3. Verify each task's boundaries are respected:
- No task touches files outside its stated scope
- No task implements functionality assigned to a different task
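The boundary checks in steps 2-3 could lean on `git diff --name-only`, comparing touched paths against a task's declared file list; the repo, task number, and paths below are synthetic:

```shell
#!/bin/sh
# Build a throwaway repo where "Task 5" touches one in-scope file and one
# file belonging to another task, then flag the out-of-scope change.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git -c user.email=qa@example.com -c user.name=qa commit -q --allow-empty -m base

mkdir -p src/auth src/billing
echo 'login' > src/auth/login.ts
echo 'invoice' > src/billing/invoice.ts   # belongs to a different task
git add -A
git -c user.email=qa@example.com -c user.name=qa commit -q -m "task 5: auth login"

declared="src/auth/login.ts"              # from Task 5's "What to do"
touched=$(git diff --name-only HEAD~1 HEAD)

for f in $touched; do
  case " $declared " in
    *" $f "*) echo "in scope:     $f" ;;
    *)        echo "OUT OF SCOPE: $f" ;;
  esac
done
cd / && rm -rf "$repo"
```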
**Output format**:
\\\`\\\`\\\`
## Scope Fidelity Report
### Task-by-Task Audit: [N/N compliant]
- [✓/✗] Task N: [compliance status]
- Implemented: [list of what was done]
- Missing: [anything from "What to do" not found]
- Excess: [anything done that wasn't in "What to do"]
- "Must NOT do" violations: [list or "none"]
### Cross-Task Contamination: [CLEAN / N issues]
### Unaccounted Changes: [CLEAN / N files]
### VERDICT: APPROVE / REJECT
\\\`\\\`\\\`
---
## Commit Strategy
| After Task | Message | Files | Verification |