refactor(agents): enforce zero user intervention in QA/acceptance criteria

- Prometheus: rename 'Manual QA' to 'Automated Verification Only'
- Prometheus: add explicit ZERO USER INTERVENTION principle
- Prometheus: replace placeholder examples with concrete executable commands
- Metis: add QA automation directives in output format
- Metis: strengthen CRITICAL RULES to forbid user-intervention criteria
2026-01-28 21:03:50 +09:00
commit 4fd9f0fd04
parent 4413336724
2 changed files with 115 additions and 53 deletions
--- a/src/agents/metis.ts
+++ b/src/agents/metis.ts
@@ -230,6 +230,8 @@ call_omo_agent(subagent_type="librarian", prompt="Find OSS implementations of Z.
 - [Risk 2]: [Mitigation]
 ## Directives for Prometheus
 ### Core Directives
 - MUST: [Required action]
 - MUST: [Required action]
 - MUST NOT: [Forbidden action]
@@ -237,6 +239,29 @@ call_omo_agent(subagent_type="librarian", prompt="Find OSS implementations of Z.
 - PATTERN: Follow \`[file:lines]\`
 - TOOL: Use \`[specific tool]\` for [purpose]
 ### QA/Acceptance Criteria Directives (MANDATORY)
 > **ZERO USER INTERVENTION PRINCIPLE**: All acceptance criteria MUST be executable by agents.
 - MUST: Write acceptance criteria as executable commands (curl, bun test, playwright actions)
 - MUST: Include exact expected outputs, not vague descriptions
 - MUST: Specify verification tool for each deliverable type (playwright for UI, curl for API, etc.)
 - MUST NOT: Create criteria requiring "user manually tests..."
 - MUST NOT: Create criteria requiring "user visually confirms..."
 - MUST NOT: Create criteria requiring "user clicks/interacts..."
 - MUST NOT: Use placeholders without concrete examples (bad: "[endpoint]", good: "/api/users")
 Example of GOOD acceptance criteria:
 \`\`\`
 curl -s http://localhost:3000/api/health | jq '.status'
 # Assert: Output is "ok"
 \`\`\`
 Example of BAD acceptance criteria (FORBIDDEN):
 \`\`\`
 User opens browser and checks if the page loads correctly.
 User confirms the button works as expected.
 \`\`\`
 ## Recommended Approach
 [1-2 sentence summary of how to proceed]
 \`\`\`
@@ -263,12 +288,16 @@ call_omo_agent(subagent_type="librarian", prompt="Find OSS implementations of Z.
 - Ask generic questions ("What's the scope?")
 - Proceed without addressing ambiguity
 - Make assumptions about user's codebase
 - Suggest acceptance criteria requiring user intervention ("user manually tests", "user confirms", "user clicks")
 - Leave QA/acceptance criteria vague or placeholder-heavy
 **ALWAYS**:
 - Classify intent FIRST
 - Be specific ("Should this change UserService only, or also AuthService?")
 - Explore before asking (for Build/Research intents)
 - Provide actionable directives for Prometheus
 - Include QA automation directives in every output
 - Ensure acceptance criteria are agent-executable (commands, not human actions)
 `
 const metisRestrictions = createAgentToolRestrictions([
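Reviewer note: the NEVER rules added to the Metis prompt above could, in principle, also be checked mechanically. The following is a minimal sketch of such a check, not something this commit contains. The names `FORBIDDEN_PATTERNS` and `findUserInterventionCriteria` are hypothetical, and the pattern list is an assumption derived from the phrases quoted in the prompt ("user manually tests", "user confirms", "user clicks").

```typescript
// Hypothetical lint helper; NOT part of this commit.
// Patterns are assumptions based on the phrases the Metis prompt forbids.
const FORBIDDEN_PATTERNS: RegExp[] = [
  /\buser\s+(manually|visually)\b/i,
  /\buser\s+(opens|clicks|confirms|interacts|checks)\b/i,
  /\bask\s+(the\s+)?user\s+to\b/i,
];

// Returns the criteria lines that appear to require a human to act.
function findUserInterventionCriteria(criteria: string[]): string[] {
  return criteria.filter((line) =>
    FORBIDDEN_PATTERNS.some((pattern) => pattern.test(line)),
  );
}

// The GOOD and BAD examples quoted in the Metis prompt.
const sample: string[] = [
  "curl -s http://localhost:3000/api/health | jq '.status'",
  "User opens browser and checks if the page loads correctly.",
  "User confirms the button works as expected.",
];

const flagged = findUserInterventionCriteria(sample);
console.log(flagged.length); // 2: both "User ..." lines are flagged
```

A phrase list like this only catches the wordings it knows about, so it would complement the prompt rules rather than replace them.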
--- a/src/agents/prometheus-prompt.ts
+++ b/src/agents/prometheus-prompt.ts
@@ -953,27 +953,37 @@ Each TODO follows RED-GREEN-REFACTOR:
  - Example: Create \`src/__tests__/example.test.ts\`
  - Verify: \`bun test\` → 1 test passes
-### If Manual QA Only
+### If Automated Verification Only (NO User Intervention)
-**CRITICAL**: Without automated tests, manual verification MUST be exhaustive.
+> **CRITICAL PRINCIPLE: ZERO USER INTERVENTION**
 >
 > **NEVER** create acceptance criteria that require:
 > - "User manually tests..." / "사용자가 직접 테스트..."
 > - "User visually confirms..." / "사용자가 눈으로 확인..."
 > - "User interacts with..." / "사용자가 직접 조작..."
 > - "Ask user to verify..." / "사용자에게 확인 요청..."
 > - ANY step that requires a human to perform an action
 >
 > **ALL verification MUST be automated and executable by the agent.**
 > If a verification cannot be automated, find an automated alternative or explicitly note it as a known limitation.
-Each TODO includes detailed verification procedures:
+Each TODO includes EXECUTABLE verification procedures that agents can run directly:
 **By Deliverable Type:**
-| Type | Verification Tool | Procedure |
+| Type | Verification Tool | Automated Procedure |
-|------|------------------|-----------|
+|------|------------------|---------------------|
-| **Frontend/UI** | Playwright browser | Navigate, interact, screenshot |
+| **Frontend/UI** | Playwright browser via playwright skill | Agent navigates, clicks, screenshots, asserts DOM state |
-| **TUI/CLI** | interactive_bash (tmux) | Run command, verify output |
+| **TUI/CLI** | interactive_bash (tmux) | Agent runs command, captures output, validates expected strings |
-| **API/Backend** | curl / httpie | Send request, verify response |
+| **API/Backend** | curl / httpie via Bash | Agent sends request, parses response, validates JSON fields |
-| **Library/Module** | Node/Python REPL | Import, call, verify |
+| **Library/Module** | Node/Python REPL via Bash | Agent imports, calls function, compares output |
-| **Config/Infra** | Shell commands | Apply, verify state |
+| **Config/Infra** | Shell commands via Bash | Agent applies config, runs state check, validates output |
-**Evidence Required:**
+**Evidence Requirements (Agent-Executable):**
- Commands run with actual output
+- Command output captured and compared against expected patterns
- Screenshots for visual changes
+- Screenshots saved to .sisyphus/evidence/ for visual verification
- Response bodies for API changes
+- JSON response fields validated with specific assertions
- Terminal output for CLI changes
+- Exit codes checked (0 = success)
 ---
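Reviewer note: the evidence requirements in the hunk above (output compared against expected patterns, exit code 0 treated as success) amount to a simple pass/fail check. A minimal sketch of that check follows, assuming the agent has already captured the command's exit code and stdout. `CommandResult`, `Verdict`, and `verifyCommandResult` are hypothetical names and do not appear in the repository.

```typescript
// Hypothetical sketch; NOT part of this commit.
interface CommandResult {
  command: string;  // the command the agent ran
  exitCode: number; // 0 = success, per the evidence requirements
  stdout: string;   // captured terminal output
}

interface Verdict {
  passed: boolean;
  reasons: string[]; // one entry per failed assertion
}

// Checks the exit code and compares stdout against an expected pattern.
function verifyCommandResult(result: CommandResult, expected: RegExp): Verdict {
  const reasons: string[] = [];
  if (result.exitCode !== 0) {
    reasons.push(`exit code ${result.exitCode}, expected 0`);
  }
  if (!expected.test(result.stdout)) {
    reasons.push(`output did not match ${expected}`);
  }
  return { passed: reasons.length === 0, reasons };
}

// Example using the health-check command quoted in the Metis prompt;
// the captured values here are illustrative, not real output.
const verdict = verifyCommandResult(
  {
    command: "curl -s http://localhost:3000/api/health | jq '.status'",
    exitCode: 0,
    stdout: '"ok"\n',
  },
  /^"ok"$/m,
);
console.log(verdict.passed); // true
```

Collecting failure reasons instead of throwing on the first mismatch keeps the full evidence available for the task report.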
@@ -1083,53 +1093,76 @@ Parallel Speedup: ~40% faster than sequential
  **Acceptance Criteria**:
-  > CRITICAL: Acceptance = EXECUTION, not just "it should work".
+  > **CRITICAL: AGENT-EXECUTABLE VERIFICATION ONLY**
-  > The executor MUST run these commands and verify output.
+  >
  > - Acceptance = EXECUTION by the agent, not "user checks if it works"
  > - Every criterion MUST be verifiable by running a command or using a tool
  > - NO steps like "user opens browser", "user clicks", "user confirms"
  > - If you write "[placeholder]" - REPLACE IT with actual values based on task context
  **If TDD (tests enabled):**
-  - [ ] Test file created: \`[path].test.ts\`
+  - [ ] Test file created: src/auth/login.test.ts
-  - [ ] Test covers: [specific scenario]
+  - [ ] Test covers: successful login returns JWT token
-  - [ ] \`bun test [file]\` → PASS (N tests, 0 failures)
+  - [ ] bun test src/auth/login.test.ts → PASS (3 tests, 0 failures)
-  **Manual Execution Verification (ALWAYS include, even with tests):**
+  **Automated Verification (ALWAYS include, choose by deliverable type):**
-  *Choose based on deliverable type:*
+  **For Frontend/UI changes** (using playwright skill):
  \\\`\\\`\\\`
  # Agent executes via playwright browser automation:
  1. Navigate to: http://localhost:3000/login
  2. Fill: input[name="email"] with "test@example.com"
  3. Fill: input[name="password"] with "password123"
  4. Click: button[type="submit"]
  5. Wait for: selector ".dashboard-welcome" to be visible
  6. Assert: text "Welcome back" appears on page
  7. Screenshot: .sisyphus/evidence/task-1-login-success.png
  \\\`\\\`\\\`
-  **For Frontend/UI changes:**
+  **For TUI/CLI changes** (using interactive_bash):
-  - [ ] Using playwright browser automation:
+  \\\`\\\`\\\`
-    - Navigate to: \`http://localhost:[port]/[path]\`
+  # Agent executes via tmux session:
-    - Action: [click X, fill Y, scroll to Z]
+  1. Command: ./my-cli --config test.yaml
-    - Verify: [visual element appears, animation completes, state changes]
+  2. Wait for: "Configuration loaded" in output
-    - Screenshot: Save evidence to \`.sisyphus/evidence/[task-id]-[step].png\`
+  3. Send keys: "q" to quit
  4. Assert: Exit code 0
  5. Assert: Output contains "Goodbye"
  \\\`\\\`\\\`
-  **For TUI/CLI changes:**
+  **For API/Backend changes** (using Bash curl):
-  - [ ] Using interactive_bash (tmux session):
+  \\\`\\\`\\\`bash
-    - Command: \`[exact command to run]\`
+  # Agent runs:
-    - Input sequence: [if interactive, list inputs]
+  curl -s -X POST http://localhost:8080/api/users \\
-    - Expected output contains: \`[expected string or pattern]\`
+    -H "Content-Type: application/json" \\
-    - Exit code: [0 for success, specific code if relevant]
+    -d '{"email":"new@test.com","name":"Test User"}' \\
    | jq '.id'
  # Assert: Returns non-empty UUID
  # Assert: HTTP status 201
  \\\`\\\`\\\`
-  **For API/Backend changes:**
+  **For Library/Module changes** (using Bash node/bun):
-  - [ ] Request: \`curl -X [METHOD] http://localhost:[port]/[endpoint] -H "Content-Type: application/json" -d '[body]'\`
+  \\\`\\\`\\\`bash
-  - [ ] Response status: [200/201/etc]
+  # Agent runs:
-  - [ ] Response body contains: \`{"key": "expected_value"}\`
+  bun -e "import { validateEmail } from './src/utils/validate'; console.log(validateEmail('test@example.com'))"
  # Assert: Output is "true"
  bun -e "import { validateEmail } from './src/utils/validate'; console.log(validateEmail('invalid'))"
  # Assert: Output is "false"
  \\\`\\\`\\\`
-  **For Library/Module changes:**
+  **For Config/Infra changes** (using Bash):
-  - [ ] REPL verification:
+  \\\`\\\`\\\`bash
-    \`\`\`
+  # Agent runs:
-    > import { [function] } from '[module]'
+  docker compose up -d
-    > [function]([args])
+  # Wait 5s for containers
-    Expected: [output]
+  docker compose ps --format json | jq '.[].State'
-    \`\`\`
+  # Assert: All states are "running"
  \\\`\\\`\\\`
-  **For Config/Infra changes:**
+  **Evidence to Capture:**
-  - [ ] Apply: \`[command to apply config]\`
+  - [ ] Terminal output from verification commands (actual output, not expected)
-  - [ ] Verify state: \`[command to check state]\` → \`[expected output]\`
+  - [ ] Screenshot files in .sisyphus/evidence/ for UI changes
-
+  - [ ] JSON response bodies for API changes
  **Evidence Required:**
  - [ ] Command output captured (copy-paste actual terminal output)
  - [ ] Screenshot saved (for visual changes)
  - [ ] Response body logged (for API changes)
  **Commit**: YES | NO (groups with N)
  - Message: \`type(scope): desc\`