refactor(agents): enforce zero user intervention in QA/acceptance criteria

- Prometheus: rename 'Manual QA' to 'Automated Verification Only'
- Prometheus: add explicit ZERO USER INTERVENTION principle
- Prometheus: replace placeholder examples with concrete executable commands
- Metis: add QA automation directives in output format
- Metis: strengthen CRITICAL RULES to forbid user-intervention criteria
justsisyphus 2026-01-28 21:03:50 +09:00
parent 4413336724
commit 4fd9f0fd04
2 changed files with 115 additions and 53 deletions

@@ -230,6 +230,8 @@ call_omo_agent(subagent_type="librarian", prompt="Find OSS implementations of Z.
 - [Risk 2]: [Mitigation]
 ## Directives for Prometheus
+### Core Directives
 - MUST: [Required action]
 - MUST: [Required action]
 - MUST NOT: [Forbidden action]
@@ -237,6 +239,29 @@ call_omo_agent(subagent_type="librarian", prompt="Find OSS implementations of Z.
 - PATTERN: Follow \`[file:lines]\`
 - TOOL: Use \`[specific tool]\` for [purpose]
+### QA/Acceptance Criteria Directives (MANDATORY)
+> **ZERO USER INTERVENTION PRINCIPLE**: All acceptance criteria MUST be executable by agents.
+- MUST: Write acceptance criteria as executable commands (curl, bun test, playwright actions)
+- MUST: Include exact expected outputs, not vague descriptions
+- MUST: Specify verification tool for each deliverable type (playwright for UI, curl for API, etc.)
+- MUST NOT: Create criteria requiring "user manually tests..."
+- MUST NOT: Create criteria requiring "user visually confirms..."
+- MUST NOT: Create criteria requiring "user clicks/interacts..."
+- MUST NOT: Use placeholders without concrete examples (bad: "[endpoint]", good: "/api/users")
+Example of GOOD acceptance criteria:
+\`\`\`
+curl -s http://localhost:3000/api/health | jq '.status'
+# Assert: Output is "ok"
+\`\`\`
+Example of BAD acceptance criteria (FORBIDDEN):
+\`\`\`
+User opens browser and checks if the page loads correctly.
+User confirms the button works as expected.
+\`\`\`
 ## Recommended Approach
 [1-2 sentence summary of how to proceed]
 \`\`\`
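The "executable command plus exact expected output" rule added in this hunk can be pictured as data. The following is a minimal sketch, not code from the commit: the `AcceptanceCriterion` shape and the `isSatisfied` helper are hypothetical names introduced here for illustration only.

```typescript
// Hypothetical shape for one agent-executable acceptance criterion.
// Nothing in the commit defines this type; it only illustrates the directive.
interface AcceptanceCriterion {
  command: string;        // exact command the agent runs
  expectedStdout: string; // exact expected output, compared after trimming
}

// The GOOD example from the directive, expressed as data.
// jq prints JSON strings with their quotes, hence '"ok"'.
const healthCheck: AcceptanceCriterion = {
  command: "curl -s http://localhost:3000/api/health | jq '.status'",
  expectedStdout: '"ok"',
};

// Compares captured stdout with the expectation, ignoring surrounding whitespace.
function isSatisfied(criterion: AcceptanceCriterion, actualStdout: string): boolean {
  return actualStdout.trim() === criterion.expectedStdout;
}
```

A criterion written this way leaves no room for "user confirms" wording: either the captured output equals the expected string or it does not.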
@@ -263,12 +288,16 @@ call_omo_agent(subagent_type="librarian", prompt="Find OSS implementations of Z.
 - Ask generic questions ("What's the scope?")
 - Proceed without addressing ambiguity
 - Make assumptions about user's codebase
+- Suggest acceptance criteria requiring user intervention ("user manually tests", "user confirms", "user clicks")
+- Leave QA/acceptance criteria vague or placeholder-heavy
 **ALWAYS**:
 - Classify intent FIRST
 - Be specific ("Should this change UserService only, or also AuthService?")
 - Explore before asking (for Build/Research intents)
 - Provide actionable directives for Prometheus
+- Include QA automation directives in every output
+- Ensure acceptance criteria are agent-executable (commands, not human actions)
 `
 const metisRestrictions = createAgentToolRestrictions([
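The new NEVER rules are enforced only through prompt text. As an illustration of what they forbid, here is a rough sketch of a checker that flags user-intervention wording and unfilled placeholders in a block of acceptance criteria. `findViolations`, the pattern list, and the placeholder heuristic are all assumptions made for this example; the commit contains no such function.

```typescript
// Phrases drawn from the MUST NOT / NEVER rules in the diff above.
// The list is illustrative, not exhaustive.
const FORBIDDEN_PATTERNS: RegExp[] = [
  /user manually tests/i,
  /user visually confirms/i,
  /user confirms/i,
  /user clicks/i,
  /user opens browser/i,
];

// Heuristic for an unfilled placeholder such as "[endpoint]" or "[path]".
// It requires a leading letter, so checkboxes ("[ ]") and CSS attribute
// selectors like input[name="email"] are not flagged.
const PLACEHOLDER = /\[[a-z][a-z _-]*\]/i;

// Returns one human-readable message per violation found.
function findViolations(criteria: string): string[] {
  const violations: string[] = [];
  criteria.split("\n").forEach((line, index) => {
    for (const pattern of FORBIDDEN_PATTERNS) {
      if (pattern.test(line)) {
        violations.push(`line ${index + 1}: requires user intervention (${pattern.source})`);
      }
    }
    if (PLACEHOLDER.test(line)) {
      violations.push(`line ${index + 1}: unfilled placeholder`);
    }
  });
  return violations;
}
```

Run against the BAD example from the Metis prompt, this reports both lines; run against the GOOD `curl ... | jq '.status'` example, it reports nothing.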

@@ -953,27 +953,37 @@ Each TODO follows RED-GREEN-REFACTOR:
 - Example: Create \`src/__tests__/example.test.ts\`
 - Verify: \`bun test\` → 1 test passes
-### If Manual QA Only
-**CRITICAL**: Without automated tests, manual verification MUST be exhaustive.
+### If Automated Verification Only (NO User Intervention)
+> **CRITICAL PRINCIPLE: ZERO USER INTERVENTION**
+>
+> **NEVER** create acceptance criteria that require:
+> - "User manually tests..." / "사용자가 직접 테스트..."
+> - "User visually confirms..." / "사용자가 눈으로 확인..."
+> - "User interacts with..." / "사용자가 직접 조작..."
+> - "Ask user to verify..." / "사용자에게 확인 요청..."
+> - ANY step that requires a human to perform an action
+>
+> **ALL verification MUST be automated and executable by the agent.**
+> If a verification cannot be automated, find an automated alternative or explicitly note it as a known limitation.
-Each TODO includes detailed verification procedures:
+Each TODO includes EXECUTABLE verification procedures that agents can run directly:
 **By Deliverable Type:**
-| Type | Verification Tool | Procedure |
-|------|------------------|-----------|
-| **Frontend/UI** | Playwright browser | Navigate, interact, screenshot |
-| **TUI/CLI** | interactive_bash (tmux) | Run command, verify output |
-| **API/Backend** | curl / httpie | Send request, verify response |
-| **Library/Module** | Node/Python REPL | Import, call, verify |
-| **Config/Infra** | Shell commands | Apply, verify state |
+| Type | Verification Tool | Automated Procedure |
+|------|------------------|---------------------|
+| **Frontend/UI** | Playwright browser via playwright skill | Agent navigates, clicks, screenshots, asserts DOM state |
+| **TUI/CLI** | interactive_bash (tmux) | Agent runs command, captures output, validates expected strings |
+| **API/Backend** | curl / httpie via Bash | Agent sends request, parses response, validates JSON fields |
+| **Library/Module** | Node/Python REPL via Bash | Agent imports, calls function, compares output |
+| **Config/Infra** | Shell commands via Bash | Agent applies config, runs state check, validates output |
-**Evidence Required:**
-- Commands run with actual output
-- Screenshots for visual changes
-- Response bodies for API changes
-- Terminal output for CLI changes
+**Evidence Requirements (Agent-Executable):**
+- Command output captured and compared against expected patterns
+- Screenshots saved to .sisyphus/evidence/ for visual verification
+- JSON response fields validated with specific assertions
+- Exit codes checked (0 = success)
 ---
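The new evidence requirements (output compared against patterns, JSON fields asserted, exit code 0) amount to a small checking routine. The sketch below shows one way an agent-side helper might apply them to an already captured result. `CommandResult`, `Expectation`, and `verifyEvidence` are hypothetical names for this illustration and are not part of the commit.

```typescript
// Hypothetical record of one executed verification command.
interface CommandResult {
  exitCode: number;
  stdout: string;
}

// What the acceptance criterion expects from that command.
interface Expectation {
  stdoutPattern?: RegExp;               // pattern the captured output must match (avoid the g flag)
  jsonFields?: Record<string, unknown>; // top-level JSON fields that must equal these values
}

// Returns a list of failures; an empty list means the evidence passes.
function verifyEvidence(result: CommandResult, expected: Expectation): string[] {
  const failures: string[] = [];

  // "Exit codes checked (0 = success)"
  if (result.exitCode !== 0) {
    failures.push(`exit code ${result.exitCode}, expected 0`);
  }

  // "Command output captured and compared against expected patterns"
  if (expected.stdoutPattern && !expected.stdoutPattern.test(result.stdout)) {
    failures.push(`stdout does not match ${expected.stdoutPattern.source}`);
  }

  // "JSON response fields validated with specific assertions"
  if (expected.jsonFields) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(result.stdout);
    } catch {
      failures.push("stdout is not valid JSON");
      return failures;
    }
    if (typeof parsed !== "object" || parsed === null) {
      failures.push("stdout is not a JSON object");
      return failures;
    }
    const body = parsed as Record<string, unknown>;
    for (const key of Object.keys(expected.jsonFields)) {
      if (body[key] !== expected.jsonFields[key]) {
        failures.push(`field ${key}: got ${JSON.stringify(body[key])}`);
      }
    }
  }
  return failures;
}
```

Keeping the check separate from command execution means the same routine can be applied to output from curl, a tmux session, or a REPL call.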
@@ -1083,53 +1093,76 @@ Parallel Speedup: ~40% faster than sequential
 **Acceptance Criteria**:
-> CRITICAL: Acceptance = EXECUTION, not just "it should work".
-> The executor MUST run these commands and verify output.
+> **CRITICAL: AGENT-EXECUTABLE VERIFICATION ONLY**
+>
+> - Acceptance = EXECUTION by the agent, not "user checks if it works"
+> - Every criterion MUST be verifiable by running a command or using a tool
+> - NO steps like "user opens browser", "user clicks", "user confirms"
+> - If you write "[placeholder]" - REPLACE IT with actual values based on task context
 **If TDD (tests enabled):**
-- [ ] Test file created: \`[path].test.ts\`
-- [ ] Test covers: [specific scenario]
-- [ ] \`bun test [file]\` → PASS (N tests, 0 failures)
+- [ ] Test file created: src/auth/login.test.ts
+- [ ] Test covers: successful login returns JWT token
+- [ ] bun test src/auth/login.test.ts PASS (3 tests, 0 failures)
-**Manual Execution Verification (ALWAYS include, even with tests):**
-*Choose based on deliverable type:*
-**For Frontend/UI changes:**
-- [ ] Using playwright browser automation:
-- Navigate to: \`http://localhost:[port]/[path]\`
-- Action: [click X, fill Y, scroll to Z]
-- Verify: [visual element appears, animation completes, state changes]
-- Screenshot: Save evidence to \`.sisyphus/evidence/[task-id]-[step].png\`
-**For TUI/CLI changes:**
-- [ ] Using interactive_bash (tmux session):
-- Command: \`[exact command to run]\`
-- Input sequence: [if interactive, list inputs]
-- Expected output contains: \`[expected string or pattern]\`
-- Exit code: [0 for success, specific code if relevant]
-**For API/Backend changes:**
-- [ ] Request: \`curl -X [METHOD] http://localhost:[port]/[endpoint] -H "Content-Type: application/json" -d '[body]'\`
-- [ ] Response status: [200/201/etc]
-- [ ] Response body contains: \`{"key": "expected_value"}\`
-**For Library/Module changes:**
-- [ ] REPL verification:
-\`\`\`
-> import { [function] } from '[module]'
-> [function]([args])
-Expected: [output]
-\`\`\`
-**For Config/Infra changes:**
-- [ ] Apply: \`[command to apply config]\`
-- [ ] Verify state: \`[command to check state]\`\`[expected output]\`
-**Evidence Required:**
-- [ ] Command output captured (copy-paste actual terminal output)
-- [ ] Screenshot saved (for visual changes)
-- [ ] Response body logged (for API changes)
+**Automated Verification (ALWAYS include, choose by deliverable type):**
+**For Frontend/UI changes** (using playwright skill):
+\\\`\\\`\\\`
+# Agent executes via playwright browser automation:
+1. Navigate to: http://localhost:3000/login
+2. Fill: input[name="email"] with "test@example.com"
+3. Fill: input[name="password"] with "password123"
+4. Click: button[type="submit"]
+5. Wait for: selector ".dashboard-welcome" to be visible
+6. Assert: text "Welcome back" appears on page
+7. Screenshot: .sisyphus/evidence/task-1-login-success.png
+\\\`\\\`\\\`
+**For TUI/CLI changes** (using interactive_bash):
+\\\`\\\`\\\`
+# Agent executes via tmux session:
+1. Command: ./my-cli --config test.yaml
+2. Wait for: "Configuration loaded" in output
+3. Send keys: "q" to quit
+4. Assert: Exit code 0
+5. Assert: Output contains "Goodbye"
+\\\`\\\`\\\`
+**For API/Backend changes** (using Bash curl):
+\\\`\\\`\\\`bash
+# Agent runs:
+curl -s -X POST http://localhost:8080/api/users \\
+-H "Content-Type: application/json" \\
+-d '{"email":"new@test.com","name":"Test User"}' \\
+| jq '.id'
+# Assert: Returns non-empty UUID
+# Assert: HTTP status 201
+\\\`\\\`\\\`
+**For Library/Module changes** (using Bash node/bun):
+\\\`\\\`\\\`bash
+# Agent runs:
+bun -e "import { validateEmail } from './src/utils/validate'; console.log(validateEmail('test@example.com'))"
+# Assert: Output is "true"
+bun -e "import { validateEmail } from './src/utils/validate'; console.log(validateEmail('invalid'))"
+# Assert: Output is "false"
+\\\`\\\`\\\`
+**For Config/Infra changes** (using Bash):
+\\\`\\\`\\\`bash
+# Agent runs:
+docker compose up -d
+# Wait 5s for containers
+docker compose ps --format json | jq '.[].State'
+# Assert: All states are "running"
+\\\`\\\`\\\`
+**Evidence to Capture:**
+- [ ] Terminal output from verification commands (actual output, not expected)
+- [ ] Screenshot files in .sisyphus/evidence/ for UI changes
+- [ ] JSON response bodies for API changes
 **Commit**: YES | NO (groups with N)
 - Message: \`type(scope): desc\`
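In the API/Backend example of this hunk, the pipeline as written prints only the `id` field, so the "HTTP status 201" assertion has nothing to check against. One possible approach, assuming the agent adds curl's `-w '\n%{http_code}'` write-out option and captures the raw output without piping it through `jq`, is to split the status code from the body afterwards. The helper names below (`splitCurlOutput`, `checkCreateUser`) are hypothetical and not part of the commit.

```typescript
// Assumes the agent ran curl with the status appended on its own final line, e.g.:
//   curl -s -w '\n%{http_code}' -X POST http://localhost:8080/api/users ...
interface HttpCapture {
  status: number;
  body: string;
}

// Splits captured curl output into the response body and the trailing status code.
function splitCurlOutput(raw: string): HttpCapture {
  const trimmed = raw.replace(/\s+$/, "");
  const lastNewline = trimmed.lastIndexOf("\n");
  const statusText = trimmed.slice(lastNewline + 1);
  const body = lastNewline === -1 ? "" : trimmed.slice(0, lastNewline);
  return { status: Number(statusText), body };
}

// Loose UUID shape check (8-4-4-4-12 hex digits); an assumption about the id format.
const UUID_SHAPE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Applies both assertions from the example: status 201 and a UUID in ".id".
function checkCreateUser(raw: string): string[] {
  const { status, body } = splitCurlOutput(raw);
  const failures: string[] = [];
  if (status !== 201) {
    failures.push(`HTTP status ${status}, expected 201`);
  }
  let id: unknown;
  try {
    id = (JSON.parse(body) as { id?: unknown }).id;
  } catch {
    failures.push("response body is not a JSON object");
    return failures;
  }
  if (typeof id !== "string" || !UUID_SHAPE.test(id)) {
    failures.push("id is not a UUID");
  }
  return failures;
}
```

With this split, both assertions listed under the curl command become checkable from a single captured output rather than relying on the agent to infer the status.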