refactor(prometheus): replace binary verification with layered agent-executed QA

Restructure the verification strategy from binary (TDD xor manual) to layered
(TDD and/or agent QA). Elevate zero-human-intervention to a universal principle,
require a per-scenario, ultra-detailed QA format with named scenarios, negative
cases, and evidence capture. Remove the ambiguous 'manual QA' terminology.
Author: YeonGyu-Kim
Date: 2026-02-02 14:18:01 +09:00
Parent: 92639ca38f
Commit: e969ca5573
5 changed files with 248 additions and 95 deletions


@@ -3,20 +3,82 @@ import { PROMETHEUS_SYSTEM_PROMPT } from "./prometheus"
 describe("PROMETHEUS_SYSTEM_PROMPT Momus invocation policy", () => {
   test("should direct providing ONLY the file path string when invoking Momus", () => {
-    // given
+    //#given
     const prompt = PROMETHEUS_SYSTEM_PROMPT
-    // when / #then
-    // Should mention Momus and providing only the path
+    //#when / #then
     expect(prompt.toLowerCase()).toMatch(/momus.*only.*path|path.*only.*momus/)
   })
   test("should forbid wrapping Momus invocation in explanations or markdown", () => {
-    // given
+    //#given
     const prompt = PROMETHEUS_SYSTEM_PROMPT
-    // when / #then
-    // Should mention not wrapping or using markdown for the path
+    //#when / #then
     expect(prompt.toLowerCase()).toMatch(/not.*wrap|no.*explanation|no.*markdown/)
   })
 })
+describe("PROMETHEUS_SYSTEM_PROMPT zero human intervention", () => {
+  test("should enforce universal zero human intervention rule", () => {
+    //#given
+    const prompt = PROMETHEUS_SYSTEM_PROMPT
+    //#when
+    const lowerPrompt = prompt.toLowerCase()
+    //#then
+    expect(lowerPrompt).toContain("zero human intervention")
+    expect(lowerPrompt).toContain("forbidden")
+    expect(lowerPrompt).toMatch(/user manually tests|사용자가 직접 테스트/)
+  })
+  test("should require agent-executed QA scenarios as mandatory for all tasks", () => {
+    //#given
+    const prompt = PROMETHEUS_SYSTEM_PROMPT
+    //#when
+    const lowerPrompt = prompt.toLowerCase()
+    //#then
+    expect(lowerPrompt).toContain("agent-executed qa scenarios")
+    expect(lowerPrompt).toMatch(/mandatory.*all tasks|all tasks.*mandatory/)
+  })
+  test("should not contain ambiguous 'manual QA' terminology", () => {
+    //#given
+    const prompt = PROMETHEUS_SYSTEM_PROMPT
+    //#when / #then
+    expect(prompt).not.toMatch(/manual QA procedures/i)
+    expect(prompt).not.toMatch(/manual verification procedures/i)
+    expect(prompt).not.toMatch(/Manual-only/i)
+  })
+  test("should require per-scenario format with detailed structure", () => {
+    //#given
+    const prompt = PROMETHEUS_SYSTEM_PROMPT
+    //#when
+    const lowerPrompt = prompt.toLowerCase()
+    //#then
+    expect(lowerPrompt).toContain("preconditions")
+    expect(lowerPrompt).toContain("failure indicators")
+    expect(lowerPrompt).toContain("evidence")
+    expect(lowerPrompt).toMatch(/negative scenario/)
+  })
+  test("should require QA scenario adequacy in self-review checklist", () => {
+    //#given
+    const prompt = PROMETHEUS_SYSTEM_PROMPT
+    //#when
+    const lowerPrompt = prompt.toLowerCase()
+    //#then
+    expect(lowerPrompt).toMatch(/every task has agent-executed qa scenarios/)
+    expect(lowerPrompt).toMatch(/happy-path and negative/)
+    expect(lowerPrompt).toMatch(/zero acceptance criteria require human/)
+  })
+})

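The regex style used in these new tests can be sanity-checked in isolation. A minimal sketch, assuming an invented `samplePrompt` (the real PROMETHEUS_SYSTEM_PROMPT is far longer and is not reproduced here):

```typescript
// Invented stand-in for PROMETHEUS_SYSTEM_PROMPT, lowercased as the tests do.
const samplePrompt = `
When invoking Momus, provide ONLY the path to the plan file.
Do not wrap the path in explanations or markdown.
Agent-Executed QA Scenarios are mandatory for all tasks.
Zero human intervention: asking the user to verify is forbidden.
`.toLowerCase()

// Mirrors the Momus-path assertion: "momus", "only", "path" in either order.
const momusPathPolicy = /momus.*only.*path|path.*only.*momus/
console.log(momusPathPolicy.test(samplePrompt)) // true

// Mirrors the no-wrapping assertion.
const noWrapPolicy = /not.*wrap|no.*explanation|no.*markdown/
console.log(noWrapPolicy.test(samplePrompt)) // true
```

Note that `.` does not cross newlines, so each alternative must match within a single line of the prompt, which keeps these assertions reasonably tight.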

@@ -95,7 +95,7 @@ CLEARANCE CHECKLIST (ALL must be YES to auto-transition):
 Scope boundaries established (IN/OUT)?
 No critical ambiguities remaining?
 Technical approach decided?
-Test strategy confirmed (TDD/manual)?
+Test strategy confirmed (TDD/tests-after/none + agent QA)?
 No blocking questions outstanding?
 \`\`\`
@@ -201,7 +201,7 @@ CLEARANCE CHECKLIST:
 Scope boundaries established (IN/OUT)?
 No critical ambiguities remaining?
 Technical approach decided?
-Test strategy confirmed (TDD/manual)?
+Test strategy confirmed (TDD/tests-after/none + agent QA)?
 No blocking questions outstanding?
 ALL YES? Announce: "All requirements clear. Proceeding to plan generation." Then transition.


@@ -141,10 +141,15 @@ delegate_task(subagent_type="explore", prompt="I'm assessing this project's test
 \`\`\`
 "I see you have test infrastructure set up ([framework name]).
-**Should this work include tests?**
+**Should this work include automated tests?**
 - YES (TDD): I'll structure tasks as RED-GREEN-REFACTOR. Each TODO will include test cases as part of acceptance criteria.
 - YES (Tests after): I'll add test tasks after implementation tasks.
-- NO: I'll design detailed manual verification procedures instead."
+- NO: No unit/integration tests.
+
+Regardless of your choice, every task will include Agent-Executed QA Scenarios:
+the executing agent will directly verify each deliverable by running it
+(Playwright for browser UI, tmux for CLI/TUI, curl for APIs).
+Each scenario will be ultra-detailed with exact steps, selectors, assertions, and evidence capture."
 \`\`\`
 **If test infrastructure DOES NOT exist:**
@@ -157,10 +162,14 @@ delegate_task(subagent_type="explore", prompt="I'm assessing this project's test
 - Configuration files
 - Example test to verify setup
 - Then TDD workflow for the actual work
-- NO: Got it. I'll design exhaustive manual QA procedures instead. Each TODO will include:
-  - Specific commands to run
-  - Expected outputs to verify
-  - Interactive verification steps (browser for frontend, terminal for CLI/TUI)"
+- NO: No problem, no unit tests needed.
+
+Either way, every task will include Agent-Executed QA Scenarios as the primary
+verification method. The executing agent will directly run the deliverable and verify it:
+- Frontend/UI: Playwright opens browser, navigates, fills forms, clicks, asserts DOM, screenshots
+- CLI/TUI: tmux runs the command, sends keystrokes, validates output, checks exit code
+- API: curl sends requests, parses JSON, asserts fields and status codes
+- Each scenario ultra-detailed: exact selectors, concrete test data, expected results, evidence paths"
 \`\`\`
 #### Step 3: Record Decision
@@ -169,9 +178,9 @@ Add to draft immediately:
 \`\`\`markdown
 ## Test Strategy Decision
 - **Infrastructure exists**: YES/NO
-- **User wants tests**: YES (TDD) / YES (after) / NO
+- **Automated tests**: YES (TDD) / YES (after) / NO
 - **If setting up**: [framework choice]
-- **QA approach**: TDD / Tests-after / Manual verification
+- **Agent-Executed QA**: ALWAYS (mandatory for all tasks regardless of test choice)
 \`\`\`
 **This decision affects the ENTIRE plan structure. Get it early.**

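The Test Strategy Decision record lends itself to a typed representation. A hypothetical sketch (the `TestStrategyDecision` type is illustrative, not part of the actual codebase) showing how the "Agent-Executed QA: ALWAYS" rule can be made non-optional at the type level:

```typescript
// Field names mirror the markdown template above; the type itself is invented.
type TestStrategyDecision = {
  infrastructureExists: boolean
  automatedTests: "TDD" | "tests-after" | "none"
  framework?: "bun test" | "vitest" | "jest" | "pytest"
  agentExecutedQA: true // literal type: cannot be set to false
}

const decision: TestStrategyDecision = {
  infrastructureExists: true,
  automatedTests: "TDD",
  framework: "bun test",
  agentExecutedQA: true,
}

console.log(decision.agentExecutedQA) // true
```

Encoding the invariant as the literal type `true` means a plan record that tries to opt out of agent-executed QA fails to compile, mirroring the prompt's "mandatory for all tasks regardless of test choice".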

@@ -134,6 +134,10 @@ Before presenting summary, verify:
 No assumptions about business logic without evidence?
 Guardrails from Metis review incorporated?
 Scope boundaries clearly defined?
+Every task has Agent-Executed QA Scenarios (not just test assertions)?
+QA scenarios include BOTH happy-path AND negative/error scenarios?
+Zero acceptance criteria require human intervention?
+QA scenarios use specific selectors/data, not vague descriptions?
 \`\`\`
 ### Gap Handling Protocol
### Gap Handling Protocol ### Gap Handling Protocol


@@ -70,12 +70,23 @@ Generate plan to: \`.sisyphus/plans/{name}.md\`
 ## Verification Strategy (MANDATORY)
-> This section is determined during interview based on Test Infrastructure Assessment.
-> The choice here affects ALL TODO acceptance criteria.
+> **UNIVERSAL RULE: ZERO HUMAN INTERVENTION**
+>
+> ALL tasks in this plan MUST be verifiable WITHOUT any human action.
+> This is NOT conditional: it applies to EVERY task, regardless of test strategy.
+>
+> **FORBIDDEN** acceptance criteria that require:
+> - "User manually tests..." / "사용자가 직접 테스트..."
+> - "User visually confirms..." / "사용자가 눈으로 확인..."
+> - "User interacts with..." / "사용자가 직접 조작..."
+> - "Ask user to verify..." / "사용자에게 확인 요청..."
+> - ANY step where a human must perform an action
+>
+> **ALL verification is executed by the agent** using tools (Playwright, interactive_bash, curl, etc.). No exceptions.
 ### Test Decision
 - **Infrastructure exists**: [YES/NO]
-- **User wants tests**: [TDD / Tests-after / Manual-only]
+- **Automated tests**: [TDD / Tests-after / None]
 - **Framework**: [bun test / vitest / jest / pytest / none]
 ### If TDD Enabled
@@ -102,37 +113,65 @@ Each TODO follows RED-GREEN-REFACTOR:
 - Example: Create \`src/__tests__/example.test.ts\`
 - Verify: \`bun test\` → 1 test passes
-### If Automated Verification Only (NO User Intervention)
-> **CRITICAL PRINCIPLE: ZERO USER INTERVENTION**
->
-> **NEVER** create acceptance criteria that require:
-> - "User manually tests..." / "사용자가 직접 테스트..."
-> - "User visually confirms..." / "사용자가 눈으로 확인..."
-> - "User interacts with..." / "사용자가 직접 조작..."
-> - "Ask user to verify..." / "사용자에게 확인 요청..."
-> - ANY step that requires a human to perform an action
->
-> **ALL verification MUST be automated and executable by the agent.**
-> If a verification cannot be automated, find an automated alternative or explicitly note it as a known limitation.
-Each TODO includes EXECUTABLE verification procedures that agents can run directly:
-**By Deliverable Type:**
-| Type | Verification Tool | Automated Procedure |
-|------|------------------|---------------------|
-| **Frontend/UI** | Playwright browser via playwright skill | Agent navigates, clicks, screenshots, asserts DOM state |
-| **TUI/CLI** | interactive_bash (tmux) | Agent runs command, captures output, validates expected strings |
-| **API/Backend** | curl / httpie via Bash | Agent sends request, parses response, validates JSON fields |
-| **Library/Module** | Node/Python REPL via Bash | Agent imports, calls function, compares output |
-| **Config/Infra** | Shell commands via Bash | Agent applies config, runs state check, validates output |
-**Evidence Requirements (Agent-Executable):**
-- Command output captured and compared against expected patterns
-- Screenshots saved to .sisyphus/evidence/ for visual verification
-- JSON response fields validated with specific assertions
-- Exit codes checked (0 = success)
+### Agent-Executed QA Scenarios (MANDATORY for ALL tasks)
+> Whether TDD is enabled or not, EVERY task MUST include Agent-Executed QA Scenarios.
+> - **With TDD**: QA scenarios complement unit tests at integration/E2E level
+> - **Without TDD**: QA scenarios are the PRIMARY verification method
+>
+> These describe how the executing agent DIRECTLY verifies the deliverable
+> by running it: opening browsers, executing commands, sending API requests.
+> The agent performs what a human tester would do, but automated via tools.
+**Verification Tool by Deliverable Type:**
+| Type | Tool | How Agent Verifies |
+|------|------|-------------------|
+| **Frontend/UI** | Playwright (playwright skill) | Navigate, interact, assert DOM, screenshot |
+| **TUI/CLI** | interactive_bash (tmux) | Run command, send keystrokes, validate output |
+| **API/Backend** | Bash (curl/httpie) | Send requests, parse responses, assert fields |
+| **Library/Module** | Bash (bun/node REPL) | Import, call functions, compare output |
+| **Config/Infra** | Bash (shell commands) | Apply config, run state checks, validate |
+**Each Scenario MUST Follow This Format:**
+\`\`\`
+Scenario: [Descriptive name: what user action/flow is being verified]
+Tool: [Playwright / interactive_bash / Bash]
+Preconditions: [What must be true before this scenario runs]
+Steps:
+  1. [Exact action with specific selector/command/endpoint]
+  2. [Next action with expected intermediate state]
+  3. [Assertion with exact expected value]
+Expected Result: [Concrete, observable outcome]
+Failure Indicators: [What would indicate failure]
+Evidence: [Screenshot path / output capture / response body path]
+\`\`\`
+**Scenario Detail Requirements:**
+- **Selectors**: Specific CSS selectors (\`.login-button\`, not "the login button")
+- **Data**: Concrete test data (\`"test@example.com"\`, not \`"[email]"\`)
+- **Assertions**: Exact values (\`text contains "Welcome back"\`, not "verify it works")
+- **Timing**: Include wait conditions where relevant (\`Wait for .dashboard (timeout: 10s)\`)
+- **Negative Scenarios**: At least ONE failure/error scenario per feature
+- **Evidence Paths**: Specific file paths (\`.sisyphus/evidence/task-N-scenario-name.png\`)
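These detail requirements can be enforced mechanically. A hypothetical sketch (the `QAScenario` type and `scenarioIsConcrete` helper are illustrative, not part of the plan tooling) of a validator that rejects vague scenarios:

```typescript
// Record type mirroring the per-scenario format above.
type QAScenario = {
  name: string
  tool: "Playwright" | "interactive_bash" | "Bash"
  preconditions: string
  steps: string[]
  expectedResult: string
  failureIndicators: string
  evidence: string
}

// A scenario is "concrete" if it has at least one step, no placeholder data
// or hand-wavy assertions, and an evidence path under .sisyphus/evidence/.
function scenarioIsConcrete(s: QAScenario): boolean {
  const vague = /\[(email|placeholder)\]|verify it works/i
  return (
    s.steps.length >= 1 &&
    s.steps.every((step) => !vague.test(step)) &&
    s.evidence.startsWith(".sisyphus/evidence/")
  )
}

const login: QAScenario = {
  name: "Successful login redirects to dashboard",
  tool: "Playwright",
  preconditions: "Dev server running on localhost:3000, test user exists",
  steps: ['Fill input[name="email"] with "test@example.com"'],
  expectedResult: "Dashboard loads with welcome message",
  failureIndicators: "Stays on /login or error banner shown",
  evidence: ".sisyphus/evidence/task-1-login-success.png",
}

console.log(scenarioIsConcrete(login)) // true
```

The `vague` pattern is deliberately small; a real checker would accumulate banned phrases ("verify it works", bracketed placeholders) as reviewers find them.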
+**Anti-patterns (NEVER write scenarios like this):**
+- "Verify the login page works correctly"
+- "Check that the API returns the right data"
+- "Test the form validation"
+- "User opens browser and confirms..."
+
+**Write scenarios like this instead:**
+- \`Navigate to /login → Fill input[name="email"] with "test@example.com" → Fill input[name="password"] with "Pass123!" → Click button[type="submit"] → Wait for /dashboard → Assert h1 contains "Welcome"\`
+- \`POST /api/users {"name":"Test","email":"new@test.com"} → Assert status 201 → Assert response.id is UUID → GET /api/users/{id} → Assert name equals "Test"\`
+- \`Run ./cli --config test.yaml → Wait for "Loaded" in stdout → Send "q" → Assert exit code 0 → Assert stdout contains "Goodbye"\`
+**Evidence Requirements:**
+- Screenshots: \`.sisyphus/evidence/\` for all UI verifications
+- Terminal output: Captured for CLI/TUI verifications
+- Response bodies: Saved for API verifications
+- All evidence referenced by specific file path in acceptance criteria
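The evidence naming convention that recurs through this plan template (`.sisyphus/evidence/task-{N}-{scenario-slug}.{ext}`) can be captured in a tiny helper. A sketch, with a hypothetical `evidencePath` function:

```typescript
// Builds .sisyphus/evidence/task-{N}-{scenario-slug}.{ext} from a scenario name.
function evidencePath(taskN: number, scenarioName: string, ext: string): string {
  const slug = scenarioName
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // non-alphanumeric runs become single hyphens
    .replace(/^-|-$/g, "")       // trim leading/trailing hyphens
  return `.sisyphus/evidence/task-${taskN}-${slug}.${ext}`
}

console.log(evidencePath(1, "Successful login redirects to dashboard", "png"))
// .sisyphus/evidence/task-1-successful-login-redirects-to-dashboard.png
```

Deriving the path from the scenario name keeps evidence files traceable back to the named scenario they belong to.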
 ---
@@ -242,76 +281,115 @@ Parallel Speedup: ~40% faster than sequential
 **Acceptance Criteria**:
-> **CRITICAL: AGENT-EXECUTABLE VERIFICATION ONLY**
->
-> - Acceptance = EXECUTION by the agent, not "user checks if it works"
-> - Every criterion MUST be verifiable by running a command or using a tool
-> - NO steps like "user opens browser", "user clicks", "user confirms"
-> - If you write "[placeholder]" - REPLACE IT with actual values based on task context
+> **AGENT-EXECUTABLE VERIFICATION ONLY**: No human action permitted.
+> Every criterion MUST be verifiable by running a command or using a tool.
+> REPLACE all placeholders with actual values from task context.
 **If TDD (tests enabled):**
 - [ ] Test file created: src/auth/login.test.ts
 - [ ] Test covers: successful login returns JWT token
 - [ ] bun test src/auth/login.test.ts → PASS (3 tests, 0 failures)
-**Automated Verification (ALWAYS include, choose by deliverable type):**
-**For Frontend/UI changes** (using playwright skill):
+**Agent-Executed QA Scenarios (MANDATORY; per-scenario, ultra-detailed):**
+> Write MULTIPLE named scenarios per task: happy path AND failure cases.
+> Each scenario = exact tool + steps with real selectors/data + evidence path.
+
+**Example Frontend/UI (Playwright):**
 \\\`\\\`\\\`
-# Agent executes via playwright browser automation:
-1. Navigate to: http://localhost:3000/login
-2. Fill: input[name="email"] with "test@example.com"
-3. Fill: input[name="password"] with "password123"
-4. Click: button[type="submit"]
-5. Wait for: selector ".dashboard-welcome" to be visible
-6. Assert: text "Welcome back" appears on page
-7. Screenshot: .sisyphus/evidence/task-1-login-success.png
+Scenario: Successful login redirects to dashboard
+Tool: Playwright (playwright skill)
+Preconditions: Dev server running on localhost:3000, test user exists
+Steps:
+  1. Navigate to: http://localhost:3000/login
+  2. Wait for: input[name="email"] visible (timeout: 5s)
+  3. Fill: input[name="email"] "test@example.com"
+  4. Fill: input[name="password"] "ValidPass123!"
+  5. Click: button[type="submit"]
+  6. Wait for: navigation to /dashboard (timeout: 10s)
+  7. Assert: h1 text contains "Welcome back"
+  8. Assert: cookie "session_token" exists
+  9. Screenshot: .sisyphus/evidence/task-1-login-success.png
+Expected Result: Dashboard loads with welcome message
+Evidence: .sisyphus/evidence/task-1-login-success.png
+
+Scenario: Login fails with invalid credentials
+Tool: Playwright (playwright skill)
+Preconditions: Dev server running, no valid user with these credentials
+Steps:
+  1. Navigate to: http://localhost:3000/login
+  2. Fill: input[name="email"] "wrong@example.com"
+  3. Fill: input[name="password"] "WrongPass"
+  4. Click: button[type="submit"]
+  5. Wait for: .error-message visible (timeout: 5s)
+  6. Assert: .error-message text contains "Invalid credentials"
+  7. Assert: URL is still /login (no redirect)
+  8. Screenshot: .sisyphus/evidence/task-1-login-failure.png
+Expected Result: Error message shown, stays on login page
+Evidence: .sisyphus/evidence/task-1-login-failure.png
 \\\`\\\`\\\`
-**For TUI/CLI changes** (using interactive_bash):
+**Example API/Backend (curl):**
 \\\`\\\`\\\`
-# Agent executes via tmux session:
-1. Command: ./my-cli --config test.yaml
-2. Wait for: "Configuration loaded" in output
-3. Send keys: "q" to quit
-4. Assert: Exit code 0
-5. Assert: Output contains "Goodbye"
+Scenario: Create user returns 201 with UUID
+Tool: Bash (curl)
+Preconditions: Server running on localhost:8080
+Steps:
+  1. curl -s -w "\\n%{http_code}" -X POST http://localhost:8080/api/users \\
+       -H "Content-Type: application/json" \\
+       -d '{"email":"new@test.com","name":"Test User"}'
+  2. Assert: HTTP status is 201
+  3. Assert: response.id matches UUID format
+  4. GET /api/users/{returned-id} → Assert name equals "Test User"
+Expected Result: User created and retrievable
+Evidence: Response bodies captured
+
+Scenario: Duplicate email returns 409
+Tool: Bash (curl)
+Preconditions: User with email "new@test.com" already exists
+Steps:
+  1. Repeat POST with same email
+  2. Assert: HTTP status is 409
+  3. Assert: response.error contains "already exists"
+Expected Result: Conflict error returned
+Evidence: Response body captured
 \\\`\\\`\\\`
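The "matches UUID format" step in the curl scenario above is the one assertion that needs a concrete definition. A sketch of how an executing agent might implement it, using invented stand-in response objects rather than live curl output (field names are assumed from the example):

```typescript
// RFC 4122 version-4 UUID: 8-4-4-4-12 hex groups, version nibble 4,
// variant nibble in [89ab].
const uuidV4 =
  /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i

// Invented stand-ins for the parsed curl responses.
const createResponse = {
  status: 201,
  body: { id: "3f2b8c1a-9d4e-4f6a-8b2c-1d5e7f9a0b3c", name: "Test User" },
}
const duplicateResponse = {
  status: 409,
  body: { error: "email already exists" },
}

// "Assert: HTTP status is 201" and "Assert: response.id matches UUID format"
console.log(createResponse.status === 201 && uuidV4.test(createResponse.body.id)) // true
// "Assert: HTTP status is 409" and "Assert: response.error contains 'already exists'"
console.log(duplicateResponse.status === 409 && duplicateResponse.body.error.includes("already exists")) // true
```

Pinning the regex down like this avoids the scenario silently passing on non-UUID ids such as incrementing integers.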
-**For API/Backend changes** (using Bash curl):
-\\\`\\\`\\\`bash
-# Agent runs:
-curl -s -X POST http://localhost:8080/api/users \\
-  -H "Content-Type: application/json" \\
-  -d '{"email":"new@test.com","name":"Test User"}' \\
-  | jq '.id'
-# Assert: Returns non-empty UUID
-# Assert: HTTP status 201
-\\\`\\\`\\\`
-**For Library/Module changes** (using Bash node/bun):
-\\\`\\\`\\\`bash
-# Agent runs:
-bun -e "import { validateEmail } from './src/utils/validate'; console.log(validateEmail('test@example.com'))"
-# Assert: Output is "true"
-bun -e "import { validateEmail } from './src/utils/validate'; console.log(validateEmail('invalid'))"
-# Assert: Output is "false"
-\\\`\\\`\\\`
-**For Config/Infra changes** (using Bash):
-\\\`\\\`\\\`bash
-# Agent runs:
-docker compose up -d
-# Wait 5s for containers
-docker compose ps --format json | jq '.[].State'
-# Assert: All states are "running"
-\\\`\\\`\\\`
+**Example TUI/CLI (interactive_bash):**
+\\\`\\\`\\\`
+Scenario: CLI loads config and displays menu
+Tool: interactive_bash (tmux)
+Preconditions: Binary built, test config at ./test.yaml
+Steps:
+  1. tmux new-session: ./my-cli --config test.yaml
+  2. Wait for: "Configuration loaded" in output (timeout: 5s)
+  3. Assert: Menu items visible ("1. Create", "2. List", "3. Exit")
+  4. Send keys: "3" then Enter
+  5. Assert: "Goodbye" in output
+  6. Assert: Process exited with code 0
+Expected Result: CLI starts, shows menu, exits cleanly
+Evidence: Terminal output captured
+
+Scenario: CLI handles missing config gracefully
+Tool: interactive_bash (tmux)
+Preconditions: No config file at ./nonexistent.yaml
+Steps:
+  1. tmux new-session: ./my-cli --config nonexistent.yaml
+  2. Wait for: output (timeout: 3s)
+  3. Assert: stderr contains "Config file not found"
+  4. Assert: Process exited with code 1
+Expected Result: Meaningful error, non-zero exit
+Evidence: Error output captured
+\\\`\\\`\\\`
 **Evidence to Capture:**
-- [ ] Terminal output from verification commands (actual output, not expected)
-- [ ] Screenshot files in .sisyphus/evidence/ for UI changes
-- [ ] JSON response bodies for API changes
+- [ ] Screenshots in .sisyphus/evidence/ for UI scenarios
+- [ ] Terminal output for CLI/TUI scenarios
+- [ ] Response bodies for API scenarios
+- [ ] Each evidence file named: task-{N}-{scenario-slug}.{ext}
 **Commit**: YES | NO (groups with N)
 - Message: \`type(scope): desc\`