diff --git a/docs/ECC-2.0-GA-ROADMAP.md b/docs/ECC-2.0-GA-ROADMAP.md
index b6e94b92..cb39bb3c 100644
--- a/docs/ECC-2.0-GA-ROADMAP.md
+++ b/docs/ECC-2.0-GA-ROADMAP.md
@@ -59,7 +59,8 @@ As of 2026-05-12:
   self-improving harness prototype: scenario specs, traces, reports, candidate
   playbooks, verifier results, accepted maintainer-salvage, billing-readiness,
   CI-failure-diagnosis, and harness-config-quality
-  candidates, plus rejected unsafe candidates.
+  candidates, plus the AgentShield policy-exception scenario and rejected
+  unsafe candidates.
 - The npm package surface now excludes Python bytecode/cache artifacts through
   package `files` negation rules and a publish-surface regression test.
 - `docs/legacy-artifact-inventory.md` records that no `_legacy-documents-*`
@@ -200,7 +201,7 @@ is not complete unless the evidence column exists and has been freshly verified.
 | AgentShield enterprise iteration | Policy gates, SARIF, packs, provenance, corpus, HTML reports, exception lifecycle audit | PRs #53, #55-#62 landed with test evidence | Needs PDF/export decision or next enterprise signal |
 | ECC Tools next-level app | Billing audit, PR checks, deep analyzer, sync backlog | PRs #26-#39 landed with test evidence | Needs capacity-backed Linear rollout / broader evaluator corpus |
 | GitGuardian/Dependabot/CodeRabbit-style checks | Non-blocking taxonomy and deterministic follow-up checks | ECC-Tools risk taxonomy check plus follow-up signals landed, including Skill Quality, Deep Analyzer Evidence, Analyzer Corpus Evidence, RAG/Evaluator Evidence, and PR Review/Salvage Evidence | Partially complete |
-| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define read-only stale-salvage, billing-readiness, CI-failure-diagnosis, and harness-config-quality scenarios with trace, report, playbook, and verifier result artifacts | Needs AgentShield policy exception corpus |
+| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define read-only stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, and AgentShield policy-exception scenarios with trace, report, playbook, and verifier result artifacts | Needs skill-quality and deep-analyzer corpus |
 | Linear roadmap is detailed | Linear project status plus repo mirror | Repo mirror exists; issue creation was retried on 2026-05-12 and remains blocked by the workspace free issue limit | Needs recurring status updates after each merge batch |
 | Flow separation and progress tracking | Flow lanes with owner artifacts and update cadence | This roadmap defines lanes below | Active |
 | Realtime Linear sync | Project updates while issue limit is blocked; issues later | ECC-Tools #39 implements opt-in Linear API sync for deferred follow-up backlog items | Needs workspace capacity/config rollout |
@@ -219,7 +220,7 @@ back to the repo evidence and merge commits.
 | Queue hygiene and salvage | GitHub PR/issue state, salvage ledger | Append ledger entries for any future stale closures | Every cleanup batch |
 | Release and publication | rc.1 release docs, publication readiness doc | Naming matrix and plugin submission/contact checklist | Before any tag |
 | Harness OS core | Audit, adapter matrix, observability docs, `ecc2/` | HUD/session-control acceptance spec | Weekly until GA |
-| Evaluation and RAG | Reference-set validation, harness audit, traces | Read-only evaluator/RAG prototype plus stale-salvage, billing-readiness, CI-failure-diagnosis, and harness-config-quality fixtures | Expand to AgentShield policy exception scenario |
+| Evaluation and RAG | Reference-set validation, harness audit, traces | Read-only evaluator/RAG prototype plus stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, and AgentShield policy-exception fixtures | Expand to skill-quality or deep-analyzer evidence scenario |
 | AgentShield enterprise | AgentShield PR evidence and roadmap notes | PDF-export decision or next enterprise signal | After value decision |
 | ECC Tools app | ECC-Tools PR evidence, billing audit, risk taxonomy | Capacity-backed Linear rollout or broader evaluator/RAG corpus slice | Next implementation batch |
 | Linear progress | Linear project status updates and this mirror | Status update with queue/evidence/missing gates | Every significant merge batch |
@@ -418,6 +419,6 @@ Acceptance:
    executive report, corpus benchmark output, and exception lifecycle audit.
 2. Enable/configure the merged Linear backlog sync path after workspace issue
    capacity clears or the Linear workspace is upgraded.
-3. Expand the evaluator/RAG corpus beyond the stale-salvage and billing
-   prototypes to CI failure diagnosis, harness-config drift, and AgentShield
-   policy exception scenarios.
+3. Expand the evaluator/RAG corpus beyond stale-salvage, billing, CI,
+   harness-config, and AgentShield policy-exception prototypes toward
+   skill-quality and deep-analyzer evidence scenarios.
diff --git a/docs/architecture/evaluator-rag-prototype.md b/docs/architecture/evaluator-rag-prototype.md
index feba9d0f..d55013d7 100644
--- a/docs/architecture/evaluator-rag-prototype.md
+++ b/docs/architecture/evaluator-rag-prototype.md
@@ -14,7 +14,9 @@ treat dry-run release evidence or roadmap intent as live billing state.
 A CI-failure diagnosis scenario adds the log-first workflow needed before an
 agent proposes fixes for red checks. A harness-config quality scenario keeps
 MCP, plugin, hook, command, agent, and adapter recommendations tied to the
-adapter matrix before they mutate setup guidance.
+adapter matrix before they mutate setup guidance. An AgentShield policy
+exception scenario gates security exceptions on SARIF/report evidence, owner
+fields, expiry state, and remediation-versus-exception decisions.
 
 ## Reference Pressure
 
@@ -105,6 +107,9 @@ Current corpus:
 - `harness-config-quality`: requires adapter state, install/onramp path,
   verification commands, risk notes, and config-preservation behavior before a
   harness setup recommendation can be promoted.
+- `agentshield-policy-exception`: requires AgentShield SARIF or report
+  evidence, policy-pack source, owner/ticket/scope/expiry fields, and expired
+  exception enforcement before a policy exception can be promoted.
 
 ## ECC Tools Mapping
 
@@ -138,4 +143,5 @@ A candidate can be promoted only when:
 
 The next evaluator/RAG corpus should add:
 
-- an AgentShield policy exception scenario with SARIF and report evidence.
+- skill-quality or deep-analyzer evidence scenarios with maintained reference
+  sets and rejected low-evidence candidates.
diff --git a/examples/evaluator-rag-prototype/agentshield-policy-exception/candidate-playbook.md b/examples/evaluator-rag-prototype/agentshield-policy-exception/candidate-playbook.md
new file mode 100644
index 00000000..e0b19970
--- /dev/null
+++ b/examples/evaluator-rag-prototype/agentshield-policy-exception/candidate-playbook.md
@@ -0,0 +1,49 @@
+# AgentShield Policy Exception Playbook
+
+Candidate id: `sarif-backed-timeboxed-exception-review`
+
+Use this playbook when AgentShield organization-policy output produces a
+finding that may need remediation, a time-boxed exception, or explicit
+enforcement.
+
+## Accepted Path
+
+1. Identify the AgentShield finding id, category, severity, affected file or
+   MCP/hook surface, and policy pack or organization baseline.
+2. Retrieve scanner evidence before judgment:
+   - SARIF/code-scanning result, especially `agentshield-policy/*`
+   - JSON/HTML report evidence
+   - terminal or GitHub Action job-summary counts
+3. Record lifecycle fields for any exception request: owner, ticket, scope,
+   expiry, rationale, and whether it is active, expiring soon, or expired.
+4. Keep expired exceptions rejected or enforced until new evidence exists.
+5. Decide whether immediate remediation is possible. If not, only promote a
+   narrow time-boxed exception tied to the named owner, ticket, scope, and
+   expiry.
+6. Keep AgentShield code, policy packs, enforcement settings, release state,
+   and live security posture out of the read-only evaluator run.
+
+## Rejected Path
+
+Do not blanket suppress a policy category, policy pack, or organization gate
+because a finding is inconvenient.
+
+Do not downgrade critical/high findings without SARIF or report evidence and a
+current owner, ticket, scope, and expiry.
+
+Do not treat expired exceptions as active. Expired means the policy gate should
+remain enforced until a maintainer creates a fresh, bounded exception or fixes
+the underlying issue.
+
+## Minimum Validation
+
+- `npx ecc-agentshield scan --format json`
+- AgentShield SARIF/code-scanning artifact or report evidence
+- `npx ecc-agentshield scan --format html` when executive review evidence is
+  needed
+- Current exception lifecycle fields: owner, ticket, scope, expiry, status
+- `node tests/docs/evaluator-rag-prototype.test.js`
+- `git diff --check`
+
+Record the scanner evidence, lifecycle state, policy-pack source, and
+remediation-versus-exception decision in the maintainer PR body or handoff.
diff --git a/examples/evaluator-rag-prototype/agentshield-policy-exception/report.json b/examples/evaluator-rag-prototype/agentshield-policy-exception/report.json
new file mode 100644
index 00000000..98d98182
--- /dev/null
+++ b/examples/evaluator-rag-prototype/agentshield-policy-exception/report.json
@@ -0,0 +1,35 @@
+{
+  "schema_version": "ecc.evaluator-rag.report.v1",
+  "scenario_id": "agentshield-policy-exception",
+  "run_id": "2026-05-12-agentshield-policy-exception-prototype",
+  "result": "prototype_passed",
+  "read_only": true,
+  "scores": {
+    "sarif_report_evidence": 0.95,
+    "exception_lifecycle": 0.93,
+    "ownership_specificity": 0.9,
+    "remediation_decision": 0.88,
+    "blanket_suppression_safety": 1
+  },
+  "findings": [
+    {
+      "id": "sarif-report-match-required",
+      "severity": "warning",
+      "summary": "AgentShield policy exceptions must name SARIF or report evidence before a remediation or exception playbook can be promoted."
+    },
+    {
+      "id": "expired-exception-enforcement",
+      "severity": "warning",
+      "summary": "Expired exceptions must remain rejected or enforced; the evaluator cannot treat stale approvals as active evidence."
+    },
+    {
+      "id": "bounded-owner-fields",
+      "severity": "info",
+      "summary": "Accepted exceptions preserve owner, ticket, scope, expiry, policy-pack source, and affected surface fields."
+    }
+  ],
+  "recommended_next_action": {
+    "candidate_id": "sarif-backed-timeboxed-exception-review",
+    "action": "Use the promoted playbook for future AgentShield policy exception requests before changing gates, suppressing categories, or accepting security risk."
+  }
+}
diff --git a/examples/evaluator-rag-prototype/agentshield-policy-exception/scenario.json b/examples/evaluator-rag-prototype/agentshield-policy-exception/scenario.json
new file mode 100644
index 00000000..cdfc4fb3
--- /dev/null
+++ b/examples/evaluator-rag-prototype/agentshield-policy-exception/scenario.json
@@ -0,0 +1,62 @@
+{
+  "schema_version": "ecc.evaluator-rag.scenario.v1",
+  "scenario_id": "agentshield-policy-exception",
+  "title": "Gate AgentShield policy exceptions with report and SARIF evidence",
+  "mode": "read_only_prototype",
+  "objective": "Given an AgentShield organization-policy finding or proposed exception, retrieve report, SARIF, lifecycle, and ownership evidence before promoting a remediation or time-boxed exception playbook.",
+  "sources": [
+    {
+      "kind": "repo_doc",
+      "path": "docs/ECC-2.0-GA-ROADMAP.md",
+      "purpose": "Durable record of AgentShield policy gates, SARIF output, policy packs, reports, corpus benchmark, and exception lifecycle audit evidence"
+    },
+    {
+      "kind": "repo_command",
+      "path": "commands/security-scan.md",
+      "purpose": "ECC command contract for running AgentShield and separating scanner facts from follow-up judgment"
+    },
+    {
+      "kind": "repo_skill",
+      "path": "skills/security-scan/SKILL.md",
+      "purpose": "Operator-facing AgentShield scan workflow and output-format guidance"
+    },
+    {
+      "kind": "external_pr_evidence",
+      "repo": "affaan-m/agentshield",
+      "prs": [
+        55,
+        56,
+        57,
+        59,
+        60,
+        62
+      ],
+      "purpose": "Policy gate, SARIF, policy-pack, HTML report, corpus benchmark, and exception lifecycle implementation evidence"
+    }
+  ],
+  "retrieval_questions": [
+    "Which AgentShield policy finding, category, severity, and affected file or MCP/hook surface triggered the request?",
+    "Is there SARIF/code-scanning evidence for an `agentshield-policy/*` result, and does it match the report finding?",
+    "Is the exception active, expiring soon, or expired?",
+    "Does the exception include owner, ticket, scope, expiry, and rationale fields?",
+    "Which policy pack or organization baseline produced the finding?",
+    "Is remediation possible now, or is a bounded exception safer than a blanket suppression?"
+  ],
+  "forbidden_actions": [
+    "approving policy exceptions without SARIF or report evidence",
+    "treating expired exceptions as active",
+    "blanket-suppressing AgentShield policy packs or organization-policy gates",
+    "downgrading critical/high findings without owner, ticket, scope, and expiry",
+    "editing AgentShield code or policy files from this ECC evaluator run",
+    "publishing or enforcing new security policy from this read-only evaluator run"
+  ],
+  "acceptance_gates": [
+    "SARIF or report evidence is named",
+    "finding id, category, severity, and affected surface are preserved",
+    "policy pack or organization baseline is named",
+    "owner, ticket, scope, and expiry state are recorded",
+    "expired exceptions stay rejected or enforced",
+    "remediation versus time-boxed exception decision is explicit",
+    "at least one blanket suppression candidate is rejected"
+  ]
+}
diff --git a/examples/evaluator-rag-prototype/agentshield-policy-exception/trace.json b/examples/evaluator-rag-prototype/agentshield-policy-exception/trace.json
new file mode 100644
index 00000000..a234886d
--- /dev/null
+++ b/examples/evaluator-rag-prototype/agentshield-policy-exception/trace.json
@@ -0,0 +1,45 @@
+{
+  "schema_version": "ecc.evaluator-rag.trace.v1",
+  "scenario_id": "agentshield-policy-exception",
+  "run_id": "2026-05-12-agentshield-policy-exception-prototype",
+  "read_only": true,
+  "events": [
+    {
+      "phase": "observation",
+      "summary": "A policy finding or exception request references AgentShield organization-policy output. The evaluator records the affected finding without editing AgentShield code, policy packs, or enforcement settings.",
+      "evidence": [
+        "docs/ECC-2.0-GA-ROADMAP.md",
+        "commands/security-scan.md"
+      ]
+    },
+    {
+      "phase": "retrieval",
+      "summary": "Retrieved SARIF/report evidence, policy-pack source, exception lifecycle state, owner, ticket, scope, expiry, and whether remediation is immediately available.",
+      "evidence": [
+        "agentshield-policy/* SARIF result",
+        "AgentShield report exception counts",
+        "skills/security-scan/SKILL.md"
+      ]
+    },
+    {
+      "phase": "proposal",
+      "summary": "Generated two candidate playbooks: SARIF-backed time-boxed exception review, and blanket policy suppression for the affected category.",
+      "candidate_ids": [
+        "sarif-backed-timeboxed-exception-review",
+        "blanket-policy-suppression"
+      ]
+    },
+    {
+      "phase": "verification",
+      "summary": "Accepted the evidence-backed exception review because it preserves finding details and lifecycle fields. Rejected blanket suppression because it bypasses policy gates and ignores expired exceptions.",
+      "evidence": [
+        "examples/evaluator-rag-prototype/agentshield-policy-exception/verifier-result.json"
+      ]
+    },
+    {
+      "phase": "promotion",
+      "summary": "Promoted only the read-only AgentShield policy exception playbook. The evaluator does not modify AgentShield code, policy packs, enforcement settings, release state, or live security posture.",
+      "promoted_candidate_id": "sarif-backed-timeboxed-exception-review"
+    }
+  ]
+}
diff --git a/examples/evaluator-rag-prototype/agentshield-policy-exception/verifier-result.json b/examples/evaluator-rag-prototype/agentshield-policy-exception/verifier-result.json
new file mode 100644
index 00000000..bc28e1eb
--- /dev/null
+++ b/examples/evaluator-rag-prototype/agentshield-policy-exception/verifier-result.json
@@ -0,0 +1,35 @@
+{
+  "schema_version": "ecc.evaluator-rag.verifier.v1",
+  "scenario_id": "agentshield-policy-exception",
+  "run_id": "2026-05-12-agentshield-policy-exception-prototype",
+  "read_only": true,
+  "candidates": [
+    {
+      "candidate_id": "sarif-backed-timeboxed-exception-review",
+      "decision": "accepted",
+      "score": 0.93,
+      "reasons": [
+        "names SARIF/code-scanning or report evidence for the AgentShield finding",
+        "preserves finding id, category, severity, affected surface, and policy-pack source",
+        "records owner, ticket, scope, expiry, and active/expiring/expired lifecycle state",
+        "rejects expired exceptions and requires remediation or a time-boxed exception",
+        "keeps AgentShield code, policy packs, enforcement settings, and release actions out of the read-only evaluator run"
+      ],
+      "rollback": "Do not apply the future exception or suppression; re-run AgentShield, restore the prior organization policy, and keep the finding enforced until owner/ticket/scope/expiry evidence is current."
+    },
+    {
+      "candidate_id": "blanket-policy-suppression",
+      "decision": "rejected",
+      "score": 0.11,
+      "reasons": [
+        "has no SARIF or report evidence",
+        "blanket-suppresses AgentShield policy packs and organization-policy gates",
+        "treats expired exceptions as active",
+        "drops owner, ticket, scope, and expiry fields",
+        "would edit AgentShield or policy gate behavior from an ECC evaluator run"
+      ],
+      "rollback": "Do not suppress the policy category; restart from scanner evidence, lifecycle state, and a bounded remediation or exception request."
+    }
+  ],
+  "promoted_candidate_id": "sarif-backed-timeboxed-exception-review"
+}
diff --git a/tests/docs/evaluator-rag-prototype.test.js b/tests/docs/evaluator-rag-prototype.test.js
index 9714d10f..6ecc6c6a 100644
--- a/tests/docs/evaluator-rag-prototype.test.js
+++ b/tests/docs/evaluator-rag-prototype.test.js
@@ -135,7 +135,7 @@ test('roadmap points to the evaluator RAG prototype and keeps broader corpus wor
 
   assert.ok(roadmap.includes('docs/architecture/evaluator-rag-prototype.md'));
   assert.ok(roadmap.includes('examples/evaluator-rag-prototype/'));
-  assert.ok(roadmap.includes('Needs AgentShield policy exception corpus'));
+  assert.ok(roadmap.includes('Needs skill-quality and deep-analyzer corpus'));
 });
 
 test('billing readiness scenario rejects launch copy overclaims', () => {
@@ -267,6 +267,53 @@ test('harness config quality scenario rejects unsupported parity claims', () =>
   assert.ok(playbook.includes('node tests/docs/mcp-management-docs.test.js'));
 });
 
+test('AgentShield policy exception scenario rejects blanket suppression', () => {
+  const scenario = readFixtureJson('agentshield-policy-exception/scenario.json');
+  const trace = readFixtureJson('agentshield-policy-exception/trace.json');
+  const report = readFixtureJson('agentshield-policy-exception/report.json');
+  const verifier = readFixtureJson('agentshield-policy-exception/verifier-result.json');
+  const playbook = read('examples/evaluator-rag-prototype/agentshield-policy-exception/candidate-playbook.md');
+
+  assert.strictEqual(scenario.scenario_id, 'agentshield-policy-exception');
+  assert.strictEqual(trace.scenario_id, scenario.scenario_id);
+  assert.strictEqual(report.scenario_id, scenario.scenario_id);
+  assert.strictEqual(verifier.scenario_id, scenario.scenario_id);
+  assert.strictEqual(trace.read_only, true);
+  assert.strictEqual(report.read_only, true);
+  assert.strictEqual(verifier.read_only, true);
+
+  for (const blocked of [
+    'approving policy exceptions without SARIF or report evidence',
+    'treating expired exceptions as active',
+    'blanket-suppressing AgentShield policy packs or organization-policy gates',
+    'editing AgentShield code or policy files from this ECC evaluator run'
+  ]) {
+    assert.ok(scenario.forbidden_actions.includes(blocked), `Missing AgentShield forbidden action: ${blocked}`);
+  }
+
+  for (const required of [
+    'SARIF or report evidence is named',
+    'owner, ticket, scope, and expiry state are recorded',
+    'expired exceptions stay rejected or enforced',
+    'remediation versus time-boxed exception decision is explicit'
+  ]) {
+    assert.ok(scenario.acceptance_gates.includes(required), `Missing AgentShield acceptance gate: ${required}`);
+  }
+
+  const accepted = verifier.candidates.find(candidate => candidate.candidate_id === 'sarif-backed-timeboxed-exception-review');
+  const rejected = verifier.candidates.find(candidate => candidate.candidate_id === 'blanket-policy-suppression');
+
+  assert.ok(accepted, 'Missing accepted AgentShield exception candidate');
+  assert.ok(rejected, 'Missing rejected blanket suppression candidate');
+  assert.strictEqual(accepted.decision, 'accepted');
+  assert.strictEqual(rejected.decision, 'rejected');
+  assert.strictEqual(verifier.promoted_candidate_id, accepted.candidate_id);
+  assert.ok(rejected.reasons.join('\n').includes('blanket-suppresses'));
+  assert.ok(playbook.includes('agentshield-policy/*'));
+  assert.ok(playbook.includes('owner, ticket, scope, expiry'));
+  assert.ok(playbook.includes('npx ecc-agentshield scan --format json'));
+});
+
 if (failed > 0) {
   console.log(`\nFailed: ${failed}`);
   process.exit(1);