docs: add deep-analyzer evaluator scenario

This commit is contained in:
Affaan Mustafa 2026-05-12 18:43:28 -04:00 committed by Affaan Mustafa
parent 337ced0828
commit 37c27a60fd
8 changed files with 297 additions and 12 deletions

View File

@ -201,7 +201,7 @@ is not complete unless the evidence column exists and has been freshly verified.
| AgentShield enterprise iteration | Policy gates, SARIF, packs, provenance, corpus, HTML reports, exception lifecycle audit | PRs #53, #55-#62 landed with test evidence | Needs PDF/export decision or next enterprise signal | | AgentShield enterprise iteration | Policy gates, SARIF, packs, provenance, corpus, HTML reports, exception lifecycle audit | PRs #53, #55-#62 landed with test evidence | Needs PDF/export decision or next enterprise signal |
| ECC Tools next-level app | Billing audit, PR checks, deep analyzer, sync backlog | PRs #26-#39 landed with test evidence | Needs capacity-backed Linear rollout / broader evaluator corpus | | ECC Tools next-level app | Billing audit, PR checks, deep analyzer, sync backlog | PRs #26-#39 landed with test evidence | Needs capacity-backed Linear rollout / broader evaluator corpus |
| GitGuardian/Dependabot/CodeRabbit-style checks | Non-blocking taxonomy and deterministic follow-up checks | ECC-Tools risk taxonomy check plus follow-up signals landed, including Skill Quality, Deep Analyzer Evidence, Analyzer Corpus Evidence, RAG/Evaluator Evidence, and PR Review/Salvage Evidence | Partially complete | | GitGuardian/Dependabot/CodeRabbit-style checks | Non-blocking taxonomy and deterministic follow-up checks | ECC-Tools risk taxonomy check plus follow-up signals landed, including Skill Quality, Deep Analyzer Evidence, Analyzer Corpus Evidence, RAG/Evaluator Evidence, and PR Review/Salvage Evidence | Partially complete |
| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define read-only stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, AgentShield policy-exception, and skill-quality evidence scenarios with trace, report, playbook, and verifier result artifacts | Needs deep-analyzer corpus | | Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define read-only stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, AgentShield policy-exception, skill-quality evidence, and deep-analyzer evidence scenarios with trace, report, playbook, and verifier result artifacts | Local corpus complete; hosted integration remains future |
| Linear roadmap is detailed | Linear project status plus repo mirror | Repo mirror exists; issue creation was retried on 2026-05-12 and remains blocked by the workspace free issue limit | Needs recurring status updates after each merge batch | | Linear roadmap is detailed | Linear project status plus repo mirror | Repo mirror exists; issue creation was retried on 2026-05-12 and remains blocked by the workspace free issue limit | Needs recurring status updates after each merge batch |
| Flow separation and progress tracking | Flow lanes with owner artifacts and update cadence | This roadmap defines lanes below | Active | | Flow separation and progress tracking | Flow lanes with owner artifacts and update cadence | This roadmap defines lanes below | Active |
| Realtime Linear sync | Project updates while issue limit is blocked; issues later | ECC-Tools #39 implements opt-in Linear API sync for deferred follow-up backlog items | Needs workspace capacity/config rollout | | Realtime Linear sync | Project updates while issue limit is blocked; issues later | ECC-Tools #39 implements opt-in Linear API sync for deferred follow-up backlog items | Needs workspace capacity/config rollout |
@ -220,7 +220,7 @@ back to the repo evidence and merge commits.
| Queue hygiene and salvage | GitHub PR/issue state, salvage ledger | Append ledger entries for any future stale closures | Every cleanup batch | | Queue hygiene and salvage | GitHub PR/issue state, salvage ledger | Append ledger entries for any future stale closures | Every cleanup batch |
| Release and publication | rc.1 release docs, publication readiness doc | Naming matrix and plugin submission/contact checklist | Before any tag | | Release and publication | rc.1 release docs, publication readiness doc | Naming matrix and plugin submission/contact checklist | Before any tag |
| Harness OS core | Audit, adapter matrix, observability docs, `ecc2/` | HUD/session-control acceptance spec | Weekly until GA | | Harness OS core | Audit, adapter matrix, observability docs, `ecc2/` | HUD/session-control acceptance spec | Weekly until GA |
| Evaluation and RAG | Reference-set validation, harness audit, traces | Read-only evaluator/RAG prototype plus stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, AgentShield policy-exception, and skill-quality evidence fixtures | Expand to deep-analyzer evidence scenario | | Evaluation and RAG | Reference-set validation, harness audit, traces | Read-only evaluator/RAG prototype plus stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, AgentShield policy-exception, skill-quality evidence, and deep-analyzer evidence fixtures | Use as fixture contract before hosted retrieval/check-run automation |
| AgentShield enterprise | AgentShield PR evidence and roadmap notes | PDF-export decision or next enterprise signal | After value decision | | AgentShield enterprise | AgentShield PR evidence and roadmap notes | PDF-export decision or next enterprise signal | After value decision |
| ECC Tools app | ECC-Tools PR evidence, billing audit, risk taxonomy | Capacity-backed Linear rollout or broader evaluator/RAG corpus slice | Next implementation batch | | ECC Tools app | ECC-Tools PR evidence, billing audit, risk taxonomy | Capacity-backed Linear rollout or broader evaluator/RAG corpus slice | Next implementation batch |
| Linear progress | Linear project status updates and this mirror | Status update with queue/evidence/missing gates | Every significant merge batch | | Linear progress | Linear project status updates and this mirror | Status update with queue/evidence/missing gates | Every significant merge batch |
@ -419,6 +419,6 @@ Acceptance:
executive report, corpus benchmark output, and exception lifecycle audit. executive report, corpus benchmark output, and exception lifecycle audit.
2. Enable/configure the merged Linear backlog sync path after workspace issue 2. Enable/configure the merged Linear backlog sync path after workspace issue
capacity clears or the Linear workspace is upgraded. capacity clears or the Linear workspace is upgraded.
3. Expand the evaluator/RAG corpus beyond stale-salvage, billing, CI, 3. Consume the local evaluator/RAG corpus from ECC Tools before adding hosted
harness-config, AgentShield policy-exception, and skill-quality evidence retrieval, vector storage, model-backed judging, or automated check-run
prototypes toward deep-analyzer evidence scenarios. promotion.

View File

@ -19,7 +19,9 @@ exception scenario gates security exceptions on SARIF/report evidence, owner
fields, expiry state, and remediation-versus-exception decisions. A fields, expiry state, and remediation-versus-exception decisions. A
skill-quality evidence scenario requires observed failure or feedback evidence, skill-quality evidence scenario requires observed failure or feedback evidence,
working examples, reference-set gaps, and validation commands before a skill working examples, reference-set gaps, and validation commands before a skill
amendment can be promoted. amendment can be promoted. A deep-analyzer evidence scenario requires analyzer
corpus cases, expected-output comparisons, and risk-taxonomy proof before
repository or commit-analysis behavior can change.
## Reference Pressure ## Reference Pressure
@ -116,6 +118,9 @@ Current corpus:
- `skill-quality-evidence`: requires focused skill scope, observed failure or - `skill-quality-evidence`: requires focused skill scope, observed failure or
user-feedback evidence, examples/reference-set coverage, validation commands, user-feedback evidence, examples/reference-set coverage, validation commands,
and publication safety before a skill amendment can be promoted. and publication safety before a skill amendment can be promoted.
- `deep-analyzer-evidence`: requires maintained analyzer corpus cases,
expected-output comparisons, representative repository/commit histories, and
regression commands before deep-analysis behavior can be promoted.
## ECC Tools Mapping ## ECC Tools Mapping
@ -147,7 +152,7 @@ A candidate can be promoted only when:
## Next Expansion ## Next Expansion
The next evaluator/RAG corpus should add: The local evaluator/RAG corpus now covers the current evidence buckets. Future
work should consume these fixtures from ECC Tools before adding hosted
- a deep-analyzer evidence scenario with maintained reference sets and rejected retrieval, vector storage, model-backed judging, or automated check-run
low-evidence candidates. promotion.

View File

@ -0,0 +1,60 @@
# Deep Analyzer Evidence Playbook
Candidate id: `corpus-backed-analyzer-change`
Use this playbook when a PR changes repository analysis, commit analysis,
architecture classification, workflow detection, pattern detection, or
deep-analysis risk-taxonomy behavior.
## Accepted Path
1. Name the changed analyzer surface and source file.
2. Retrieve the Deep Analyzer Evidence contract from `../ECC-Tools/README.md`
and the follow-up logic in `../ECC-Tools/src/lib/analyzer.ts`.
3. Match the change to maintained corpus or reference evidence:
- `../ECC-Tools/src/analyzers/fixtures/deep-analyzer-corpus.ts`
- `../ECC-Tools/src/analyzers/deep-analyzer-corpus.test.ts`
- `../ECC-Tools/src/lib/analyzer.compare.test.ts`
4. Compare expected outputs for the affected behavior:
- folder type;
- module organization;
- test location;
- primary language;
- commit message type;
- detected workflow names.
5. Add or update analyzer corpus, expected-output snapshots, fixtures,
benchmarks, golden cases, evals, or reference sets for the same changed
surface.
6. Run the relevant validation gate from `../ECC-Tools/`:
- `npm test -- src/analyzers/deep-analyzer-corpus.test.ts src/lib/analyzer.compare.test.ts`
- `npm run typecheck`
- `npm run lint`
7. Record the corpus case, expected-output comparison, validation output, and
rollback notes in the maintainer PR body or handoff.
## Rejected Path
Do not promote analyzer threshold, classification, or risk-taxonomy changes
without corpus, snapshot, fixture, benchmark, golden, eval, or reference-set
evidence.
Do not suppress the `Deep Analyzer Evidence` PR-risk bucket just because the
change is small. Suppress it only when co-located evidence covers the same
analyzer surface.
Do not rely only on broad manual review notes. Analyzer changes need
representative repository shapes or commit-history cases with expected outputs.
Do not post PR comments, create check runs, sync Linear, publish packages, edit
plugins, or create release artifacts from the evaluator run.
## Minimum Validation
- `npm test -- src/analyzers/deep-analyzer-corpus.test.ts src/lib/analyzer.compare.test.ts`
- `npm run typecheck`
- `npm run lint`
- `git diff --check`
- Markdown lint when docs or playbooks are touched
Preserve source attribution for analyzer evidence and include rollback guidance
for the future maintainer PR.

View File

@ -0,0 +1,35 @@
{
"schema_version": "ecc.evaluator-rag.report.v1",
"scenario_id": "deep-analyzer-evidence",
"run_id": "2026-05-12-deep-analyzer-evidence-prototype",
"result": "prototype_passed",
"read_only": true,
"scores": {
"corpus_retrieval": 0.95,
"expected_output_comparison": 0.91,
"representative_case_coverage": 0.89,
"taxonomy_gap_safety": 0.93,
"publication_safety": 1
},
"findings": [
{
"id": "corpus-required",
"severity": "warning",
"summary": "Deep-analysis behavior changes need maintained corpus, snapshot, fixture, benchmark, golden, eval, or reference-set evidence before promotion."
},
{
"id": "expected-output-required",
"severity": "warning",
"summary": "Analyzer changes should compare expected folder type, module organization, test location, primary language, commit pattern, or workflow outputs."
},
{
"id": "read-only-routing",
"severity": "info",
"summary": "The evaluator can recommend a maintainer PR but cannot post PR comments, check runs, Linear sync updates, packages, plugins, or release actions itself."
}
],
"recommended_next_action": {
"candidate_id": "corpus-backed-analyzer-change",
"action": "Use the promoted deep-analyzer evidence playbook for PRs that change repository, commit, architecture, workflow, pattern, or risk-taxonomy analysis behavior."
}
}

View File

@ -0,0 +1,57 @@
{
"schema_version": "ecc.evaluator-rag.scenario.v1",
"scenario_id": "deep-analyzer-evidence",
"title": "Require analyzer corpus evidence before promoting deep-analysis changes",
"mode": "read_only_prototype",
"objective": "Given a change to repository, commit, architecture, pattern, or deep-analysis logic, retrieve maintained analyzer corpus evidence and expected-output comparisons before promoting analyzer behavior or risk-taxonomy changes.",
"sources": [
{
"kind": "sibling_repo_doc",
"path": "../ECC-Tools/README.md",
"purpose": "Public description of deep-analyzer predictive follow-ups and the Deep Analyzer Evidence PR-risk bucket"
},
{
"kind": "sibling_repo_source",
"path": "../ECC-Tools/src/lib/analyzer.ts",
"purpose": "Predictive follow-up logic that flags analyzer changes without corpus, snapshot, fixture, or benchmark evidence"
},
{
"kind": "sibling_repo_source",
"path": "../ECC-Tools/src/lib/pr-risk-taxonomy.ts",
"purpose": "Non-blocking PR-risk taxonomy bucket for deep-analyzer evidence"
},
{
"kind": "sibling_repo_fixture",
"path": "../ECC-Tools/src/analyzers/fixtures/deep-analyzer-corpus.ts",
"purpose": "Maintained corpus cases for representative repository shapes, commit histories, and expected analyzer outputs"
},
{
"kind": "sibling_repo_test",
"command": "npm test -- src/analyzers/deep-analyzer-corpus.test.ts src/lib/analyzer.compare.test.ts",
"purpose": "Regression evidence for analyzer corpus outputs and deep-analyzer follow-up generation"
}
],
"retrieval_questions": [
"Which analyzer surface changed: repository structure, architecture, code style, commit messages, workflow detection, pattern detection, or risk taxonomy?",
"Which maintained corpus case or reference set covers the same analyzer behavior?",
"Do expected outputs compare folder type, module organization, test location, primary language, commit type, and workflow names?",
"Does the PR add analyzer corpus, snapshot, fixture, benchmark, golden, eval, or reference-set evidence alongside analyzer code changes?",
"Does the evaluator keep PR comments, check runs, Linear sync, package changes, and publication actions out of the read-only pass?"
],
"forbidden_actions": [
"promoting repository, commit, architecture, or deep-analysis changes without analyzer corpus evidence",
"suppressing the Deep Analyzer Evidence risk bucket without co-located corpus, snapshot, fixture, or benchmark evidence",
"changing analyzer thresholds or classifications without expected-output comparison",
"relying only on broad manual review notes instead of representative repository and commit-history cases",
"posting PR comments, check runs, or Linear sync updates from this read-only evaluator run",
"changing package, plugin, release, or publication state from this evaluator run"
],
"acceptance_gates": [
"changed analyzer surface is named",
"maintained corpus or reference-set path is included",
"expected analyzer outputs are compared",
"representative repository shape or commit history is described",
"regression command is named",
"at least one no-corpus analyzer change is rejected"
]
}

View File

@ -0,0 +1,45 @@
{
"schema_version": "ecc.evaluator-rag.trace.v1",
"scenario_id": "deep-analyzer-evidence",
"run_id": "2026-05-12-deep-analyzer-evidence-prototype",
"read_only": true,
"events": [
{
"phase": "observation",
"summary": "A deep-analysis PR changes repository, commit, architecture, workflow, pattern, or risk-taxonomy behavior. The evaluator records the touched analyzer surface and remains read-only.",
"evidence": [
"../ECC-Tools/src/lib/analyzer.ts",
"../ECC-Tools/src/lib/pr-risk-taxonomy.ts"
]
},
{
"phase": "retrieval",
"summary": "Retrieved the maintained analyzer corpus, corpus regression test, and follow-up tests that distinguish corpus-backed analyzer changes from no-evidence analyzer rewrites.",
"evidence": [
"../ECC-Tools/src/analyzers/fixtures/deep-analyzer-corpus.ts",
"../ECC-Tools/src/analyzers/deep-analyzer-corpus.test.ts",
"../ECC-Tools/src/lib/analyzer.compare.test.ts"
]
},
{
"phase": "proposal",
"summary": "Generated two candidate playbooks: corpus-backed analyzer change, and threshold-only analyzer rewrite without expected-output evidence.",
"candidate_ids": [
"corpus-backed-analyzer-change",
"threshold-only-analyzer-rewrite"
]
},
{
"phase": "verification",
"summary": "Accepted the corpus-backed analyzer change because it names representative repository/commit cases and expected-output comparisons. Rejected the threshold-only rewrite because it lacks corpus or benchmark evidence.",
"evidence": [
"examples/evaluator-rag-prototype/deep-analyzer-evidence/verifier-result.json"
]
},
{
"phase": "promotion",
"summary": "Promoted only the read-only deep-analyzer evidence playbook. Future analyzer edits must move through maintainer PRs with corpus evidence, regression commands, and rollback notes.",
"promoted_candidate_id": "corpus-backed-analyzer-change"
}
]
}

View File

@ -0,0 +1,35 @@
{
"schema_version": "ecc.evaluator-rag.verifier.v1",
"scenario_id": "deep-analyzer-evidence",
"run_id": "2026-05-12-deep-analyzer-evidence-prototype",
"read_only": true,
"candidates": [
{
"candidate_id": "corpus-backed-analyzer-change",
"decision": "accepted",
"score": 0.92,
"reasons": [
"names the changed analyzer surface and matching maintained corpus case",
"compares expected analyzer outputs for representative repository and commit-history inputs",
"keeps Deep Analyzer Evidence taxonomy behavior tied to co-located corpus or benchmark evidence",
"names the regression command that exercises corpus and follow-up behavior",
"keeps PR comments, check runs, Linear sync, and publication actions out of the evaluator run"
],
"rollback": "Revert the future analyzer PR and restore the prior corpus expectations; no hosted check-run, Linear, package, or publication state changes in this read-only playbook."
},
{
"candidate_id": "threshold-only-analyzer-rewrite",
"decision": "rejected",
"score": 0.13,
"reasons": [
"changes analyzer thresholds without corpus evidence",
"does not compare expected outputs against representative repository or commit-history cases",
"does not update analyzer corpus, snapshot, fixture, benchmark, golden, eval, or reference-set artifacts",
"would suppress Deep Analyzer Evidence risk without proof",
"does not name a regression command"
],
"rollback": "Do not promote this analyzer rewrite; restart from maintained corpus inputs, expected-output snapshots, and a focused maintainer PR."
}
],
"promoted_candidate_id": "corpus-backed-analyzer-change"
}

View File

@ -130,12 +130,12 @@ test('candidate playbook preserves stale-salvage operating rules', () => {
} }
}); });
test('roadmap points to the evaluator RAG prototype and keeps broader corpus work open', () => { test('roadmap points to the evaluator RAG prototype and keeps hosted integration open', () => {
const roadmap = read('docs/ECC-2.0-GA-ROADMAP.md'); const roadmap = read('docs/ECC-2.0-GA-ROADMAP.md');
assert.ok(roadmap.includes('docs/architecture/evaluator-rag-prototype.md')); assert.ok(roadmap.includes('docs/architecture/evaluator-rag-prototype.md'));
assert.ok(roadmap.includes('examples/evaluator-rag-prototype/')); assert.ok(roadmap.includes('examples/evaluator-rag-prototype/'));
assert.ok(roadmap.includes('Needs deep-analyzer corpus')); assert.ok(roadmap.includes('Local corpus complete; hosted integration remains future'));
}); });
test('billing readiness scenario rejects launch copy overclaims', () => { test('billing readiness scenario rejects launch copy overclaims', () => {
@ -361,6 +361,54 @@ test('skill quality evidence scenario rejects vague rewrites', () => {
assert.ok(playbook.includes('observed skill-run failure')); assert.ok(playbook.includes('observed skill-run failure'));
}); });
test('deep analyzer evidence scenario rejects no-corpus analyzer changes', () => {
const scenario = readFixtureJson('deep-analyzer-evidence/scenario.json');
const trace = readFixtureJson('deep-analyzer-evidence/trace.json');
const report = readFixtureJson('deep-analyzer-evidence/report.json');
const verifier = readFixtureJson('deep-analyzer-evidence/verifier-result.json');
const playbook = read('examples/evaluator-rag-prototype/deep-analyzer-evidence/candidate-playbook.md');
assert.strictEqual(scenario.scenario_id, 'deep-analyzer-evidence');
assert.strictEqual(trace.scenario_id, scenario.scenario_id);
assert.strictEqual(report.scenario_id, scenario.scenario_id);
assert.strictEqual(verifier.scenario_id, scenario.scenario_id);
assert.strictEqual(trace.read_only, true);
assert.strictEqual(report.read_only, true);
assert.strictEqual(verifier.read_only, true);
for (const blocked of [
'promoting repository, commit, architecture, or deep-analysis changes without analyzer corpus evidence',
'suppressing the Deep Analyzer Evidence risk bucket without co-located corpus, snapshot, fixture, or benchmark evidence',
'changing analyzer thresholds or classifications without expected-output comparison',
'posting PR comments, check runs, or Linear sync updates from this read-only evaluator run'
]) {
assert.ok(scenario.forbidden_actions.includes(blocked), `Missing deep-analyzer forbidden action: ${blocked}`);
}
for (const required of [
'changed analyzer surface is named',
'maintained corpus or reference-set path is included',
'expected analyzer outputs are compared',
'representative repository shape or commit history is described',
'regression command is named'
]) {
assert.ok(scenario.acceptance_gates.includes(required), `Missing deep-analyzer acceptance gate: ${required}`);
}
const accepted = verifier.candidates.find(candidate => candidate.candidate_id === 'corpus-backed-analyzer-change');
const rejected = verifier.candidates.find(candidate => candidate.candidate_id === 'threshold-only-analyzer-rewrite');
assert.ok(accepted, 'Missing accepted deep-analyzer candidate');
assert.ok(rejected, 'Missing rejected threshold-only analyzer candidate');
assert.strictEqual(accepted.decision, 'accepted');
assert.strictEqual(rejected.decision, 'rejected');
assert.strictEqual(verifier.promoted_candidate_id, accepted.candidate_id);
assert.ok(rejected.reasons.join('\n').includes('does not compare expected outputs'));
assert.ok(playbook.includes('../ECC-Tools/src/analyzers/fixtures/deep-analyzer-corpus.ts'));
assert.ok(playbook.includes('npm test -- src/analyzers/deep-analyzer-corpus.test.ts src/lib/analyzer.compare.test.ts'));
assert.ok(playbook.includes('Deep Analyzer Evidence'));
});
if (failed > 0) { if (failed > 0) {
console.log(`\nFailed: ${failed}`); console.log(`\nFailed: ${failed}`);
process.exit(1); process.exit(1);