mirror of https://github.com/affaan-m/everything-claude-code.git (synced 2026-05-13 18:00:35 +08:00)
docs: add deep-analyzer evaluator scenario
parent 337ced0828
commit 37c27a60fd
@@ -201,7 +201,7 @@ is not complete unless the evidence column exists and has been freshly verified.
 | AgentShield enterprise iteration | Policy gates, SARIF, packs, provenance, corpus, HTML reports, exception lifecycle audit | PRs #53, #55-#62 landed with test evidence | Needs PDF/export decision or next enterprise signal |
 | ECC Tools next-level app | Billing audit, PR checks, deep analyzer, sync backlog | PRs #26-#39 landed with test evidence | Needs capacity-backed Linear rollout / broader evaluator corpus |
 | GitGuardian/Dependabot/CodeRabbit-style checks | Non-blocking taxonomy and deterministic follow-up checks | ECC-Tools risk taxonomy check plus follow-up signals landed, including Skill Quality, Deep Analyzer Evidence, Analyzer Corpus Evidence, RAG/Evaluator Evidence, and PR Review/Salvage Evidence | Partially complete |
-| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define read-only stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, AgentShield policy-exception, and skill-quality evidence scenarios with trace, report, playbook, and verifier result artifacts | Needs deep-analyzer corpus |
+| Harness-agnostic learning system | Audit, adapter matrix, observability, traces, promotion loop | Audit/adapters/observability gates plus `docs/architecture/evaluator-rag-prototype.md` and `examples/evaluator-rag-prototype/` define read-only stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, AgentShield policy-exception, skill-quality evidence, and deep-analyzer evidence scenarios with trace, report, playbook, and verifier result artifacts | Local corpus complete; hosted integration remains future |
 | Linear roadmap is detailed | Linear project status plus repo mirror | Repo mirror exists; issue creation was retried on 2026-05-12 and remains blocked by the workspace free issue limit | Needs recurring status updates after each merge batch |
 | Flow separation and progress tracking | Flow lanes with owner artifacts and update cadence | This roadmap defines lanes below | Active |
 | Realtime Linear sync | Project updates while issue limit is blocked; issues later | ECC-Tools #39 implements opt-in Linear API sync for deferred follow-up backlog items | Needs workspace capacity/config rollout |
@@ -220,7 +220,7 @@ back to the repo evidence and merge commits.
 | Queue hygiene and salvage | GitHub PR/issue state, salvage ledger | Append ledger entries for any future stale closures | Every cleanup batch |
 | Release and publication | rc.1 release docs, publication readiness doc | Naming matrix and plugin submission/contact checklist | Before any tag |
 | Harness OS core | Audit, adapter matrix, observability docs, `ecc2/` | HUD/session-control acceptance spec | Weekly until GA |
-| Evaluation and RAG | Reference-set validation, harness audit, traces | Read-only evaluator/RAG prototype plus stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, AgentShield policy-exception, and skill-quality evidence fixtures | Expand to deep-analyzer evidence scenario |
+| Evaluation and RAG | Reference-set validation, harness audit, traces | Read-only evaluator/RAG prototype plus stale-salvage, billing-readiness, CI-failure-diagnosis, harness-config-quality, AgentShield policy-exception, skill-quality evidence, and deep-analyzer evidence fixtures | Use as fixture contract before hosted retrieval/check-run automation |
 | AgentShield enterprise | AgentShield PR evidence and roadmap notes | PDF-export decision or next enterprise signal | After value decision |
 | ECC Tools app | ECC-Tools PR evidence, billing audit, risk taxonomy | Capacity-backed Linear rollout or broader evaluator/RAG corpus slice | Next implementation batch |
 | Linear progress | Linear project status updates and this mirror | Status update with queue/evidence/missing gates | Every significant merge batch |
@@ -419,6 +419,6 @@ Acceptance:
    executive report, corpus benchmark output, and exception lifecycle audit.
 2. Enable/configure the merged Linear backlog sync path after workspace issue
    capacity clears or the Linear workspace is upgraded.
-3. Expand the evaluator/RAG corpus beyond stale-salvage, billing, CI,
-   harness-config, AgentShield policy-exception, and skill-quality evidence
-   prototypes toward deep-analyzer evidence scenarios.
+3. Consume the local evaluator/RAG corpus from ECC Tools before adding hosted
+   retrieval, vector storage, model-backed judging, or automated check-run
+   promotion.
@@ -19,7 +19,9 @@ exception scenario gates security exceptions on SARIF/report evidence, owner
 fields, expiry state, and remediation-versus-exception decisions. A
 skill-quality evidence scenario requires observed failure or feedback evidence,
 working examples, reference-set gaps, and validation commands before a skill
-amendment can be promoted.
+amendment can be promoted. A deep-analyzer evidence scenario requires analyzer
+corpus cases, expected-output comparisons, and risk-taxonomy proof before
+repository or commit-analysis behavior can change.
 
 ## Reference Pressure
 
@@ -116,6 +118,9 @@ Current corpus:
 - `skill-quality-evidence`: requires focused skill scope, observed failure or
   user-feedback evidence, examples/reference-set coverage, validation commands,
   and publication safety before a skill amendment can be promoted.
+- `deep-analyzer-evidence`: requires maintained analyzer corpus cases,
+  expected-output comparisons, representative repository/commit histories, and
+  regression commands before deep-analysis behavior can be promoted.
 
 ## ECC Tools Mapping
 
@@ -147,7 +152,7 @@ A candidate can be promoted only when:
 
 ## Next Expansion
 
-The next evaluator/RAG corpus should add:
-
-- a deep-analyzer evidence scenario with maintained reference sets and rejected
-  low-evidence candidates.
+The local evaluator/RAG corpus now covers the current evidence buckets. Future
+work should consume these fixtures from ECC Tools before adding hosted
+retrieval, vector storage, model-backed judging, or automated check-run
+promotion.
@@ -0,0 +1,60 @@
+# Deep Analyzer Evidence Playbook
+
+Candidate id: `corpus-backed-analyzer-change`
+
+Use this playbook when a PR changes repository analysis, commit analysis,
+architecture classification, workflow detection, pattern detection, or
+deep-analysis risk-taxonomy behavior.
+
+## Accepted Path
+
+1. Name the changed analyzer surface and source file.
+2. Retrieve the Deep Analyzer Evidence contract from `../ECC-Tools/README.md`
+   and the follow-up logic in `../ECC-Tools/src/lib/analyzer.ts`.
+3. Match the change to maintained corpus or reference evidence:
+   - `../ECC-Tools/src/analyzers/fixtures/deep-analyzer-corpus.ts`
+   - `../ECC-Tools/src/analyzers/deep-analyzer-corpus.test.ts`
+   - `../ECC-Tools/src/lib/analyzer.compare.test.ts`
+4. Compare expected outputs for the affected behavior:
+   - folder type;
+   - module organization;
+   - test location;
+   - primary language;
+   - commit message type;
+   - detected workflow names.
+5. Add or update analyzer corpus, expected-output snapshots, fixtures,
+   benchmarks, golden cases, evals, or reference sets for the same changed
+   surface.
+6. Run the relevant validation gate from `../ECC-Tools/`:
+   - `npm test -- src/analyzers/deep-analyzer-corpus.test.ts src/lib/analyzer.compare.test.ts`
+   - `npm run typecheck`
+   - `npm run lint`
+7. Record the corpus case, expected-output comparison, validation output, and
+   rollback notes in the maintainer PR body or handoff.
+
+## Rejected Path
+
+Do not promote analyzer threshold, classification, or risk-taxonomy changes
+without corpus, snapshot, fixture, benchmark, golden, eval, or reference-set
+evidence.
+
+Do not suppress the `Deep Analyzer Evidence` PR-risk bucket just because the
+change is small. Suppress it only when co-located evidence covers the same
+analyzer surface.
+
+Do not rely only on broad manual review notes. Analyzer changes need
+representative repository shapes or commit-history cases with expected outputs.
+
+Do not post PR comments, create check runs, sync Linear, publish packages, edit
+plugins, or create release artifacts from the evaluator run.
+
+## Minimum Validation
+
+- `npm test -- src/analyzers/deep-analyzer-corpus.test.ts src/lib/analyzer.compare.test.ts`
+- `npm run typecheck`
+- `npm run lint`
+- `git diff --check`
+- Markdown lint when docs or playbooks are touched
+
+Preserve source attribution for analyzer evidence and include rollback guidance
+for the future maintainer PR.
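Reviewer sketch (not part of the commit): the playbook's suppression rule, "suppress the bucket only when co-located evidence covers the same analyzer surface", could be approximated by a check over the changed file list. The function names and the path conventions (`src/analyzers/`, `src/lib/analyzer*`) are assumptions made for illustration; the real follow-up logic in `../ECC-Tools/src/lib/analyzer.ts` is not shown in this diff and may differ.

```javascript
// Hypothetical sketch: the path conventions below are assumptions for
// illustration, not the actual ECC-Tools follow-up logic.
const path = require('node:path');

// Evidence keywords taken from the playbook wording. Substring matching is a
// deliberate simplification and can over-match (for example "evaluator").
const EVIDENCE_PATTERN = /corpus|snapshot|fixture|benchmark|golden|eval|reference-set/i;

// A file counts as evidence when its name carries an evidence keyword or it
// lives in a fixtures directory.
function isEvidenceFile(file) {
  return EVIDENCE_PATTERN.test(path.basename(file)) || file.includes('/fixtures/');
}

// A file counts as analyzer source when it sits in an assumed analyzer area
// and is neither a test nor an evidence file.
function isAnalyzerSource(file) {
  const inAnalyzerArea =
    file.startsWith('src/analyzers/') || /^src\/lib\/analyzer[^/]*\.[jt]s$/.test(file);
  const isTest = /\.test\.[jt]s$/.test(file);
  return inAnalyzerArea && !isTest && !isEvidenceFile(file);
}

// True when analyzer code changed without any accompanying evidence file,
// i.e. when the risk bucket should stay raised.
function needsDeepAnalyzerEvidence(changedFiles) {
  const touchesAnalyzer = changedFiles.some(isAnalyzerSource);
  const hasEvidence = changedFiles.some(isEvidenceFile);
  return touchesAnalyzer && !hasEvidence;
}

console.log(needsDeepAnalyzerEvidence(['src/lib/analyzer.ts']));
```

This only checks that some evidence file changed alongside the analyzer code. Confirming that the evidence covers the same analyzer surface, as the playbook requires, would need a mapping from surfaces to corpus cases that this diff does not define.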
@@ -0,0 +1,35 @@
+{
+  "schema_version": "ecc.evaluator-rag.report.v1",
+  "scenario_id": "deep-analyzer-evidence",
+  "run_id": "2026-05-12-deep-analyzer-evidence-prototype",
+  "result": "prototype_passed",
+  "read_only": true,
+  "scores": {
+    "corpus_retrieval": 0.95,
+    "expected_output_comparison": 0.91,
+    "representative_case_coverage": 0.89,
+    "taxonomy_gap_safety": 0.93,
+    "publication_safety": 1
+  },
+  "findings": [
+    {
+      "id": "corpus-required",
+      "severity": "warning",
+      "summary": "Deep-analysis behavior changes need maintained corpus, snapshot, fixture, benchmark, golden, eval, or reference-set evidence before promotion."
+    },
+    {
+      "id": "expected-output-required",
+      "severity": "warning",
+      "summary": "Analyzer changes should compare expected folder type, module organization, test location, primary language, commit pattern, or workflow outputs."
+    },
+    {
+      "id": "read-only-routing",
+      "severity": "info",
+      "summary": "The evaluator can recommend a maintainer PR but cannot post PR comments, check runs, Linear sync updates, packages, plugins, or release actions itself."
+    }
+  ],
+  "recommended_next_action": {
+    "candidate_id": "corpus-backed-analyzer-change",
+    "action": "Use the promoted deep-analyzer evidence playbook for PRs that change repository, commit, architecture, workflow, pattern, or risk-taxonomy analysis behavior."
+  }
+}
@@ -0,0 +1,57 @@
+{
+  "schema_version": "ecc.evaluator-rag.scenario.v1",
+  "scenario_id": "deep-analyzer-evidence",
+  "title": "Require analyzer corpus evidence before promoting deep-analysis changes",
+  "mode": "read_only_prototype",
+  "objective": "Given a change to repository, commit, architecture, pattern, or deep-analysis logic, retrieve maintained analyzer corpus evidence and expected-output comparisons before promoting analyzer behavior or risk-taxonomy changes.",
+  "sources": [
+    {
+      "kind": "sibling_repo_doc",
+      "path": "../ECC-Tools/README.md",
+      "purpose": "Public description of deep-analyzer predictive follow-ups and the Deep Analyzer Evidence PR-risk bucket"
+    },
+    {
+      "kind": "sibling_repo_source",
+      "path": "../ECC-Tools/src/lib/analyzer.ts",
+      "purpose": "Predictive follow-up logic that flags analyzer changes without corpus, snapshot, fixture, or benchmark evidence"
+    },
+    {
+      "kind": "sibling_repo_source",
+      "path": "../ECC-Tools/src/lib/pr-risk-taxonomy.ts",
+      "purpose": "Non-blocking PR-risk taxonomy bucket for deep-analyzer evidence"
+    },
+    {
+      "kind": "sibling_repo_fixture",
+      "path": "../ECC-Tools/src/analyzers/fixtures/deep-analyzer-corpus.ts",
+      "purpose": "Maintained corpus cases for representative repository shapes, commit histories, and expected analyzer outputs"
+    },
+    {
+      "kind": "sibling_repo_test",
+      "command": "npm test -- src/analyzers/deep-analyzer-corpus.test.ts src/lib/analyzer.compare.test.ts",
+      "purpose": "Regression evidence for analyzer corpus outputs and deep-analyzer follow-up generation"
+    }
+  ],
+  "retrieval_questions": [
+    "Which analyzer surface changed: repository structure, architecture, code style, commit messages, workflow detection, pattern detection, or risk taxonomy?",
+    "Which maintained corpus case or reference set covers the same analyzer behavior?",
+    "Do expected outputs compare folder type, module organization, test location, primary language, commit type, and workflow names?",
+    "Does the PR add analyzer corpus, snapshot, fixture, benchmark, golden, eval, or reference-set evidence alongside analyzer code changes?",
+    "Does the evaluator keep PR comments, check runs, Linear sync, package changes, and publication actions out of the read-only pass?"
+  ],
+  "forbidden_actions": [
+    "promoting repository, commit, architecture, or deep-analysis changes without analyzer corpus evidence",
+    "suppressing the Deep Analyzer Evidence risk bucket without co-located corpus, snapshot, fixture, or benchmark evidence",
+    "changing analyzer thresholds or classifications without expected-output comparison",
+    "relying only on broad manual review notes instead of representative repository and commit-history cases",
+    "posting PR comments, check runs, or Linear sync updates from this read-only evaluator run",
+    "changing package, plugin, release, or publication state from this evaluator run"
+  ],
+  "acceptance_gates": [
+    "changed analyzer surface is named",
+    "maintained corpus or reference-set path is included",
+    "expected analyzer outputs are compared",
+    "representative repository shape or commit history is described",
+    "regression command is named",
+    "at least one no-corpus analyzer change is rejected"
+  ]
+}
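Reviewer sketch (not part of the commit): the scenario file above is only useful as a fixture contract if its shape stays stable. A minimal structural check, assuming the field names shown in this scenario and a hypothetical helper name `validateScenario`, might look like this. No published schema for `ecc.evaluator-rag.scenario.v1` appears in the diff, so the rules below are inferred from the file itself.

```javascript
// Hypothetical sketch: rules are inferred from the scenario.json in this diff,
// not from a published schema.
function validateScenario(scenario) {
  const problems = [];

  // The scenario in this commit declares itself a read-only prototype.
  if (scenario.mode !== 'read_only_prototype') {
    problems.push('mode must be "read_only_prototype"');
  }

  // Each list section should be present and non-empty.
  for (const key of ['sources', 'retrieval_questions', 'forbidden_actions', 'acceptance_gates']) {
    if (!Array.isArray(scenario[key]) || scenario[key].length === 0) {
      problems.push(`${key} must be a non-empty array`);
    }
  }

  // Every source names either a file path or a command to run.
  const sources = Array.isArray(scenario.sources) ? scenario.sources : [];
  for (const source of sources) {
    if (!source.path && !source.command) {
      problems.push(`source of kind "${source.kind}" needs a path or a command`);
    }
  }

  return problems;
}

const example = {
  mode: 'read_only_prototype',
  sources: [{ kind: 'sibling_repo_doc', path: '../ECC-Tools/README.md' }],
  retrieval_questions: ['Which analyzer surface changed?'],
  forbidden_actions: ['changing analyzer thresholds or classifications without expected-output comparison'],
  acceptance_gates: ['changed analyzer surface is named']
};
console.log(validateScenario(example));
```

Returning a list of problems rather than throwing keeps the check usable inside a test loop that reports every fixture defect at once.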
@@ -0,0 +1,45 @@
+{
+  "schema_version": "ecc.evaluator-rag.trace.v1",
+  "scenario_id": "deep-analyzer-evidence",
+  "run_id": "2026-05-12-deep-analyzer-evidence-prototype",
+  "read_only": true,
+  "events": [
+    {
+      "phase": "observation",
+      "summary": "A deep-analysis PR changes repository, commit, architecture, workflow, pattern, or risk-taxonomy behavior. The evaluator records the touched analyzer surface and remains read-only.",
+      "evidence": [
+        "../ECC-Tools/src/lib/analyzer.ts",
+        "../ECC-Tools/src/lib/pr-risk-taxonomy.ts"
+      ]
+    },
+    {
+      "phase": "retrieval",
+      "summary": "Retrieved the maintained analyzer corpus, corpus regression test, and follow-up tests that distinguish corpus-backed analyzer changes from no-evidence analyzer rewrites.",
+      "evidence": [
+        "../ECC-Tools/src/analyzers/fixtures/deep-analyzer-corpus.ts",
+        "../ECC-Tools/src/analyzers/deep-analyzer-corpus.test.ts",
+        "../ECC-Tools/src/lib/analyzer.compare.test.ts"
+      ]
+    },
+    {
+      "phase": "proposal",
+      "summary": "Generated two candidate playbooks: corpus-backed analyzer change, and threshold-only analyzer rewrite without expected-output evidence.",
+      "candidate_ids": [
+        "corpus-backed-analyzer-change",
+        "threshold-only-analyzer-rewrite"
+      ]
+    },
+    {
+      "phase": "verification",
+      "summary": "Accepted the corpus-backed analyzer change because it names representative repository/commit cases and expected-output comparisons. Rejected the threshold-only rewrite because it lacks corpus or benchmark evidence.",
+      "evidence": [
+        "examples/evaluator-rag-prototype/deep-analyzer-evidence/verifier-result.json"
+      ]
+    },
+    {
+      "phase": "promotion",
+      "summary": "Promoted only the read-only deep-analyzer evidence playbook. Future analyzer edits must move through maintainer PRs with corpus evidence, regression commands, and rollback notes.",
+      "promoted_candidate_id": "corpus-backed-analyzer-change"
+    }
+  ]
+}
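Reviewer sketch (not part of the commit): the trace above walks through five phases in a fixed order and promotes a candidate that was first proposed. A small consistency check over that structure could be written as follows. The helper name `checkTrace` and the assumption that every trace uses exactly these five phases are illustrative; the diff does not state that the order is enforced anywhere.

```javascript
// Hypothetical sketch: assumes every trace uses exactly the five phases seen
// in this commit's trace.json, in this order.
const EXPECTED_PHASES = ['observation', 'retrieval', 'proposal', 'verification', 'promotion'];

function checkTrace(trace) {
  const phases = trace.events.map(event => event.phase);

  // Phases must match the expected sequence exactly.
  const inOrder =
    phases.length === EXPECTED_PHASES.length &&
    phases.every((phase, index) => phase === EXPECTED_PHASES[index]);

  // The promoted candidate must be one of the candidates that was proposed.
  const proposal = trace.events.find(event => event.phase === 'proposal');
  const promotion = trace.events.find(event => event.phase === 'promotion');
  const proposedIds = proposal && Array.isArray(proposal.candidate_ids) ? proposal.candidate_ids : [];
  const promotedWasProposed = Boolean(promotion) && proposedIds.includes(promotion.promoted_candidate_id);

  return { inOrder, promotedWasProposed, readOnly: trace.read_only === true };
}

const sampleTrace = {
  read_only: true,
  events: [
    { phase: 'observation' },
    { phase: 'retrieval' },
    { phase: 'proposal', candidate_ids: ['corpus-backed-analyzer-change', 'threshold-only-analyzer-rewrite'] },
    { phase: 'verification' },
    { phase: 'promotion', promoted_candidate_id: 'corpus-backed-analyzer-change' }
  ]
};
console.log(checkTrace(sampleTrace));
```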
@@ -0,0 +1,35 @@
+{
+  "schema_version": "ecc.evaluator-rag.verifier.v1",
+  "scenario_id": "deep-analyzer-evidence",
+  "run_id": "2026-05-12-deep-analyzer-evidence-prototype",
+  "read_only": true,
+  "candidates": [
+    {
+      "candidate_id": "corpus-backed-analyzer-change",
+      "decision": "accepted",
+      "score": 0.92,
+      "reasons": [
+        "names the changed analyzer surface and matching maintained corpus case",
+        "compares expected analyzer outputs for representative repository and commit-history inputs",
+        "keeps Deep Analyzer Evidence taxonomy behavior tied to co-located corpus or benchmark evidence",
+        "names the regression command that exercises corpus and follow-up behavior",
+        "keeps PR comments, check runs, Linear sync, and publication actions out of the evaluator run"
+      ],
+      "rollback": "Revert the future analyzer PR and restore the prior corpus expectations; no hosted check-run, Linear, package, or publication state changes in this read-only playbook."
+    },
+    {
+      "candidate_id": "threshold-only-analyzer-rewrite",
+      "decision": "rejected",
+      "score": 0.13,
+      "reasons": [
+        "changes analyzer thresholds without corpus evidence",
+        "does not compare expected outputs against representative repository or commit-history cases",
+        "does not update analyzer corpus, snapshot, fixture, benchmark, golden, eval, or reference-set artifacts",
+        "would suppress Deep Analyzer Evidence risk without proof",
+        "does not name a regression command"
+      ],
+      "rollback": "Do not promote this analyzer rewrite; restart from maintained corpus inputs, expected-output snapshots, and a focused maintainer PR."
+    }
+  ],
+  "promoted_candidate_id": "corpus-backed-analyzer-change"
+}
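Reviewer sketch (not part of the commit): the verifier result pairs each candidate with a decision, reasons, and rollback text, and names one promoted candidate. The invariants that the test below asserts for this one fixture could be generalized into a reusable check. The helper name `checkVerifierResult` is hypothetical, and the rule that every candidate must carry reasons and a rollback note is inferred from this file rather than stated by the commit.

```javascript
// Hypothetical sketch: invariants inferred from verifier-result.json in this
// diff; the commit itself only asserts them for this single fixture.
function checkVerifierResult(result) {
  const errors = [];

  // The promoted id must refer to a listed candidate that was accepted.
  const promoted = result.candidates.find(
    candidate => candidate.candidate_id === result.promoted_candidate_id
  );
  if (!promoted) {
    errors.push('promoted_candidate_id does not match any candidate');
  } else if (promoted.decision !== 'accepted') {
    errors.push('promoted candidate was not accepted');
  }

  // Every candidate, accepted or rejected, explains itself and names a rollback.
  for (const candidate of result.candidates) {
    if (!Array.isArray(candidate.reasons) || candidate.reasons.length === 0) {
      errors.push(`${candidate.candidate_id}: reasons are missing`);
    }
    if (typeof candidate.rollback !== 'string' || candidate.rollback.length === 0) {
      errors.push(`${candidate.candidate_id}: rollback note is missing`);
    }
  }

  return errors;
}

const sampleResult = {
  promoted_candidate_id: 'corpus-backed-analyzer-change',
  candidates: [
    { candidate_id: 'corpus-backed-analyzer-change', decision: 'accepted', reasons: ['names the regression command'], rollback: 'Revert the future analyzer PR.' },
    { candidate_id: 'threshold-only-analyzer-rewrite', decision: 'rejected', reasons: ['does not name a regression command'], rollback: 'Do not promote this analyzer rewrite.' }
  ]
};
console.log(checkVerifierResult(sampleResult));
```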
@@ -130,12 +130,12 @@ test('candidate playbook preserves stale-salvage operating rules', () => {
   }
 });
 
-test('roadmap points to the evaluator RAG prototype and keeps broader corpus work open', () => {
+test('roadmap points to the evaluator RAG prototype and keeps hosted integration open', () => {
   const roadmap = read('docs/ECC-2.0-GA-ROADMAP.md');
 
   assert.ok(roadmap.includes('docs/architecture/evaluator-rag-prototype.md'));
   assert.ok(roadmap.includes('examples/evaluator-rag-prototype/'));
-  assert.ok(roadmap.includes('Needs deep-analyzer corpus'));
+  assert.ok(roadmap.includes('Local corpus complete; hosted integration remains future'));
 });
 
 test('billing readiness scenario rejects launch copy overclaims', () => {
@@ -361,6 +361,54 @@ test('skill quality evidence scenario rejects vague rewrites', () => {
   assert.ok(playbook.includes('observed skill-run failure'));
 });
 
+test('deep analyzer evidence scenario rejects no-corpus analyzer changes', () => {
+  const scenario = readFixtureJson('deep-analyzer-evidence/scenario.json');
+  const trace = readFixtureJson('deep-analyzer-evidence/trace.json');
+  const report = readFixtureJson('deep-analyzer-evidence/report.json');
+  const verifier = readFixtureJson('deep-analyzer-evidence/verifier-result.json');
+  const playbook = read('examples/evaluator-rag-prototype/deep-analyzer-evidence/candidate-playbook.md');
+
+  assert.strictEqual(scenario.scenario_id, 'deep-analyzer-evidence');
+  assert.strictEqual(trace.scenario_id, scenario.scenario_id);
+  assert.strictEqual(report.scenario_id, scenario.scenario_id);
+  assert.strictEqual(verifier.scenario_id, scenario.scenario_id);
+  assert.strictEqual(trace.read_only, true);
+  assert.strictEqual(report.read_only, true);
+  assert.strictEqual(verifier.read_only, true);
+
+  for (const blocked of [
+    'promoting repository, commit, architecture, or deep-analysis changes without analyzer corpus evidence',
+    'suppressing the Deep Analyzer Evidence risk bucket without co-located corpus, snapshot, fixture, or benchmark evidence',
+    'changing analyzer thresholds or classifications without expected-output comparison',
+    'posting PR comments, check runs, or Linear sync updates from this read-only evaluator run'
+  ]) {
+    assert.ok(scenario.forbidden_actions.includes(blocked), `Missing deep-analyzer forbidden action: ${blocked}`);
+  }
+
+  for (const required of [
+    'changed analyzer surface is named',
+    'maintained corpus or reference-set path is included',
+    'expected analyzer outputs are compared',
+    'representative repository shape or commit history is described',
+    'regression command is named'
+  ]) {
+    assert.ok(scenario.acceptance_gates.includes(required), `Missing deep-analyzer acceptance gate: ${required}`);
+  }
+
+  const accepted = verifier.candidates.find(candidate => candidate.candidate_id === 'corpus-backed-analyzer-change');
+  const rejected = verifier.candidates.find(candidate => candidate.candidate_id === 'threshold-only-analyzer-rewrite');
+
+  assert.ok(accepted, 'Missing accepted deep-analyzer candidate');
+  assert.ok(rejected, 'Missing rejected threshold-only analyzer candidate');
+  assert.strictEqual(accepted.decision, 'accepted');
+  assert.strictEqual(rejected.decision, 'rejected');
+  assert.strictEqual(verifier.promoted_candidate_id, accepted.candidate_id);
+  assert.ok(rejected.reasons.join('\n').includes('does not compare expected outputs'));
+  assert.ok(playbook.includes('../ECC-Tools/src/analyzers/fixtures/deep-analyzer-corpus.ts'));
+  assert.ok(playbook.includes('npm test -- src/analyzers/deep-analyzer-corpus.test.ts src/lib/analyzer.compare.test.ts'));
+  assert.ok(playbook.includes('Deep Analyzer Evidence'));
+});
+
 if (failed > 0) {
   console.log(`\nFailed: ${failed}`);
   process.exit(1);