Mirror of https://github.com/ultraworkers/claw-code.git, synced 2026-04-25 21:54:09 +08:00.
## 3 Commits
### test: cycle #30 — lock OPT_OUT surface rejection (close parity test gap) (`b903e1605f`)
Cycle #30 dogfood found a testing gap: OPT_OUT surfaces were classified in code, but their REJECTION behavior was never regression-tested.

#### The Gap

OPT_OUT_AUDIT.md declares 12 surfaces as intentionally exempt from --output-format. The test suite had:

- ✅ test_clawable_surface_has_output_format (CLAWABLE must accept)
- ✅ test_every_registered_command_is_classified (no orphans)
- ❌ Nothing verifying OPT_OUT surfaces REJECT --output-format

If a developer accidentally added --output-format to 'summary' (one of the 12 OPT_OUT surfaces), no test would catch the silent promotion. The classification was governed, but the rejection behavior was NOT.

#### What Changed

Added TestOptOutSurfaceRejection to test_cli_parity_audit.py with 14 tests:

1. **12 parametrized tests** — one per OPT_OUT surface, verifying each rejects --output-format with an argparse error.
2. **test_opt_out_set_matches_audit_document** — verifies the OPT_OUT_SURFACES constant matches the 12 surfaces declared in OPT_OUT_AUDIT.md.
3. **test_opt_out_count_matches_declared** — sanity check that the count stays at 12 as documented.

#### Symmetry Achieved

Before (only CLAWABLE acceptance tested):

- CLAWABLE accepts --output-format ✅
- OPT_OUT behavior: untested

After (full parity coverage):

- CLAWABLE accepts --output-format ✅
- OPT_OUT rejects --output-format ✅
- Audit doc ↔ constant kept in sync ✅

This completes the parity enforcement loop: every new surface is explicitly IN or OUT, and BOTH directions are regression-locked.

#### Promotion Path Preserved

When a real OPT_OUT surface gains genuine demand (per OPT_OUT_DEMAND_LOG.md):

1. Move it from OPT_OUT_SURFACES to CLAWABLE_SURFACES.
2. Update OPT_OUT_AUDIT.md with the promotion rationale.
3. Remove it from this test's expected rejections.
4. Tests pass (the rejection test no longer runs; the acceptance test now applies).

Graceful promotion; no accidental drift.
#### Test Count

- 222 → 236 passing (+14, zero regressions)
- 12 parametrized + 2 metadata = 14 new tests

#### Discipline Check

Per cycle #24 calibration:

- Red-state bug? ✗ (no broken behavior)
- Real friction? ✓ (testing gap discovered by dogfood)
- Evidence-backed? ✓ (systematic probe revealed missing coverage)

This extends the cycle #27 taxonomy (structural / quality / cross-channel / text-vs-JSON divergence) into classification: not just "is the envelope right?" but "is the OPPOSITE of the envelope right?" Future cycles can apply the same principle to other classifications: every governed non-goal deserves regression tests that lock in its non-goal status.

Classification:

- Real friction: ✓ (cycle #30 dogfood)
- Evidence-backed: ✓ (gap discovered by systematic surface audit)
- Same-cycle fix: ✓ (maintainership discipline)

Source: Jobdori cycle #30 proactive dogfood — probed all 26 subcommands with --output-format json and noticed that the OPT_OUT rejection pattern was unverified by any dedicated test.
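The rejection behavior the new tests lock down can be sketched against a toy parser. Everything below is a hypothetical miniature: the real CLI's surface names, build function, and test suite are not reproduced here; only the argparse exit-code behavior is the point.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical miniature of the claw-code CLI, not the real registration code.
    parser = argparse.ArgumentParser(prog="claw-code")
    sub = parser.add_subparsers(dest="command")
    clawable = sub.add_parser("bootstrap")   # CLAWABLE: must accept the flag
    clawable.add_argument("--output-format", choices=["text", "json"], default="text")
    sub.add_parser("summary")                # OPT_OUT: flag intentionally absent
    return parser

def rejects_output_format(command: str) -> bool:
    # argparse reports an unrecognized option by exiting with status 2,
    # which is exactly what the parametrized OPT_OUT tests assert.
    try:
        build_parser().parse_args([command, "--output-format", "json"])
    except SystemExit as exc:
        return exc.code == 2
    return False
```

A parametrized pytest test would call a check like `rejects_output_format` once per OPT_OUT surface; the symmetric CLAWABLE test asserts the opposite.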
### feat: #164 Stage B CLOSURE — turn-loop JSON + cancel_observed coverage + CLAWABLE promotion (`11f9e8a5a2`)
Closes all three gaebal-gajae-identified closure criteria for #164 Stage B:

1. The turn-loop runtime surface exposes cancel_observed consistently.
2. Cancellation-path tests validate safe-to-reuse semantics.
3. turn-loop is promoted from OPT_OUT to CLAWABLE surface.

Changes:

src/main.py:
- turn-loop accepts --output-format {text,json}
- JSON envelope includes per-turn cancel_observed + final_cancel_observed
- All turn fields exposed: prompt, output, stop_reason, cancel_observed, matched_commands, matched_tools
- Exit code 2 on final timeout preserved

tests/test_cli_parity_audit.py:
- CLAWABLE_SURFACES now contains 14 commands (was 13)
- Removed 'turn-loop' from OPT_OUT_SURFACES
- Parametrized --output-format test auto-validates turn-loop JSON

tests/test_cancel_observed_field.py (new, 9 tests):
- TestCancelObservedField (5 tests): field contract
  - default False
  - explicit True preserved
  - normal completion → False
  - bootstrap JSON exposes the field
  - turn-loop JSON exposes the per-turn field
- TestCancelObservedSafeReuseSemantics (2 tests): reuse contract
  - timeout result has cancel_observed=True when signaled
  - engine.mutable_messages not corrupted after a cancelled turn
  - engine accepts a fresh message after cancellation
- TestCancelObservedSchemaCompliance (2 tests): SCHEMAS.md contract
  - cancel_observed is always bool
  - final_cancel_observed convenience field present

Closure criteria validated:

- ✅ Field exposed in bootstrap JSON
- ✅ Field exposed per-turn in turn-loop JSON
- ✅ Field is always bool, never null
- ✅ Safe to reuse: the engine can accept fresh messages after cancellation
- ✅ mutable_messages not corrupted by a cancelled turn
- ✅ turn-loop promoted from OPT_OUT (14 clawable commands now)

The protocol now distinguishes at runtime:

- timeout + cancel_observed=false → infra/wedge (escalate)
- timeout + cancel_observed=true → cooperative cancellation (safe to retry)

Test results: 182 → 192 passing, +10 tests, zero regressions, 3 skipped unchanged.

Closes #164 Stage B. Stage C (async-native preemption) remains future work.
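The runtime distinction above can be sketched as a hypothetical consumer of the turn-loop envelope. The field names cancel_observed and stop_reason come from the commit; the stop_reason values ("timeout", "end_turn") and the triage function itself are illustrative assumptions, not the project's API.

```python
import json

def classify_timeout(envelope_json: str) -> str:
    """Triage a turn result using the cancel_observed contract (sketch)."""
    envelope = json.loads(envelope_json)
    # "timeout" / "end_turn" are assumed stop_reason values for illustration.
    if envelope.get("stop_reason") != "timeout":
        return "ok"
    if envelope.get("cancel_observed", False):
        # Cooperative cancellation was observed: engine is safe to reuse.
        return "retry"
    # Timeout with no cancellation observed: infra problem or wedge.
    return "escalate"
```

Because cancel_observed is guaranteed to be a bool (never null) by the schema tests, a consumer can branch on it without defensive type checks.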
### fix: #171 — automate cross-surface CLI parity audit via argparse introspection (`b048de8899`)
Stops cross-surface parity from depending on human-noticed manual inspection. When
a developer adds a new subcommand to the claw-code CLI, this test suite
enforces explicit classification:
- CLAWABLE_SURFACES: MUST accept --output-format {text,json}
- OPT_OUT_SURFACES: explicitly exempt with documented rationale
A new command that forgets to opt into one of these two sets FAILS
loudly with TestCommandClassificationCoverage::test_every_registered_command_is_classified.
No silent drift possible.
Technique: argparse introspection at test time walks the _actions tree,
discovers every registered subcommand, and compares against the declared
classification sets. Contract is enforced machine-first instead of
depending on human review.
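The introspection walk can be sketched as follows. The parser and the classification sets are illustrative stand-ins; the real CLI and the real constants in the test suite are not shown. Note that `_actions` and `_SubParsersAction` are private argparse internals, which is part of the trade-off this technique accepts.

```python
import argparse

# Toy parser standing in for the real CLI; the introspection walk is the point.
parser = argparse.ArgumentParser(prog="claw-code")
subparsers = parser.add_subparsers(dest="command")
for name in ("bootstrap", "route", "summary"):
    subparsers.add_parser(name)

def discover_subcommands(parser: argparse.ArgumentParser) -> set:
    """Walk the parser's _actions tree and collect every registered subcommand."""
    names = set()
    for action in parser._actions:
        if isinstance(action, argparse._SubParsersAction):
            names.update(action.choices)  # choices maps name -> subparser
    return names

# Illustrative classification sets; the real ones live in test_cli_parity_audit.py.
CLAWABLE_SURFACES = {"bootstrap", "route"}
OPT_OUT_SURFACES = {"summary"}

# The coverage invariant: every discovered command must be in exactly one set.
unclassified = discover_subcommands(parser) - CLAWABLE_SURFACES - OPT_OUT_SURFACES
```

A test then asserts `unclassified == set()`, so an unregistered newcomer fails with the offending command name in the diff.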
Three test classes covering three invariants:
TestClawableSurfaceParity (14 tests):
- test_all_clawable_surfaces_accept_output_format: every member of
CLAWABLE_SURFACES has --output-format flag registered
- test_clawable_surface_output_format_choices (parametrised over 13
commands): each must accept exactly {text, json} and default to 'text'
for backward compat
TestCommandClassificationCoverage (3 tests):
- test_every_registered_command_is_classified: any new subcommand
must be explicitly added to CLAWABLE_SURFACES or OPT_OUT_SURFACES
- test_no_command_in_both_sets: sanity check for classification conflicts
- test_all_classified_commands_actually_exist: no phantom commands
(catches stale entries after a command is removed)
TestJsonOutputContractEndToEnd (10 tests):
- test_command_emits_parseable_json (parametrised over 10 clawable
commands): actual subprocess invocation with --output-format json
produces valid parseable JSON on stdout
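The end-to-end contract check can be sketched in a self-contained form. Here a child Python process plays the role of the real claw-code binary (which is not available in this sketch) and emits a JSON envelope on stdout; the assertions mirror the shape of the test: exit code 0 and parseable JSON.

```python
import json
import subprocess
import sys

# Stand-in for `claw-code <command> --output-format json`: a child process
# that prints a JSON envelope to stdout.
child = [sys.executable, "-c", "import json; print(json.dumps({'ok': True}))"]
result = subprocess.run(child, capture_output=True, text=True, timeout=30)

assert result.returncode == 0
envelope = json.loads(result.stdout)  # raises ValueError if stdout is not JSON
```

In the real suite this pattern is parametrised over the clawable commands, so any surface whose stdout is polluted by non-JSON text fails immediately.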
Classification:
CLAWABLE_SURFACES (13):
- Session lifecycle: list-sessions, delete-session, load-session, flush-transcript
- Inspect: show-command, show-tool
- Execution: exec-command, exec-tool, route, bootstrap
- Diagnostic inventory: command-graph, tool-pool, bootstrap-graph

OPT_OUT_SURFACES (12):
- Rich-Markdown reports (future JSON schema): summary, manifest, parity-audit, setup-report
- List filter commands: subsystems, commands, tools
- Turn-loop: structured_output is future work
- Simulation/debug: remote-mode, ssh-mode, teleport-mode, direct-connect-mode, deep-link-mode
Full suite: 141 → 168 passing (+27), zero regression.
Closes ROADMAP #171.
Why this matters:
- Before: parity was human-monitored; every new command was a drift risk. The CLUSTER 3 sweep required manually auditing every subcommand and landing fixes as separate pinpoints.
- After: parity is machine-enforced. If a future developer adds a new command without --output-format, the test suite blocks it immediately with a concrete error message pointing at the missing flag.
This is the first step in Gaebal-gajae's identified upper-level work:
operationalised parity instead of aspirational parity.
Related clusters:
- Clawability principle: machine-first protocol enforcement
- Test-first regression guard: extends TestTripletParityConsistency
(#160/#165) and TestFullFamilyParity (#166) from per-cluster
parity to cross-surface parity