Cycle #22 ships documentation that operationalizes cycles #178–#179.
Problem context:
After #178 (parse-error envelope) and #179 (stderr hygiene + real error message),
claws can now build a unified error handler for all 14 clawable commands.
But there was no guide on how to actually do that. Operators had the pieces;
they didn't have the pattern.
This file changes that.
New file: ERROR_HANDLING.md
- Quick reference: exit codes + envelope shapes (0=success, 1=error, 2=timeout)
- One-handler pattern: ~80 lines of Python showing how to parse error.kind,
check retryable, and decide recovery strategy
- Four practical recovery patterns:
- Retry on transient errors (filesystem, timeout)
- Reuse session after timeout (if cancel_observed=true)
- Validate command syntax before dispatch (dry-run --help)
- Log errors for observability
- Error kinds enumeration (parse, session_not_found, filesystem, runtime, timeout)
- Common mistakes to avoid (6 patterns with BAD vs GOOD examples)
- Testing your error handler (unit test examples)
Operational impact:
Orchestration code now has a canonical pattern. Claws can:
- Copy-paste the run_claw_command() function (works for all commands)
- Classify errors uniformly (no special cases per command)
- Decide recovery deterministically (error.kind + retryable + cancel_observed)
- Log/monitor/escalate with confidence
Related cycles:
- #178: Parse-error envelope (commands now emit structured JSON on invalid argv)
- #179: Stderr hygiene + real message (JSON mode silences argparse, carries actual error)
- #164 Stage B: cancel_observed field (callers know if session is safe for reuse)
Updated CLAUDE.md:
- Added ERROR_HANDLING.md to 'Related docs' section
- Now documents the one-handler pattern as a guideline
No code changes. No test changes. Pure documentation.
This completes the documentation trail from protocol (SCHEMAS.md) →
governance (OPT_OUT_AUDIT.md, OPT_OUT_DEMAND_LOG.md) → practice (ERROR_HANDLING.md).
Cycle #21 ships governance infrastructure, not implementation. Maintainership
mode means sometimes the right deliverable is a decision framework, not code.
Problem context:
OPT_OUT_AUDIT.md (cycle #18 bonus) established 'demand-backed audit' as the
next step. But without a structured way to record demand signals, 'demand-backed'
was just a slogan — the next audit cycle would have no evidence to work from.
This commit creates the evidentiary base:
New file: OPT_OUT_DEMAND_LOG.md
- Per-surface entries for all 12 OPT_OUT commands (Groups A/B/C)
- Current state: 0 signals across all surfaces (consistent with audit prediction)
- Signal entry template with required fields:
- Source (who/what)
- Use case (concrete orchestration problem)
- Markdown-alternative-checked (why existing output insufficient)
- Date
- Promotion thresholds:
- 2+ independent signals for same surface → file promotion pinpoint
- 1 signal + existing stable schema → file pinpoint for discussion
- 0 signals → stays OPT_OUT (rationale preserved)
Decision framework for cycle #22 (audit close):
- If 0 signals total: move to PERMANENTLY_OPT_OUT, close audit
- If 1-2 signals: file individual promotion pinpoints with evidence
- If 3+ signals: reopen audit, question classification itself
Updated files:
- OPT_OUT_AUDIT.md: Added demand log reference in Related section
- CLAUDE.md: Added prerequisites for promotions (must have logged signals),
added 'File a demand signal' workflow section
Philosophy:
'Prevent speculative expansion' — schema bloat protection discipline.
Every new CLAWABLE surface is a maintenance tax. Evidence requirement keeps
the protocol lean. OPT_OUT surfaces are intentionally not-clawable until
proven otherwise by external demand.
Operational impact:
Next cycles can now:
1. Watch for real claws hitting OPT_OUT surface limits
2. Log signals in structured format (no ad-hoc filing)
3. Run audit at cycle #22 with actual data, not speculation
No code changes. No test changes. Pure governance infrastructure.
Related: #18 cycle (OPT_OUT_AUDIT.md), maintainership phase transition.
Dogfood discovered #178 had two residual gaps:
1. Stderr pollution: argparse usage + error text still leaked to stderr even in
JSON mode (envelope was correct on stdout, but stderr noise broke the
'machine-first protocol' contract — claws capturing both streams got dual output)
2. Generic error message: envelope carried 'invalid command or argument (argparse
rejection)' instead of argparse's actual text like 'the following arguments
are required: session_id' or 'invalid choice: typo (choose from ...)'
Before #179:
$ claw load-session --output-format json
[stdout] {"error": {"message": "invalid command or argument (argparse rejection)"}}
[stderr] usage: main.py load-session [-h] ...
main.py load-session: error: the following arguments are required: session_id
[exit 1]
After #179:
$ claw load-session --output-format json
[stdout] {"error": {"message": "the following arguments are required: session_id"}}
[stderr] (empty)
[exit 1]
Implementation:
- New _ArgparseError exception class captures argparse's real message
- main() monkey-patches parser.error (+ all subparser.error) in JSON mode to raise
_ArgparseError instead of print-to-stderr + sys.exit(2)
- _emit_parse_error_envelope() now receives the real message verbatim
- Text mode path unchanged: still uses original argparse print+exit behavior
Contract:
- JSON mode: stdout carries envelope with argparse's actual error; stderr silent
- Text mode: unchanged — argparse usage to stderr, exit 2
- Parse errors still error.kind='parse', retryable=false
Test additions (5 new, 14 total in test_parse_error_envelope.py):
- TestParseErrorStderrHygiene (5):
- test_json_mode_stderr_is_silent_on_unknown_command
- test_json_mode_stderr_is_silent_on_missing_arg
- test_json_mode_envelope_carries_real_argparse_message
- test_json_mode_envelope_carries_invalid_choice_details (verifies valid-choices list)
- test_text_mode_stderr_preserved_on_unknown_command (backward compat)
Operational impact:
Claws capturing both stdout and stderr no longer get garbled output. The envelope
message now carries discoverability info (valid command list, missing-arg name)
that claws can use for retry/recovery without probing the CLI a second time.
Test results: 201 → 206 passing, 3 skipped unchanged, zero regression.
Pinpoint discovered via dogfood at 2026-04-22 20:30 KST (cycle #20).
Filed explicit decision criteria for the 12 OPT_OUT surfaces (commands that do
not support --output-format json) documented in test_cli_parity_audit.py.
Categorized by rationale:
- Group A (4): Rich-Markdown reports (summary, manifest, parity-audit, setup-report)
Markdown-as-output is intentional; JSON would be information loss.
Unlikely promotions (remain OPT_OUT long-term).
- Group B (3): List filters with --query/--limit (subsystems, commands, tools)
Query layer already exists; users have escape hatch.
Remain OPT_OUT (promotion effort >> value).
- Group C (5): Simulation/debug surfaces (remote-mode, ssh-mode, teleport-mode,
direct-connect-mode, deep-link-mode)
Intentionally non-production; JSON output doesn't add value.
Remain OPT_OUT (simulation tools, not orchestration endpoints).
Audit workflow documented:
1. Survey: Check if external claws actually request JSON versions
2. Cost estimate: Schema + tests for each surface
3. Value estimate: Real demand vs hypothetical
4. Decision: CLAWABLE, remain OPT_OUT, or new pinpoint
Promotion criteria locked (only if clear use case + schema simple + demand exists).
Outcome prediction: All 12 likely remain OPT_OUT (documented rationale per group).
Timeline: Survey period (cycles #19–#21), final decision (cycle #22).
Related pinpoints: #175 (summary/manifest JSON parallel?), #176 (--query-json?),
#177 (mode simulators ever CLAWABLE?).
This closes the documentation loop from cycles #173–#174 (protocol closure →
field evolution → reframe). Now governance rules are explicit for future work.
#164 Stage B requires exposing whether cancellation was observed at the
turn-result level. This commit adds the infrastructure field:
Changes:
- TurnResult.cancel_observed: bool = False (query_engine.py)
- _build_timeout_result() accepts cancel_observed parameter (runtime.py)
- Two timeout paths now pass cancel_event.is_set() to signal observation (runtime.py)
- bootstrap command includes cancel_observed in turn JSON (main.py)
- SCHEMAS.md documents Turn Result Fields with cancel_observed contract
Usage:
When a turn timeout occurs, cancel_observed=true indicates that the
engine observed the cancellation event being set. This allows callers
to distinguish:
- timeout with no cancel → infrastructure/network stall
- timeout with cancel observed → cooperative cancellation was triggered
Backward compat:
- Existing TurnResult construction without cancel_observed defaults to False
- bootstrap JSON output still validates per SCHEMAS.md (new field is always present)
Test results: 182 passing, 3 skipped, zero regression.
Related: #161 (wall-clock timeout), #164 (cancellation observability protocol)
ROADMAP continues #164 with Stage C (test coverage for cancellation + turn envelope).
Adds parametrised test suite validating that clawable-surface commands'
JSON output matches their declared envelope contracts per SCHEMAS.md.
Two phases:
Phase 1 (this commit): Consistency baseline.
- Collect ENVELOPE_CONTRACTS registry mapping each command to its
required and optional fields
- TestJsonEnvelopeConsistency: parametrised test iterates over 13
commands, invokes with --output-format json, validates that
actual JSON envelope contains all required fields
- test_envelope_field_value_types: spot-check types (int, str, list)
for consistency
Phase 2 (future #173): Common field wrapping.
- Once wrap_json_envelope() is applied, all commands will emit
timestamp, command, exit_code, output_format, schema_version
- Currently skipped via @pytest.mark.skip, these tests will activate
automatically when wrapping is implemented:
TestJsonEnvelopeCommonFieldPrep::test_all_envelopes_include_timestamp
TestJsonEnvelopeCommonFieldPrep::test_all_envelopes_include_command
TestJsonEnvelopeCommonFieldPrep::test_all_envelopes_include_exit_code_and_schema_version
Why this matters:
- #172 documented the JSON contract; this test validates it
- Currently detects when actual output diverges from SCHEMAS.md
(e.g. list-sessions emits 'count', not 'sessions_count')
- As #173 wraps commands, test suite auto-validates new common fields
- Prevents regression: accidental field removal breaks the test suite
Current status: 11 passed (consistency), 6 skipped (awaiting #173)
Full suite: 168 → 179 passing, zero regression.
Closes ROADMAP #173 prep (framework for common field validation).
Actual field wrapping remains for next cycle.
Stops manual parity inspection from being a human-noticed concern. When
a developer adds a new subcommand to the claw-code CLI, this test suite
enforces explicit classification:
- CLAWABLE_SURFACES: MUST accept --output-format {text,json}
- OPT_OUT_SURFACES: explicitly exempt with documented rationale
A new command that forgets to opt into one of these two sets FAILS
loudly with TestCommandClassificationCoverage::test_every_registered_
command_is_classified. No silent drift possible.
Technique: argparse introspection at test time walks the _actions tree,
discovers every registered subcommand, and compares against the declared
classification sets. Contract is enforced machine-first instead of
depending on human review.
Three test classes covering three invariants:
TestClawableSurfaceParity (14 tests):
- test_all_clawable_surfaces_accept_output_format: every member of
CLAWABLE_SURFACES has --output-format flag registered
- test_clawable_surface_output_format_choices (parametrised over 13
commands): each must accept exactly {text, json} and default to 'text'
for backward compat
TestCommandClassificationCoverage (3 tests):
- test_every_registered_command_is_classified: any new subcommand
must be explicitly added to CLAWABLE_SURFACES or OPT_OUT_SURFACES
- test_no_command_in_both_sets: sanity check for classification conflicts
- test_all_classified_commands_actually_exist: no phantom commands
(catches stale entries after a command is removed)
TestJsonOutputContractEndToEnd (10 tests):
- test_command_emits_parseable_json (parametrised over 10 clawable
commands): actual subprocess invocation with --output-format json
produces valid parseable JSON on stdout
Classification:
CLAWABLE_SURFACES (13):
Session lifecycle: list-sessions, delete-session, load-session,
flush-transcript
Inspect: show-command, show-tool
Execution: exec-command, exec-tool, route, bootstrap
Diagnostic inventory: command-graph, tool-pool, bootstrap-graph
OPT_OUT_SURFACES (12):
Rich-Markdown reports (future JSON schema): summary, manifest,
parity-audit, setup-report
List filter commands: subsystems, commands, tools
Turn-loop: structured_output is future work
Simulation/debug: remote-mode, ssh-mode, teleport-mode,
direct-connect-mode, deep-link-mode
Full suite: 141 → 168 passing (+27), zero regression.
Closes ROADMAP #171.
Why this matters:
Before: parity was human-monitored; every new command was a drift
risk. The CLUSTER 3 sweep required manually auditing every
subcommand and landing fixes as separate pinpoints.
After: parity is machine-enforced. If a future developer adds a new
command without --output-format, the test suite blocks it
immediately with a concrete error message pointing at the
missing flag.
This is the first step in Gaebal-gajae's identified upper-level work:
operationalised parity instead of aspirational parity.
Related clusters:
- Clawability principle: machine-first protocol enforcement
- Test-first regression guard: extends TestTripletParityConsistency
(#160/#165) and TestFullFamilyParity (#166) from per-cluster
parity to cross-surface parity
Final diagnostic surface in the JSON parity sweep: bootstrap-graph
(the runtime bootstrap/prefetch visualization) now supports --output-format.
Concrete addition:
- bootstrap-graph: --output-format {text,json}
JSON envelope:
{stages: [str], note: 'bootstrap-graph is markdown-only in this version'}
Envelope explanation: bootstrap-graph's Markdown output is rich and
textual; raw JSON embedding maintains the markdown format (split into
lines array) rather than attempting lossy structural extraction that
would lose information. This is an honest limitation in this cycle;
full JSON schema can be added in a future audit if claws require
structured bootstrap data (dependency graphs, prefetch timing, etc.).
Backward compatibility:
- Default is 'text' (Markdown unchanged)
Closes ROADMAP #170.
Related: #167, #168, #169. Diagnostic/inventory surface family is now
uniformly JSON-capable. Summary, manifest, parity-audit, setup-report,
command-graph, tool-pool, bootstrap-graph all accept --output-format.
Extends the diagnostic surface audit with the two inventory-structure
commands: command-graph (command family segmentation) and tool-pool
(assembled tool inventory). Both now expose their underlying rich
datastructures via JSON envelope.
Concrete additions:
- command-graph: --output-format {text,json}
- tool-pool: --output-format {text,json}
JSON envelope shapes:
command-graph:
{builtins_count, plugin_like_count, skill_like_count, total_count,
builtins: [{name, source_hint}],
plugin_like: [{name, source_hint}],
skill_like: [{name, source_hint}]}
tool-pool:
{simple_mode, include_mcp, tool_count,
tools: [{name, source_hint}]}
Backward compatibility:
- Default is 'text' (Markdown unchanged)
- Text output byte-identical to pre-#169
Tests (4 new, test_command_graph_tool_pool_output_format.py):
- TestCommandGraphOutputFormat (2): JSON structure + text compat
- TestToolPoolOutputFormat (2): JSON structure + text compat
Full suite: 137 → 141 passing, zero regression.
Closes ROADMAP #169.
Why this matters:
Claws auditing the codebase can now ask 'what commands exist' and
'what tools exist' and get structured, parseable answers instead of
regex-parsing Markdown headers and counting list items.
Related clusters:
- Diagnostic surfaces (#169 adds to #167/#168 work-verb parity)
- Inventory introspection (command-graph + tool-pool are the two
foundational 'what do we have?' queries)
Closes the inspect-capability parity gap: show-command and show-tool were
the only discovery/inspection CLI commands lacking --output-format support,
making them outliers in the ecosystem that already had unified JSON
contracts across list-sessions, load-session, delete-session, and
flush-transcript (#160/#165/#166).
Concrete additions:
- show-command: --output-format {text,json}
- show-tool: --output-format {text,json}
JSON envelope shape (found case):
{name, found: true, source_hint, responsibility}
JSON envelope shape (not-found case):
{name, found: false, error: {kind:'command_not_found'|'tool_not_found',
message, retryable: false}}
Exit codes:
0 = success
1 = not found
Backward compatibility:
- Default (no --output-format) is 'text' (unchanged)
- Text output byte-identical to pre-#167 (three newline-separated lines)
Tests (10 new, test_show_command_tool_output_format.py):
- TestShowCommandOutputFormat (5): found + not-found in JSON; text mode
backward compat; text is default
- TestShowToolOutputFormat (3): found + not-found in JSON; text mode
backward compat
- TestShowCommandToolFormatParity (2): both accept same flag choices;
consistent JSON envelope shape
Full suite: 114 → 124 passing, zero regression.
Closes ROADMAP #167.
Why this matters:
Before: Claws calling show-command/show-tool had to parse human-readable
prose output via regex, with no structured error signal.
After: Same envelope contract as load-session and friends: JSON-first,
typed errors, machine-parseable.
Related clusters:
- Session-lifecycle CLI parity family (#160, #165, #166, #167)
- Machine-readable error contracts (same vein as #162 atomicity + #164
cancellation state-safety: structured boundaries for orchestration)
Closes the #161 follow-up gap identified in review: wall-clock timeout
bounded caller-facing wait but did not cancel the underlying provider
thread, which could silently mutate mutable_messages / transcript_store /
permission_denials / total_usage after the caller had already observed
stop_reason='timeout'. A ghost turn committed post-deadline would poison
any session that got persisted afterwards.
Stage A scope (this commit): runtime + engine layer cooperative cancel.
Engine layer (src/query_engine.py):
- submit_message now accepts cancel_event: threading.Event | None = None
- Two safe checkpoints:
1. Entry (before max_turns / budget projection) — earliest possible return
2. Post-budget (after output synthesis, before mutation) — catches cancel
that arrives while output was being computed
- Both checkpoints return stop_reason='cancelled' with state UNCHANGED
(mutable_messages, transcript_store, permission_denials, total_usage
all preserved exactly as on entry)
- cancel_event=None preserves legacy behaviour with zero overhead (no
checkpoint checks at all)
Runtime layer (src/runtime.py):
- run_turn_loop creates one cancel_event per invocation when a deadline
is in play (and None otherwise, preserving legacy fast path)
- Passes the same event to every submit_message call across turns, so a
late cancel on turn N-1 affects turn N
- On timeout (either pre-call or mid-call), runtime explicitly calls
cancel_event.set() before future.cancel() + synthesizing the timeout
TurnResult. This upgrades #161's best-effort future.cancel() (which
only cancels not-yet-started futures) to cooperative mid-flight cancel.
Stop reason taxonomy after Stage A:
'completed' — turn committed, state mutated exactly once
'max_budget_reached' — overflow, state unchanged (#162)
'max_turns_reached' — capacity exceeded, state unchanged
'cancelled' — cancel_event observed, state unchanged (#164 Stage A)
'timeout' — synthesised by runtime, not engine (#161)
The 'cancelled' vs 'timeout' split matters:
- 'timeout' is the runtime's best-effort signal to the caller: deadline hit
- 'cancelled' is the engine's confirmation: cancel was observed + honoured
If the provider call wedges entirely (never reaches a checkpoint), the
caller still sees 'timeout' and the thread is leaked — but any NEXT
submit_message call on the same engine observes the event at entry and
returns 'cancelled' immediately, preventing ghost-turn accumulation.
This is the honest cooperative limit in Python threading land; true
preemption requires async-native provider IO (future work, not Stage A).
Tests (29 new tests, tests/test_submit_message_cancellation.py + tests/
test_run_turn_loop_cancellation.py):
Engine-layer (12 tests):
- TestCancellationBeforeCall (5): pre-set event returns 'cancelled' immediately;
mutable_messages, transcript_store, usage, permission_denials all preserved
- TestCancellationAfterBudgetCheck (1): cancel set mid-call (after projection,
before commit) still honoured; output synthesised but state untouched
- TestCancellationAfterCommit (2): post-commit cancel not observable (honest
limit) BUT next call on same engine observes it + returns 'cancelled'
- TestLegacyCallersUnchanged (3): cancel_event=None preserves #162 atomicity
+ max_turns contract with zero behaviour change
- TestCancellationVsOtherStopReasons (2): cancel precedes max_turns check;
cancel does not retroactively override a completed turn
Runtime-layer (5 tests):
- TestTimeoutPropagatesCancelEvent (3): submit_message receives a real Event
object when deadline is set; None in legacy mode; timeout actually calls
event.set() so in-flight threads observe at their next checkpoint
- TestCancelEventSharedAcrossTurns (1): same event object passed to every
turn (object identity check) — late cancel on turn N-1 must affect turn N
Regression: 3 existing timeout test mocks updated to accept cancel_event
kwarg (mocks that previously had signature (prompt, commands, tools, denials)
now have (prompt, commands, tools, denials, cancel_event=None) since runtime
passes cancel_event positionally on the timeout path).
Full suite: 97 → 114 passing, zero regression.
Closes ROADMAP #164 Stage A.
What's explicitly NOT in Stage A:
- Preemptive cancellation of wedged provider IO (requires asyncio-native
provider path; larger refactor)
- Timeout on the legacy unbounded run_turn_loop path (by design: legacy
callers opt out of cancellation entirely)
- CLI exposure of 'cancelled' as a distinct exit code (currently 'cancelled'
maps to the same stop_reason != 'completed' break condition as others;
CLI surface for cancel is a separate pinpoint if warranted)
Every 'claw flush-transcript' call without --directory writes to
.port_sessions/<uuid>.json in CWD. Without a gitignore entry, every
dogfood run leaves dozens of untracked files in the repo, masking real
changes in 'git status' output.
Now that #160/#166 ship structured session lifecycle commands and
deterministic --session-id, this directory is purely transient by
default — belongs in .gitignore.
#159: multi-turn sessions had a silent security asymmetry: denied_tools
were always empty in run_turn_loop, even though bootstrap_session inferred
them from the routed matches. Result: any tool gated as 'destructive'
(bash-family commands, rm, etc) would silently appear unblocked across all
turns in multi-turn mode, giving a false 'clean' permission picture to any
claw consuming TurnResult.permission_denials.
Fix: compute denied_tools once at loop start via _infer_permission_denials,
then pass the same denials to every submit_message call (both timeout and
legacy unbounded paths). This mirrors the existing bootstrap_session pattern.
Acceptance: run_turn_loop('run bash ls').permission_denials now matches
what bootstrap_session returns — both infer the same denials from the
routed matches. Multi-turn security posture is symmetric.
Tests (tests/test_run_turn_loop_permissions.py, 2 tests):
- test_turn_loop_surfaces_permission_denials_like_bootstrap: Symmetry
check confirming both paths infer identical denials for destructive tools
- test_turn_loop_with_continuation_preserves_denials: Denials inferred at
loop start are passed consistently to all turns; captured via mock and
verified non-empty
Full suite: 82/82 passing, zero regression.
Closes ROADMAP #159.
The #160 session-lifecycle CLI triplet was asymmetric: list-sessions and
delete-session accepted --directory + --output-format and emitted typed
JSON error envelopes, but load-session had neither flag and dumped a raw
Python traceback (including the SessionNotFoundError class name) on a
missing session.
Three concrete impacts this fix closes:
1. Alternate session-store locations (e.g. /tmp/claw-run-XXX/.port_sessions)
were unreachable via load-session; claws had to chdir or monkeypatch
DEFAULT_SESSION_DIR to work around it.
2. Not-found emitted a multi-line Python stack, not a parseable envelope.
Claws deciding retry/escalate/give-up had only exit code 1 to work with.
3. The traceback leaked 'src.session_store.SessionNotFoundError' verbatim,
coupling version-pinned claws to our internal exception class name.
Now all three triplet commands accept the same flag pair and emit the
same JSON error shape:
Success (json mode):
{"session_id": "alpha", "loaded": true, "messages_count": 3,
"input_tokens": 42, "output_tokens": 99}
Not-found:
{"session_id": "missing", "loaded": false,
"error": {"kind": "session_not_found",
"message": "session 'missing' not found in /path",
"directory": "/path", "retryable": false}}
Corrupted file:
{"session_id": "broken", "loaded": false,
"error": {"kind": "session_load_failed",
"message": "...", "directory": "/path",
"retryable": true}}
Exit code contract:
- 0 on successful load
- 1 on not-found (preserves existing $?)
- 1 on OSError/JSONDecodeError (distinct 'kind' in JSON)
Backward compat: legacy 'claw load-session ID' text output unchanged
byte-for-byte. Only new behaviour is the flags and structured error path.
Tests (tests/test_load_session_cli.py, 13 tests):
- TestDirectoryFlagParity (2): --directory works + fallback to CWD/.port_sessions
- TestOutputFormatFlagParity (2): json schema + text-mode backward compat
- TestNotFoundTypedError (2): JSON envelope on not-found; no traceback in
either mode; no internal class name leak
- TestLoadFailedDistinctFromNotFound (1): corrupted file = session_load_failed
with retryable=true, distinct from session_not_found
- TestTripletParityConsistency (6): parametrised over [list, delete, load] *
[--directory, --output-format] — explicit parity guard for future regressions
Full suite: 80/80 passing, zero regression.
Discovered via Jobdori dogfood sweep 2026-04-22 17:44 KST — ran
'claw load-session nonexistent' expecting a clean error, got a Python
traceback. Filed #165 + fixed in same commit.
Closes ROADMAP #165.
#163: run_turn_loop no longer injects f'{prompt} [turn N]' into follow-up
prompts. The suffix was never defined or interpreted anywhere — not by the
engine, not by the system prompt, not by any LLM. It looked like a real
user-typed annotation in the transcript and made replay/analysis fragile.
New behaviour:
- turn 0 submits the original prompt (unchanged)
- turn > 0 submits caller-supplied continuation_prompt if provided, else
the loop stops cleanly — no fabricated user turn
- added continuation_prompt: str | None = None parameter to run_turn_loop
- added --continuation-prompt CLI flag for claws scripting multi-turn loops
- zero '[turn' strings ever appear in mutable_messages or stdout now
Behaviour change for existing callers:
- Before: run_turn_loop(prompt, max_turns=3) submitted 3 turns
('prompt', 'prompt [turn 2]', 'prompt [turn 3]')
- After: run_turn_loop(prompt, max_turns=3) submits 1 turn ('prompt')
- To preserve old multi-turn behaviour, pass continuation_prompt='Continue.'
or any structured follow-up text
One existing timeout test (test_budget_is_cumulative_across_turns) updated
to pass continuation_prompt so the cumulative-budget contract is actually
exercised across turns instead of trivially satisfied by a one-turn loop.
#164 filed: addresses reviewer feedback on #161. The wall-clock timeout
bounds the caller-facing wait, but the underlying submit_message worker
thread keeps running and can mutate engine state after the timeout
TurnResult is returned. A cooperative cancel_event pattern is sketched in
the pinpoint; real asyncio.Task.cancel() support will come once provider
IO is async-native (larger refactor).
Tests (tests/test_run_turn_loop_continuation.py, 8 tests):
- TestNoTurnSuffixInjection (2): zero '[turn' strings in any submitted
prompt, both default and explicit-continuation paths
- TestContinuationDefaultStopsAfterTurnZero (2): default loops run exactly
one turn; engine.submit_message called exactly once despite max_turns=10
- TestExplicitContinuationBehaviour (2): turn 0 = original, turn N = continuation
verbatim; max_turns still respected
- TestCLIContinuationFlag (2): CLI default emits only '## Turn 1';
--continuation-prompt wires through to multi-turn behaviour
Full suite: 67/67 passing.
Closes ROADMAP #163. Files #164.
Previously, QueryEnginePort.submit_message() checked the token budget AFTER
appending the prompt to mutable_messages, transcript_store, and permission_denials,
and AFTER calling compact_messages_if_needed(). On overflow it set
stop_reason='max_budget_reached' but the overflow turn was already committed.
Any caller that persisted the session afterwards wrote the rejected prompt to
disk — the session was silently poisoned even though the TurnResult said the
turn never completed.
Fix:
- Restructure submit_message so the budget check early-returns BEFORE any
mutation of mutable_messages, transcript_store, permission_denials, or
total_usage.
- The returned TurnResult.usage reflects pre-call state (overflow never
advanced the usage counter).
- Normal (in-budget) path unchanged: mutation happens exactly once, at the
end, only on 'completed' results.
This closes the atomicity gap: submit_message is now either 'turn committed'
(stop_reason='completed') or 'turn rejected, state untouched'
(stop_reason in {'max_budget_reached', 'max_turns_reached'}). Callers can
safely retry with a fresh budget or a smaller prompt without worrying about
phantom committed turns from prior rejections.
Tests (tests/test_submit_message_budget.py, 10 tests):
- TestBudgetOverflowDoesNotMutate (5): mutable_messages / transcript /
permission_denials / total_usage / TurnResult.usage all pre-mutation after overflow
- TestOverflowPersistence (2): first-turn overflow persists empty session;
successful-turn-then-overflow persists only the successful turn
- TestEngineUsableAfterOverflow (2): subsequent in-budget call still works
with no residue; repeated overflows don't accumulate hidden state
- TestNormalPathStillCommits (1): regression guard — non-overflow path still
commits mutable_messages/transcript/usage as expected
Full suite: 59/59 passing, zero regression.
Blocker: none. Closes ROADMAP #162.
Previously, run_turn_loop was bounded only by max_turns (turn count). If
engine.submit_message stalled — slow provider, hung network, infinite
stream — the loop blocked indefinitely with no cancellation path. Claws
calling run_turn_loop in CI or orchestration had no reliable way to
enforce a deadline; the loop would hang until OS kill or human intervention.
Fix:
- Add timeout_seconds parameter to run_turn_loop (default None = legacy unbounded).
- When set, each submit_message call runs inside a ThreadPoolExecutor and is
bounded by the remaining wall-clock budget (total across all turns, not per-turn).
- On timeout, synthesize a TurnResult with stop_reason='timeout' carrying the
turn's prompt and routed matches so transcripts preserve orchestration context.
- Exhausted/negative budget short-circuits before calling submit_message.
- Legacy path (timeout_seconds=None) bypasses the executor entirely — zero
overhead for callers that don't opt in.
CLI:
- Added --timeout-seconds flag to 'turn-loop' command.
- Exit code 2 when the loop terminated on timeout (vs 0 for completed),
so shell scripts can distinguish 'done' from 'budget exhausted'.
Tests (tests/test_run_turn_loop_timeout.py, 6 tests):
- Legacy unbounded path unchanged (timeout_seconds=None never emits 'timeout')
- Hung submit_message aborted within budget (0.3s budget, 5s mock hang → exit <1.5s)
- Budget is cumulative across turns (0.6s budget, 0.4s per turn, not per-turn)
- timeout_seconds=0 short-circuits first turn without calling submit_message
- Negative timeout treated as exhausted (guard against caller bugs)
- Timeout TurnResult carries correct prompt, matches, UsageSummary shape
Full suite: 49/49 passing, zero regression.
Blocker: none. Closes ROADMAP #161.
- list_sessions(directory=None) -> list[str]: enumerate stored session IDs
- session_exists(session_id, directory=None) -> bool: check existence without FileNotFoundError
- delete_session(session_id, directory=None) -> bool: unlink a session file
- load_session now raises typed SessionNotFoundError (subclass of KeyError) instead of FileNotFoundError
- Claws can now manage session lifecycle without reaching past the module to glob filesystem
Closes ROADMAP #160. Acceptance: claw can call list_sessions(), session_exists(id), delete_session(id) without importing Path or knowing .port_sessions/<id>.json layout.
Reject empty --allowedTools inputs instead of treating them as an empty restriction, and surface status JSON metadata that distinguishes default unrestricted tools from flag-provided allow lists.
Confidence: high
Scope-risk: narrow
Tested: cargo test -p rusty-claude-cli rejects_empty_allowed_tools_flag -- --nocapture
Tested: cargo test -p tools allowed_tools_rejects_empty_token_lists -- --nocapture
Tested: cargo check -p rusty-claude-cli -p tools
Tested: cargo test -p rusty-claude-cli -p tools
Not-tested: full workspace cargo fmt --check is blocked by pre-existing unrelated formatting drift
Worker boot could previously stall on an interactive MCP/tool permission prompt while readiness and startup-timeout surfaces only had generic idle/no-evidence shapes. This adds a first-class blocked lifecycle state, structured event payload, startup evidence fields, and regression coverage so callers can report the exact server/tool gate instead of pane-scraping.
Constraint: ROADMAP #200 requires tool/server identity, prompt age, and session-only versus always-allow capability in status/evidence surfaces
Rejected: Treat MCP/tool prompts as trust gates | conflates distinct prompts and loses tool identity
Rejected: Leave allow-scope as pane text only | clawhip still could not classify the blocker without scraping
Confidence: high
Scope-risk: moderate
Directive: Keep tool_permission_required distinct from trust_required; downstream claws rely on server/tool payload plus allow-scope metadata
Tested: cargo test -p runtime tool_permission
Tested: cargo fmt -p runtime -- --check && cargo clippy -p runtime --all-targets -- -D warnings && cargo test -p runtime
Tested: cargo test --workspace
Not-tested: live interactive MCP permission prompt in tmux
The pull brought the branch current with origin/main while replaying local follow-up work. Conflict resolution kept the roadmap/progress additions and integrated the runtime event/trust changes with upstream's newer surfaces.
The trust allowlist now treats worktree_pattern as an additional required predicate, including the missing-worktree case, so auto-trust cannot fall back to cwd-only matching when a worktree constraint was declared. The runtime formatting cleanup keeps clippy/fmt green after the merge.
Constraint: Local branch was 109 commits behind origin/main with dirty tracked follow-up work.
Rejected: Drop the autostash after conflict resolution | keeping it preserves a reversible safety backup for unrelated recovery.
Confidence: high
Scope-risk: moderate
Directive: Do not relax worktree_pattern matching without preserving the missing-worktree regression.
Tested: git diff --cached --check; cargo fmt -p runtime -- --check; cargo clippy -p runtime --all-targets -- -D warnings; cargo test -p runtime; cargo test --workspace; architect verification approved
Not-tested: Live tmux/worker auto-trust behavior outside unit/integration tests
## Gap
#77 Phase 1 added machine-readable error kind discriminants and #156 extended
them to text-mode output. However, the hint field is still prose derived from
splitting existing error text — not a stable registry-backed remediation
contract.
Downstream claws inspecting the hint field still need to parse human wording
to decide whether to retry, escalate, or terminate.
## Fix Shape
1. Remediation registry: remediation_for(kind, operation) -> Remediation struct
with action (retry/escalate/terminate/configure), target, and stable message
2. Stable hint outputs per error class (no more prose splitting)
3. Golden fixture tests replacing split_error_hint() string hacks
## Source
gaebal-gajae dogfood sweep 2026-04-22 05:30 KST
## Problem
#77 Phase 1 added machine-readable error `kind` discriminants to JSON error
payloads. Text-mode (stderr) errors still emit prose-only output with no
structured classification.
Observability tools (log aggregators, CI error parsers) parsing stderr can't
distinguish error classes without regex-scraping the prose.
## Fix
Added `[error-kind: <class>]` prefix line to all text-mode error output.
The prefix appears before the error prose, making it immediately parseable by
line-based log tools without any substring matching.
**Examples:**
## Impact
- Stderr observers (log aggregators, CI systems) can now parse error class
from the first line without regex or substring scraping
- Same classifier function used for JSON (#77 P1) and text modes
- Text-mode output remains human-readable (error prose unchanged)
- Prefix format follows syslog/structured-logging conventions
## Tests
All 179 rusty-claude-cli tests pass. Verified on 3 different error classes.
Closes ROADMAP #156.
## Problem
All JSON error payloads had the same three-field envelope:
```json
{"type": "error", "error": "<prose with hint baked in>"}
```
Five distinct error classes were indistinguishable at the schema level:
- missing_credentials (no API key)
- missing_worker_state (no state file)
- session_not_found / session_load_failed
- cli_parse (unrecognized args)
- invalid_model_syntax
Downstream claws had to regex-scrape the prose to route failures.
## Fix
1. **Added `classify_error_kind()`** — prefix/keyword classifier that returns a
snake_case discriminant token for 12 known error classes:
`missing_credentials`, `missing_manifests`, `missing_worker_state`,
`session_not_found`, `session_load_failed`, `no_managed_sessions`,
`cli_parse`, `invalid_model_syntax`, `unsupported_command`,
`unsupported_resumed_command`, `confirmation_required`, `api_http_error`,
plus `unknown` fallback.
2. **Added `split_error_hint()`** — splits multi-line error messages into
(short_reason, optional_hint) so the runbook prose stops being stuffed
into the `error` field.
3. **Extended JSON envelope** at 4 emit sites:
- Main error sink (line ~213)
- Session load failure in resume_session
- Stub command (unsupported_command)
- Unknown resumed command (unsupported_resumed_command)
## New JSON shape
```json
{
"type": "error",
"error": "short reason (first line)",
"kind": "missing_credentials",
"hint": "Hint: export ANTHROPIC_API_KEY..."
}
```
`kind` is always present. `hint` is null when no runbook follows.
`error` now carries only the short reason, not the full multi-line prose.
## Tests
Added 2 new regression tests:
- `classify_error_kind_returns_correct_discriminants` — all 9 known classes + fallback
- `split_error_hint_separates_reason_from_runbook` — with and without hints
All 179 rusty-claude-cli tests pass. Full workspace green.
Closes ROADMAP #77 Phase 1.
## Problem
Two session error messages advertised `.claw/sessions/` as the managed-session
location, but the actual on-disk layout is `.claw/sessions/<workspace_fingerprint>/`
where the fingerprint is a 16-char FNV-1a hash of the CWD path.
Users see error messages like:
```
no managed sessions found in .claw/sessions/
```
But the real directory is:
```
.claw/sessions/8497f4bcf995fc19/
```
The error copy was a direct lie — it made workspace-fingerprint partitioning
invisible and left users confused about whether sessions were lost or just in
a different partition.
## Fix
Updated two error formatters to accept the resolved `sessions_root` path
and extract the actual workspace-fingerprint directory:
1. **format_missing_session_reference**: now shows the actual fingerprint dir
and explains that it's a workspace-specific partition
2. **format_no_managed_sessions**: now shows the actual fingerprint dir and
includes a note that sessions from other CWDs are intentionally invisible
Updated all three call sites to pass `&self.sessions_root` to the formatters.
## Examples
**Before:**
```
no managed sessions found in .claw/sessions/
```
**After:**
```
no managed sessions found in .claw/sessions/8497f4bcf995fc19/
Start `claw` to create a session, then rerun with `--resume latest`.
Note: claw partitions sessions per workspace fingerprint; sessions from other CWDs are invisible.
```
```
session not found: nonexistent-id
Hint: managed sessions live in .claw/sessions/8497f4bcf995fc19/ (workspace-specific partition).
Try `latest` for the most recent session or `/session list` in the REPL.
```
## Impact
- Users can now tell from the error message that they're looking in the right
directory (the one their current CWD maps to)
- The workspace-fingerprint partitioning stops being invisible
- Operators understand why sessions from adjacent CWDs don't appear
- Error copy matches the actual on-disk structure
## Tests
All 466 runtime tests pass. Verified on two real workspaces with actual
workspace-fingerprint directories.
Closes ROADMAP #80.
## Problem
Three interactive slash commands are documented in `claw --help` but have no
corresponding section in USAGE.md:
- `/ultraplan [task]` — Run a deep planning prompt with multi-step reasoning
- `/teleport <symbol-or-path>` — Jump to a file or symbol by searching the workspace
- `/bughunter [scope]` — Inspect the codebase for likely bugs
New users see these commands in the help output but don't know:
- What each command does
- How to use it
- When to use it vs. other commands
- What kind of results to expect
## Fix
Added new section "Advanced slash commands (Interactive REPL only)" to USAGE.md
with documentation for all three commands:
1. **`/ultraplan`** — multi-step reasoning for complex tasks
- Example: `/ultraplan refactor the auth module to use async/await`
- Output: structured plan with numbered steps and reasoning
2. **`/teleport`** — navigate to a file or symbol
- Example: `/teleport UserService`, `/teleport src/auth.rs`
- Output: file content with the requested symbol highlighted
3. **`/bughunter`** — scan for likely bugs
- Example: `/bughunter src/handlers`, `/bughunter` (all)
- Output: list of suspicious patterns with explanations
## Impact
Users can now discover these commands and understand when to use them without
having to guess or search external sources. Bridges the gap between `--help`
output and full documentation.
Also filed ROADMAP #155 documenting the gap.
Closes ROADMAP #155.
## Problem
When a user types `claw --model gpt-4` or `--model qwen-plus`, they get:
```
error: invalid model syntax: 'gpt-4'. Expected provider/model (e.g., anthropic/claude-opus-4-6) or known alias
```
USAGE.md documents that "The error message now includes a hint that names the detected env var" — but this hint does not actually exist. The user has to re-read USAGE.md or guess the correct prefix.
## Fix
Enhance `validate_model_syntax` to detect when a model name looks like it belongs to a different provider:
1. **OpenAI models** (starts with `gpt-` or `gpt_`):
```
Did you mean `openai/gpt-4`? (Requires OPENAI_API_KEY env var)
```
2. **Qwen/DashScope models** (starts with `qwen`):
```
Did you mean `qwen/qwen-plus`? (Requires DASHSCOPE_API_KEY env var)
```
3. **Grok/xAI models** (starts with `grok`):
```
Did you mean `xai/grok-3`? (Requires XAI_API_KEY env var)
```
Unrelated invalid models (e.g., `asdfgh`) do not get a spurious hint.
## Verification
- `claw --model gpt-4` → hints `openai/gpt-4` + `OPENAI_API_KEY`
- `claw --model qwen-plus` → hints `qwen/qwen-plus` + `DASHSCOPE_API_KEY`
- `claw --model grok-3` → hints `xai/grok-3` + `XAI_API_KEY`
- `claw --model asdfgh` → generic error (no hint)
## Tests
Added 3 new assertions in `parses_multiple_diagnostic_subcommands`:
- GPT model error hints openai/ prefix and OPENAI_API_KEY
- Qwen model error hints qwen/ prefix and DASHSCOPE_API_KEY
- Unrelated models don't get a spurious hint
All 177 rusty-claude-cli tests pass.
Closes ROADMAP #154.
## Problem
Users frequently ask after building:
- "Where is the claw binary?"
- "Did the build actually work?"
- "Why can't I run \`claw\` from anywhere?"
This happens because \`cargo build\` puts the binary in \`rust/target/debug/claw\`
(or \`rust/target/release/claw\`), and new users don't know:
1. Where to find it
2. How to test it
3. How to add it to PATH (optional but common follow-up)
## Fix
Added new section "Post-build: locate the binary and verify" to README covering:
1. **Binary location table:** debug vs. release, macOS/Linux vs. Windows paths
2. **Verification commands:** Test the binary with \`--help\` and \`doctor\`
3. **Three ways to add to PATH:**
- Symlink (macOS/Linux): \`ln -s ... /usr/local/bin/claw\`
- cargo install: \`cargo install --path . --force\`
- Shell profile update: add rust/target/debug to \$PATH
4. **Troubleshooting:** Common errors ("command not found", "permission denied",
debug vs. release build speed)
## Impact
New users can now:
- Find the binary immediately after build
- Run it and verify with \`claw doctor\`
- Know their options for system-wide access
Also filed ROADMAP #153 documenting the gap.
Closes ROADMAP #153.
## Problem
Users commonly type `claw doctor --json`, `claw status --json`, or
`claw system-prompt --json` expecting JSON output. These fail with
`unrecognized argument \`--json\` for subcommand` with no hint that
`--output-format json` is the correct flag.
## Discovery
Filed as #152 during 21:17 dogfood nudge. The #127 worktree contained
a more comprehensive patch but conflicted with #141 (unified --help).
On re-investigation of main, Bugs 1 and 3 from #127 are already closed
(positional arg rejection works, no double "error:" prefix). Only
Bug 2 (the `--json` hint) remained.
## Fix
Two call sites add the hint:
1. `parse_single_word_command_alias`'s diagnostic-verb suffix path:
when rest[1] == "--json", append "Did you mean \`--output-format json\`?"
2. `parse_system_prompt_options` unknown-option path: same hint when
the option is exactly `--json`.
## Verification
Before:
$ claw doctor --json
error: unrecognized argument `--json` for subcommand `doctor`
Run `claw --help` for usage.
After:
$ claw doctor --json
error: unrecognized argument `--json` for subcommand `doctor`
Did you mean `--output-format json`?
Run `claw --help` for usage.
Covers: `doctor --json`, `status --json`, `sandbox --json`,
`system-prompt --json`, and any other diagnostic verb that routes
through `parse_single_word_command_alias`.
Other unrecognized args (`claw doctor garbage`) correctly don't
trigger the hint.
## Tests
- 2 new assertions in `parses_multiple_diagnostic_subcommands`:
- `claw doctor --json` produces hint
- `claw doctor garbage` does NOT produce hint
- 177 rusty-claude-cli tests pass
- Workspace tests green
Closes ROADMAP #152.
Filed from nudge directive at 21:17 KST. Implementation exists on worktree
`jobdori-127-verb-suffix` but needs rebase due to merge with #141.
Ready for Phase 1 implementation once conflicts resolved.
## Problem
`workspace_fingerprint(path)` hashes the raw path string without
canonicalization. Two equivalent paths (e.g. `/tmp/foo` vs
`/private/tmp/foo` on macOS) produce different fingerprints and
therefore different session stores. #150 fixed the test-side symptom;
this fixes the underlying product contract.
## Discovery path
#150 fix (canonicalize in test) was a workaround. Q's ack on #150
surfaced the deeper gap: the function itself is still fragile for
any caller passing a non-canonical path:
1. Embedded callers with a raw `--data-dir` path
2. Programmatic `SessionStore::from_cwd(user_path)` calls
3. NixOS store paths, Docker bind mounts, case-insensitive normalization
The REPL's default flow happens to work because `env::current_dir()`
returns canonical paths on macOS. But any caller passing a raw path
risks silent session-store divergence.
## Fix
Canonicalize inside `SessionStore::from_cwd()` and `from_data_dir()`
before computing the fingerprint. Kept `workspace_fingerprint()` itself
as a pure function for determinism — canonicalization is the entry
point's responsibility.
```rust
let canonical_cwd = fs::canonicalize(cwd).unwrap_or_else(|_| cwd.to_path_buf());
let sessions_root = canonical_cwd.join(".claw").join("sessions").join(workspace_fingerprint(&canonical_cwd));
```
Falls back to the raw path if canonicalize fails (directory doesn't
exist yet).
## Test-side updates
Three legacy-session tests expected the non-canonical base path to
match the store's workspace_root. Updated them to canonicalize
`base` after creation — same defensive pattern as #150, now
explicit across all three tests.
## Regression test
Added `session_store_from_cwd_canonicalizes_equivalent_paths` that
creates two stores from equivalent paths (raw vs canonical) and
asserts they resolve to the same sessions_dir.
## Verification
- `cargo test -p runtime session_store_` — 9/9 pass
- `cargo test --workspace` — all green, no FAILED markers
- No behavior change for existing users (REPL default flow already
used canonical paths)
## Backward compatibility
Users on macOS who always went through `env::current_dir()`:
no hash change, sessions resume identically.
Users who ever called with a non-canonical path: hash would change,
but those sessions were already broken (couldn't be resumed from a
canonical-path cwd). Net improvement.
Closes ROADMAP #151.
## #150 Fix: resume_latest test flake
**Problem:** `resume_latest_restores_the_most_recent_managed_session` intermittently
fails when run in the workspace suite or multiple times in sequence, but passes in
isolation.
**Root cause:** `workspace_fingerprint(path)` hashes the path string without
canonicalization. On macOS, `/tmp` is a symlink to `/private/tmp`. The test
creates a temp dir via `std::env::temp_dir().join(...)` which returns
`/var/folders/...` (non-canonical). When the subprocess spawns,
`env::current_dir()` returns the canonical path `/private/var/folders/...`.
The two fingerprints differ, so the subprocess looks in
`.claw/sessions/<hash1>` while files are in `.claw/sessions/<hash2>`.
Session discovery fails.
**Fix:** Call `fs::canonicalize(&project_dir)` after creating the directory
to ensure test and subprocess use identical path representations.
**Verification:** 5 consecutive runs of the full test suite — all pass.
Previously: 5/5 failed when run in sequence.
## #246 Filing: Reminder cron outcome ambiguity (control-loop blocker)
The `clawcode-dogfood-cycle-reminder` cron times out repeatedly with no
structured feedback on whether the nudge was delivered, skipped, or died in-flight.
**Phase 1 outcome schema** — add explicit field to cron result:
- `delivered` — nudge posted to Discord
- `timed_out_before_send` — died before posting
- `timed_out_after_send` — posted but cleanup timed out
- `skipped_due_to_active_cycle` — previous cycle active
- `aborted_gateway_draining` — daemon shutdown
Assigned to gaebal-gajae (cron/orchestration domain). Unblocks trustworthy
dogfood cycle observability.
Closes ROADMAP #150. Filed ROADMAP #246.
## Problem
`runtime::config::tests::validates_unknown_top_level_keys_with_line_and_field_name`
intermittently fails during `cargo test --workspace` (witnessed during
#147 and #148 workspace runs) but passes deterministically in isolation.
Example failure from workspace run:
test result: FAILED. 464 passed; 1 failed
## Root cause
`runtime/src/config.rs::tests::temp_dir()` used nanosecond timestamp
alone for namespace isolation:
std::env::temp_dir().join(format!("runtime-config-{nanos}"))
Under parallel test execution on fast machines with coarse clock
resolution, two tests start within the same nanosecond bucket and
collide on the same path. One test's `fs::remove_dir_all(root)` then
races another's in-flight `fs::create_dir_all()`.
Other crates already solved this pattern:
- plugins::tests::temp_dir(label) — label-parameterized
- runtime::git_context::tests::temp_dir(label) — label-parameterized
runtime/src/config.rs was missed.
## Fix
Added process id + monotonically-incrementing atomic counter to the
namespace, making every callsite provably unique regardless of clock
resolution or scheduling:
static COUNTER: AtomicU64 = AtomicU64::new(0);
let pid = std::process::id();
let seq = COUNTER.fetch_add(1, Ordering::Relaxed);
std::env::temp_dir().join(format!("runtime-config-{pid}-{nanos}-{seq}"))
Chose counter+pid over the label-parameterized pattern to avoid
touching all 20 callsites in the same commit (mechanical noise with
no added safety — counter alone is sufficient).
## Verification
Before: one failure per workspace run (config test flake).
After: 5 consecutive `cargo test --workspace` runs — zero config
test failures. Only pre-existing `resume_latest` flake remains
(orthogonal, unrelated to this change).
for i in 1 2 3 4 5; do cargo test --workspace; done
# All 5 runs: config tests green. Only resume_latest flake appears.
cargo test -p runtime
# 465 passed; 0 failed
## ROADMAP.md
Added Pinpoint #149 documenting the gap, root cause, and fix.
Closes ROADMAP #149.
## Scope
Two deltas in one commit:
### #128 closure (docs)
Re-verified on main HEAD `4cb8fa0`: malformed `--model` strings already
rejected at parse time (`validate_model_syntax` in parse_args). All
historical repro cases now produce specific errors:
claw --model '' → error: model string cannot be empty
claw --model 'bad model' → error: invalid model syntax: 'bad model' contains spaces
claw --model 'sonet' → error: invalid model syntax: 'sonet'. Expected provider/model or known alias
claw --model '@invalid' → error: invalid model syntax: '@invalid'. Expected provider/model ...
claw --model 'totally-not-real-xyz' → error: invalid model syntax: ...
claw --model sonnet → ok, resolves to claude-sonnet-4-6
claw --model anthropic/claude-opus-4-6 → ok, passes through
Marked #128 CLOSED in ROADMAP with repro block. Residual provenance gap
split off as #148.
### #148 implementation
**Problem.** After #128 closure, `claw status --output-format json`
still surfaces only the resolved model string. No way for a claw to
distinguish whether `claude-sonnet-4-6` came from `--model sonnet`
(alias resolution) vs `--model claude-sonnet-4-6` (pass-through) vs
`ANTHROPIC_MODEL` env vs `.claw.json` config vs compiled-in default.
Debug forensics had to re-read argv instead of reading a structured
field. Clawhip orchestrators sending `--model` couldn't confirm the
flag was honored vs falling back to default.
**Fix.** Added two fields to status JSON envelope:
- `model_source`: "flag" | "env" | "config" | "default"
- `model_raw`: user's input before alias resolution (null on default)
Text mode appends a `Model source` line under `Model`, showing the
source and raw input (e.g. `Model source flag (raw: sonnet)`).
**Resolution order** (mirrors resolve_repl_model but with source
attribution):
1. If `--model` / `--model=` flag supplied → source: flag, raw: flag value
2. Else if ANTHROPIC_MODEL set → source: env, raw: env value
3. Else if `.claw.json` model key set → source: config, raw: config value
4. Else → source: default, raw: null
## Changes
### rust/crates/rusty-claude-cli/src/main.rs
- Added `ModelSource` enum (Flag/Env/Config/Default) with `as_str()`.
- Added `ModelProvenance` struct (resolved, raw, source) with
three constructors: `default_fallback()`, `from_flag(raw)`, and
`from_env_or_config_or_default(cli_model)`.
- Added `model_flag_raw: Option<String>` field to `CliAction::Status`.
- Parse loop captures raw input in `--model` and `--model=` arms.
- Extended `parse_single_word_command_alias` to thread
`model_flag_raw: Option<&str>` through.
- Extended `print_status_snapshot` signature to accept
`model_flag_raw: Option<&str>`. Resolves provenance at dispatch time
(flag provenance from arg; else probe env/config/default).
- Extended `status_json_value` signature with
`provenance: Option<&ModelProvenance>`. On Some, adds `model_source`
and `model_raw` fields; on None (legacy resume paths), omits them
for backward compat.
- Extended `format_status_report` signature with optional provenance.
On Some, renders `Model source` line after `Model`.
- Updated all existing callers (REPL /status, resume /status, tests)
to pass None (legacy paths don't carry flag provenance).
- Added 2 regression assertions in parse_args test covering both
`--model sonnet` and `--model=...` forms.
### ROADMAP.md
- Marked #128 CLOSED with re-verification block.
- Filed #148 documenting the provenance gap split, fix shape, and
acceptance criteria.
## Live verification
$ claw --model sonnet --output-format json status | jq '{model,model_source,model_raw}'
{"model": "claude-sonnet-4-6", "model_source": "flag", "model_raw": "sonnet"}
$ claw --output-format json status | jq '{model,model_source,model_raw}'
{"model": "claude-opus-4-6", "model_source": "default", "model_raw": null}
$ ANTHROPIC_MODEL=haiku claw --output-format json status | jq '{model,model_source,model_raw}'
{"model": "claude-haiku-4-5-20251213", "model_source": "env", "model_raw": "haiku"}
$ echo '{"model":"claude-opus-4-7"}' > .claw.json && claw --output-format json status | jq '{model,model_source,model_raw}'
{"model": "claude-opus-4-7", "model_source": "config", "model_raw": "claude-opus-4-7"}
$ claw --model sonnet status
Status
Model claude-sonnet-4-6
Model source flag (raw: sonnet)
Permission mode danger-full-access
...
## Tests
- rusty-claude-cli bin: 177 tests pass (2 new assertions for #148)
- Full workspace green except pre-existing resume_latest flake (unrelated)
Closes ROADMAP #128, #148.