mirror of
https://github.com/ultraworkers/claw-code.git
synced 2026-04-29 00:02:01 +08:00
file: #161 — run_turn_loop has no wall-clock timeout, stalled turn blocks indefinitely
parent
6db68a2baa
commit
b86f9b6f17
100
ROADMAP.md
@@ -6158,99 +6158,23 @@ load_session('nonexistent') # raises FileNotFoundError with no structured error
**Blocker.** None.
**Source.** Jobdori dogfood sweep 2026-04-22 08:46 KST — inspected `src/session_store.py` public API, confirmed only `save_session` + `load_session` present, no list/delete/exists surface.
200. **Interactive MCP/tool permission prompts are invisible blockers** — **done (verified 2026-04-27):** worker boot observation now detects interactive tool permission gates such as `Allow the omx_memory MCP server to run tool "project_memory_read"?` before generic readiness/idle handling, records `tool_permission_required` status, emits a structured `ToolPermissionPrompt` payload with server/tool identity, prompt age, allow-scope capability, and prompt preview, marks readiness snapshots as blocked, and carries `tool_permission_prompt_detected` through startup timeout evidence so the classifier returns `tool_permission_required` instead of a vague stale/idle/ready outcome. Regression coverage locks both the structured prompt-gate event metadata and startup-timeout classification paths. **Original filing below.**
Original filing (2026-04-18): the session emitted `SessionStart hook (completed)` and `UserPromptSubmit hook (completed)`, then stalled on an interactive MCP permission gate (`Allow the omx_memory MCP server to run tool "project_memory_read"?`). From the outside this looks like a ready-but-quiet lane even though the real state is `blocked waiting for permission`. **Required fix shape:** (a) detect interactive MCP/tool permission prompts as a first-class blocked state instead of generic idle; (b) emit a typed event such as `blocked.mcp_permission` / `blocked.tool_permission` with tool/server name, prompt age, and whether the gate is session-only vs always-allow capable; (c) include this gate in startup/no-evidence evidence bundles and lane status surfaces so clawhip can say "blocked at MCP permission prompt" without pane scraping; (d) add a regression proving a prompt-gated session does not get misclassified as stale/idle/ready. **Why this matters:** prompt acceptance and startup telemetry are still incomplete if an interactive MCP gate can eat the first real action after hooks report success. Source: live dogfood session `clawcode-human` on 2026-04-18.
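A minimal Python sketch of fix shapes (a) and (b), assuming the gate text always follows the `Allow the <server> MCP server to run tool "<tool>"?` wording quoted above. The `ToolPermissionPrompt` field names, `detect_tool_permission_prompt`, and `classify_lane` are hypothetical illustrations, not the shipped implementation.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed prompt wording, taken from the example quoted in this item.
PROMPT_RE = re.compile(
    r'Allow the (?P<server>\S+) MCP server to run tool "(?P<tool>[^"]+)"\?'
)


@dataclass
class ToolPermissionPrompt:
    # Hypothetical payload shape; the real field names may differ.
    server: str
    tool: str
    prompt_age_seconds: float
    preview: str


def detect_tool_permission_prompt(
    observed_text: str, prompt_age_seconds: float
) -> Optional[ToolPermissionPrompt]:
    """Return a structured prompt payload if the text contains a permission gate."""
    match = PROMPT_RE.search(observed_text)
    if match is None:
        return None
    return ToolPermissionPrompt(
        server=match["server"],
        tool=match["tool"],
        prompt_age_seconds=prompt_age_seconds,
        preview=match.group(0),
    )


def classify_lane(observed_text: str, prompt_age_seconds: float) -> str:
    """Check for a permission gate before falling back to generic idle handling."""
    if detect_tool_permission_prompt(observed_text, prompt_age_seconds) is not None:
        return "tool_permission_required"
    return "idle"
```

Checking the gate first is the point of the ordering: a prompt-gated session never reaches the generic idle branch.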
201. **`extract --model-payload` is not inspectable enough for deterministic dogfood: forced mode selection missing, and hybrid/no-snippet cases are opaque** — dogfooded 2026-04-19 from `dogfood-1776184671` against three real-repo files. `node dist/cli/index.js extract <file> --model-payload` succeeded and auto-selected `raw`, `raw`, and `hybrid`, but there is currently no CLI surface to force `raw` / `compressed` / `hybrid` for A/B comparison: `--mode raw` and `--mode compressed` both fail immediately with `Error: Unexpected extract argument: --mode`. That turns payload-shaping validation into guesswork because operators cannot ask the extractor to render the same file through each mode and compare the exact output. The opacity is worse in the observed hybrid case: the Formbricks checkbox file produced a hybrid payload with no snippets, leaving no visible explanation for why the extractor chose hybrid, what evidence it kept vs dropped, or whether the result is correct vs a silent fallback. **Required fix shape:** (a) add an explicit debug/inspection flag that forces extraction mode (`--mode raw|compressed|hybrid` or equivalent) without changing default auto-selection; (b) print/report the chosen mode and the decision reason in a machine-readable field when `--model-payload` is used; (c) when hybrid emits zero snippets, surface an explicit reason/count summary instead of making "no snippets" indistinguishable from silent loss; (d) add regression coverage on at least one real-world hybrid fixture so mode choice and snippet accounting stay stable. **Why this matters:** direct claw-code dogfood needs deterministic payload comparison to debug startup/context quality; without forced-mode inspection and snippet accounting, operators can see the outcome but not the extraction decision that produced it. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
## Pinpoint #161. `run_turn_loop` has no wall-clock timeout — a stalled turn blocks indefinitely
202. **`extract --model-payload` emits `filePath` values that can walk outside the current repo root for external targets** — dogfooded 2026-04-19 from `dogfood-1776184671` while extracting files from sibling repos under `/home/bellman/Workspace/fooks-test-repos/...` with cwd anchored at the claw-code repo. In all three successful payloads (`raw`, `raw`, `hybrid`), the reported `filePath` became a relative path like `../../fooks-test-repos/...` that escapes the current repo root. Technically the path is still correct, but operationally it is a clawability gap: downstream consumers cannot tell whether this means "user intentionally extracted an external file", "path normalization leaked out of scope", or "the payload now references content outside the trusted working tree." That ambiguity is especially bad for model payloads because the `filePath` field looks like grounded provenance while actually encoding a cross-root escape. **Required fix shape:** (a) define a stable provenance contract for extracted targets outside cwd/repo root — for example an explicit `pathScope` / `targetRoot` field or an absolute-vs-relative policy instead of silently emitting `../..` escapes; (b) if relative paths are retained, add a machine-readable flag that the target is outside the current workspace/root; (c) document and test the normalization rule for sibling-repo extraction so downstream tooling does not mistake cross-root references for in-repo files; (d) add regression coverage for one in-repo fixture and one external-target fixture. **Why this matters:** model payload provenance should reduce ambiguity, not create a silent scope escape that later consumers have to reverse-engineer. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
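One possible shape for the provenance contract in (a) and (b), sketched in Python under stated assumptions: `pathScope` is the field name floated above, while the values `in_repo` / `external`, the `describe_target` helper, and the choice to emit absolute paths for external targets are all hypothetical.

```python
from pathlib import Path


def describe_target(target: str, cwd: str, repo_root: str) -> dict:
    """Report where an extracted file lives relative to the repo root.

    In-repo targets keep a repo-relative filePath; external targets get an
    absolute path plus an explicit scope flag instead of a '../..' escape.
    """
    root = Path(repo_root).resolve()
    resolved = (Path(cwd) / target).resolve()
    try:
        relative = resolved.relative_to(root)
    except ValueError:
        # The target is outside the repo root: flag it rather than hide it.
        return {"filePath": resolved.as_posix(), "pathScope": "external"}
    return {"filePath": relative.as_posix(), "pathScope": "in_repo"}
```

With this rule a downstream consumer can branch on `pathScope` alone and never has to infer scope from the number of `..` segments.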
**Gap.** `PortRuntime.run_turn_loop` (`src/runtime.py:154`) bounds execution only by `max_turns` (a turn count). There is no wall-clock deadline or per-turn timeout. If a single `engine.submit_message` call stalls (e.g., waiting on a slow or hung external provider, a network call with no timeout of its own, or an infinite LLM stream), the entire turn loop hangs with no structured signal, no cancellation path, and no timeout error returned to the caller.
203. **Successful dogfood runs can still end in a misleading TUI/pane failure banner (`skills/list failed in TUI`, `can't find pane`)** — dogfooded 2026-04-19 from `dogfood-1776184671`. The session completed real work and produced a coherent result summary, but immediately afterward the surface emitted `Error: skills/list failed in TUI` and `can't find pane: %4766`. That creates a truth-ordering bug: the user just watched a successful run, then the final visible state looks like a transport/UI failure with no indication whether the underlying task failed, the pane disappeared after completion, or an unrelated post-run TUI refresh crashed. **Required fix shape:** (a) separate task result state from post-run TUI/skills refresh failures so a completed run cannot be visually overwritten by a secondary pane-lookup error; (b) classify missing-pane-after-completion as a typed transport/UI degradation with phase context (`post_result_refresh`, `skills_list_refresh`, etc.) instead of a generic terminal error; (c) preserve and surface the last successful task outcome even if the TUI follow-up step fails; (d) add regression coverage for the path where a pane disappears after result rendering so the session is reported as `completed_with_ui_warning` rather than plain failure. **Why this matters:** claw-code needs the final visible truth to match the actual execution truth; otherwise successful dogfood looks flaky and operators cannot tell whether to trust the result they just got. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
**Repro (conceptual).** Wrap `engine.submit_message` with an artificial `time.sleep(9999)` and call `run_turn_loop` — it blocks forever. There is no `asyncio.wait_for`, `signal.alarm`, `concurrent.futures.TimeoutError`, or equivalent in the call path. `grep -n 'timeout\|deadline\|elapsed\|wall' src/runtime.py src/query_engine.py` returns zero results.
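The repro can be sketched in a bounded form. This is a stand-in, not the real `PortRuntime`: `StalledEngine`, the simplified `run_turn_loop`, and `loop_is_stuck` are hypothetical, and an `Event` replaces `time.sleep(9999)` so the demonstration itself can be released.

```python
import threading


class StalledEngine:
    """Hypothetical engine whose submit_message blocks until released."""

    def __init__(self) -> None:
        self.release = threading.Event()

    def submit_message(self, prompt: str) -> str:
        self.release.wait()  # stands in for time.sleep(9999)
        return "completed"


def run_turn_loop(engine: StalledEngine, prompt: str, max_turns: int = 3) -> list:
    # Simplified loop: bounded by turn count only, with no deadline anywhere.
    return [engine.submit_message(prompt) for _ in range(max_turns)]


def loop_is_stuck(engine: StalledEngine, wait_seconds: float = 0.2) -> bool:
    """Run the loop in a daemon thread and report whether it is still blocked."""
    worker = threading.Thread(
        target=run_turn_loop, args=(engine, "hello"), daemon=True
    )
    worker.start()
    worker.join(wait_seconds)
    return worker.is_alive()
```

`loop_is_stuck(StalledEngine())` is an external watchdog in miniature: the only way to observe the stall is from outside the loop, which is the gap this pinpoint describes.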
204. **Interactive work can start with updater/setup churn before the actual user task, blurring startup truth and first-action latency** — dogfooded 2026-04-19 from `clawcode-human`. Launching `omx` inside the claw-code worktree did not begin with the requested ROADMAP task; it first diverted through an update prompt (`Update available: v0.12.6 → v0.13.0. Update now? [Y/n]`), global install, full setup refresh, config rewrite/backups, notification/HUD setup, and a `Restart to use new code` notice before returning to the actual prompt. None of that was the operator’s requested work, but it consumed the critical startup window and mixed setup chatter with task-relevant execution. This creates a clawability gap: downstream observers cannot cleanly distinguish `startup succeeded and work began` from `startup mutated the environment and maybe changed the toolchain before work began`, and first-action latency gets polluted by maintenance side effects. **Required fix shape:** (a) make updater/setup detours a first-class startup phase with explicit classification (`startup.update_gate`, `startup.setup_refresh`) instead of letting them masquerade as normal task progress; (b) allow noninteractive or automation-oriented launches to suppress or defer update/setup churn until after the first user task/result boundary; (c) preserve a clean timestamped boundary between maintenance work and task work in lane events/status surfaces; (d) add regression coverage proving a prompt can start without forced updater/setup interposition when policy says "do work now." **Why this matters:** startup truth should reflect the user’s requested work, not hide it behind self-mutation and config churn that change latency, logs, and reproducibility before the first real action. Source: live dogfood session `clawcode-human` on 2026-04-19.
**Impact.** A claw calling `run_turn_loop` in a CI pipeline or orchestration harness has no reliable way to enforce a deadline. The loop will hang until the OS kills the process or a human intervenes. The caller cannot distinguish "still running" from "hung" without an external watchdog.
205. **Direct CLI dogfood is not self-starting when build artifacts are absent (`dist/cli/index.js` missing)** — dogfooded 2026-04-19 from `dogfood-1776184671`. The intended direct check was to run `node dist/cli/index.js extract ...`, but the first attempt hit a missing built artifact and the lane had to detour through `npm ci && npm run build` before any product behavior could be exercised. That means a "run the CLI directly in a fresh worktree" path is not actually one-step dogfoodable: the operator has to know the build prerequisite, spend time satisfying it, and then mentally separate build-system failures from product-surface failures. **Required fix shape:** (a) provide a supported direct-run entrypoint that either works from source without prebuilt `dist/` artifacts or emits a product-owned guidance error that names the exact one-shot bootstrap command; (b) surface build-artifact-missing as a typed startup/dependency prerequisite state rather than a raw module/file failure; (c) document and test the fresh-worktree direct-dogfood path so `extract --help` / `extract ... --model-payload` can be exercised without archaeology; (d) if build-on-demand is the intended contract, make it explicit and deterministic instead of requiring the operator to guess `npm ci && npm run build`. **Why this matters:** direct dogfood should fail on product behavior, not on hidden local build prerequisites that blur whether the tool is broken or merely unprepared. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
**Fix shape (~15 lines).**
1. Add an optional `timeout_seconds: float | None = None` parameter to `run_turn_loop`.
2. Use `concurrent.futures.ThreadPoolExecutor` + `Future.result(timeout=...)` (or `asyncio.wait_for` if the engine becomes async) to wrap each `submit_message` call.
3. On timeout, append a sentinel `TurnResult` with `stop_reason='timeout'` and break the loop.
4. Document the timeout contract: total wall-clock budget across all turns, not per-turn.
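The four steps above can be sketched as follows. This is a minimal illustration, not the actual `src/runtime.py` code: `TurnResult` is a reduced stand-in, the engine is passed in as a plain `submit_message` callable, and the "stop when `stop_reason` is not `completed`" rule is an assumption about the existing loop. Each call is given only the time remaining until a single deadline, which makes the budget total rather than per-turn.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TurnResult:
    # Hypothetical stand-in; the real TurnResult carries more fields.
    output: str
    stop_reason: str


def run_turn_loop(
    submit_message: Callable[[str], TurnResult],
    prompt: str,
    max_turns: int = 3,
    timeout_seconds: Optional[float] = None,
) -> list:
    """Run up to max_turns turns under one shared wall-clock budget."""
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    results = []
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        for _ in range(max_turns):
            # Remaining budget for this call; None means wait without limit.
            remaining = None
            if deadline is not None:
                remaining = max(0.0, deadline - time.monotonic())
            future = executor.submit(submit_message, prompt)
            try:
                result = future.result(timeout=remaining)
            except FutureTimeout:
                # Step 3: sentinel result, then stop the loop.
                results.append(TurnResult(output="", stop_reason="timeout"))
                break
            results.append(result)
            if result.stop_reason != "completed":  # assumed stop rule
                break
    finally:
        # Do not wait for a stalled worker; a `with` block would block here.
        executor.shutdown(wait=False)
    return results
```

One caveat with the thread-based approach: the timeout returns control to the caller, but Python cannot forcibly stop the worker thread, so a stalled `submit_message` call keeps running in the background until it returns on its own.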
206. **`extract --help` is not a safe/local help surface: after bootstrap it can still crash into a Node stack instead of rendering usage** — dogfooded 2026-04-19 from `dogfood-1776184671`. Even after repairing the missing-build-artifact prerequisite with `npm ci && npm run build`, the next expected low-risk probe `node dist/cli/index.js extract --help` did not cleanly print command help; it dropped into a Node failure at `dist/cli/index.js:52` and emitted a stack trace under `Node.js v25.1.0`. That means the help path itself is not trustworthy as a preflight surface: operators cannot rely on `--help` to discover flags or confirm command shape before doing real work, and they have to treat a basic introspection command like a potentially crashing code path. **Required fix shape:** (a) make `extract --help` and sibling help surfaces intercept locally before any heavier runtime path that can throw; (b) if a subcommand cannot render help because build/runtime prerequisites are missing, return a product-owned guidance error instead of a raw Node stack; (c) add regression coverage that `extract --help` succeeds in both a prepared worktree and a minimally bootstrapped one; (d) preserve the contract that help/usage discovery is the safest command family, not another execution path that can explode. **Why this matters:** help commands are supposed to reduce uncertainty; if they crash, dogfooders lose the cleanest way to learn the surface and every later failure gets harder to classify. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
**Acceptance.** `run_turn_loop(prompt, timeout_seconds=10)` raises `TimeoutError` (or returns a `TurnResult` with `stop_reason='timeout'`) within 10 seconds even if the underlying LLM call stalls indefinitely. `timeout_seconds=None` (default) preserves existing behaviour.
207. **Build/setup failures are being misclassified as generic missing-path shell errors in post-tool feedback** — dogfooded 2026-04-19 from `dogfood-1776184671`. When the lane attempted `node dist/cli/index.js extract --help` with no built artifact, the `PostToolUse` hook summarized it as ``Bash reported `command not found`, `permission denied`, or a missing file/path``, and later `npm run build` failed with actual TypeScript diagnostics (`TS2307: Cannot find module 'typescript'`, plus additional compile errors). Those are distinct failure classes — missing built artifact, missing dependency, and compile/typecheck red — but the feedback surface collapses them into the same mushy shell-triage bucket. That makes recovery slower because the operator has to reread raw pane output to learn whether the right next move is `npm ci`, fixing package deps, fixing TS errors, or checking file paths. **Required fix shape:** (a) classify post-tool failures with narrower machine-readable buckets such as `artifact_missing`, `dependency_missing`, `compile_error`, and reserve `missing_path` / `command_not_found` for the literal cases; (b) include the strongest observed diagnostic snippet (for example `TS2307 typescript missing`) in the structured feedback instead of only the broad shell rubric; (c) add regression coverage proving TypeScript/compiler failures are not surfaced as generic missing-path errors; (d) thread that typed classification into lane summaries so downstream claws can recommend the right recovery without pane archaeology. **Why this matters:** clawability depends on the fix suggestion matching the real failure class; broad shell-error mush turns easy recoveries into manual forensic work. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
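A rough Python sketch of the narrower buckets in (a) and the diagnostic snippet in (b). The bucket names come from this item. The regular expressions, their priority order, and the `classify_failure` helper are illustrative assumptions about what the raw output looks like.

```python
import re

# Ordered by priority: the first rule that matches any line wins.
RULES = [
    # A missing built file under dist/ (assumed Node "Cannot find module" shape).
    ("artifact_missing", re.compile(r"Cannot find module '[^']*dist/[^']*'")),
    # A missing package, e.g. TS2307 for 'typescript'.
    ("dependency_missing", re.compile(r"TS2307|Cannot find module")),
    # Any other TypeScript diagnostic.
    ("compile_error", re.compile(r"\berror TS\d+\b")),
    # Literal shell cases stay in their own buckets.
    ("command_not_found", re.compile(r"command not found")),
    ("missing_path", re.compile(r"No such file or directory")),
]


def classify_failure(output: str) -> dict:
    """Return the narrowest matching bucket plus the line that triggered it."""
    lines = output.splitlines()
    for bucket, pattern in RULES:
        for line in lines:
            if pattern.search(line):
                return {"bucket": bucket, "diagnostic": line.strip()}
    return {"bucket": "unknown", "diagnostic": ""}
```

Putting the specific rules ahead of the literal shell cases is the design choice that matters: a TypeScript failure can then never fall through to `missing_path`.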
**Blocker.** None.
208. **The JavaScript `extract` dogfood path has no dedicated preflight/doctor surface for its own prerequisites** — dogfooded 2026-04-19 from `dogfood-1776184671`. The repo already has strong Rust-side `claw doctor` / preflight coverage, but the direct JS CLI path I was actually dogfooding (`node dist/cli/index.js extract ...`) gave no equivalent early warning about its own prerequisites: missing `dist/cli/index.js`, missing `node_modules/typescript`, and the difference between "needs bootstrap" vs "real compile error" all had to be discovered by failing real commands in sequence. That means the lowest-friction way to validate the JS extract surface is still failure-driven archaeology rather than one explicit readiness check. **Required fix shape:** (a) add a lightweight JS-side preflight/doctor command or bootstrap check for the extract CLI path that reports artifact presence, dependency readiness, and build status before execution; (b) make that check machine-readable so lanes can say `js_extract_prereq_blocked` (or equivalent) instead of learning via stack traces; (c) document the direct dogfood path so operators know whether the supported sequence is `doctor -> help -> extract` or something else; (d) add regression coverage for a fresh worktree, a deps-missing worktree, and a ready worktree. **Why this matters:** preflight should collapse obvious prerequisite failures into one cheap truth surface instead of forcing dogfooders to burn turns discovering them one crash at a time. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
209. **`npm ci` can report a clean install while leaving the JS extract build path non-buildable (false-green bootstrap)** — dogfooded 2026-04-19 from `dogfood-1776184671`. The lane explicitly checked that `node_modules/typescript` was missing, then ran `npm ci`, which succeeded (`added 3 packages`, `found 0 vulnerabilities`), but the subsequent build path still surfaced a missing/invalid TypeScript toolchain situation instead of a clearly ready extract CLI bootstrap. From the operator side this is a false-green signal: the canonical package-manager bootstrap step says success, yet the next immediate action is still not reliably build-ready. Whether the root cause is missing declaration in `package.json`, lockfile drift, wrong dependency bucket, or build contract mismatch, the clawability gap is the same — `npm ci` success is not a trustworthy readiness signal for the JS extract path. **Required fix shape:** (a) define the exact dependency contract for the extract build path so `npm ci` alone yields a buildable state, or else emit an explicit follow-up requirement if another step is mandatory; (b) add a readiness assertion after install (for example checking required toolchain/deps like `typescript`) so bootstrap can fail closed instead of greenwashing; (c) add regression coverage that a clean install on a fresh worktree reaches a buildable/help-capable extract CLI state; (d) surface a typed `bootstrap_false_green` / `deps_incomplete_after_install` class when install succeeds but required build deps are still absent. **Why this matters:** bootstrap steps must mean what they say; a green install that leaves the next command red burns operator trust and makes every later failure harder to localize. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
210. **Updater says `Restart to use new code`, but the same interactive session continues immediately with ambiguous code provenance** — dogfooded 2026-04-19 from `clawcode-human`. After the `omx` updater ran and explicitly reported `[omx] Updated to v0.13.0. Restart to use new code.`, the same visible interactive session proceeded straight into the requested task prompt instead of forcing or clearly fencing the restart boundary. That creates a stale-binary truth gap: neither the operator nor downstream claws can tell whether the subsequent behavior is coming from the newly installed version, the pre-update in-memory process, or some mixed state where setup artifacts are refreshed but the active runtime is still old. **Required fix shape:** (a) when an update declares restart-required, surface that as a first-class blocked/degraded state (`update_applied_restart_pending`) instead of silently continuing as if task execution provenance were clean; (b) either force a real restart before accepting task prompts or stamp all subsequent events with the pre-restart runtime identity until restart happens; (c) expose version-before/version-after/runtime-active-version distinctly in status surfaces; (d) add regression coverage proving that post-update task work cannot masquerade as running on the fresh version when restart is still pending. **Why this matters:** after self-update, code provenance is the truth boundary; if the tool says "restart required" but still keeps working, every later success or failure becomes harder to attribute to the right build. Source: live dogfood session `clawcode-human` on 2026-04-19.
211. **The updater prompt is automation-hostile because it defaults to affirmative mutation (`Update now? [Y/n]`) during task startup** — dogfooded 2026-04-19 from `clawcode-human`. Before any requested work began, `omx` presented `Update available: v0.12.6 → v0.13.0. Update now? [Y/n]`, meaning the default Enter path mutates the toolchain in the middle of a task-start flow. Even if the operator notices and answers intentionally, the UX contract is backwards for automation-adjacent use: the least-effort path is "change the environment now" instead of "leave the task environment stable unless explicitly opted in." **Required fix shape:** (a) make startup-time updater prompts opt-in by default (`[y/N]`) or suppress them entirely in automation/worktree/task-launch contexts; (b) expose a policy switch so maintainers can choose `never`, `ask`, or `always` update behavior explicitly instead of hidden prompt defaults; (c) classify affirmative-default update prompts as startup mutation events in telemetry so they are visible in lane history; (d) add regression coverage proving a bare Enter during task startup does not silently opt into an update unless policy explicitly allows it. **Why this matters:** default-yes mutation is the wrong trust posture for reproducible dogfood and automation; task startup should preserve environment stability unless the operator deliberately chooses otherwise. Source: live dogfood session `clawcode-human` on 2026-04-19.
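The policy switch in (b) combined with the opt-in default in (a) might look like the sketch below. The `never` / `ask` / `always` values come from this item. The `should_update` helper and its `interactive` flag are hypothetical and do not describe how `omx` is implemented.

```python
def should_update(policy: str, answer: str, interactive: bool) -> bool:
    """Decide whether to apply an available update at task startup.

    Under 'ask', only an explicit yes opts in, which matches a [y/N] prompt:
    a bare Enter or a noninteractive launch leaves the environment unchanged.
    """
    if policy == "never":
        return False
    if policy == "always":
        return True
    if policy != "ask":
        raise ValueError(f"unknown update policy: {policy!r}")
    if not interactive:
        return False
    return answer.strip().lower() in {"y", "yes"}
```

For example, `should_update("ask", "", interactive=True)` returns `False`, so a bare Enter never mutates the toolchain.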
212. **Promotional output is mixed into the task-start surface (`Support the project: gh repo star ...`), diluting operational signal** — dogfooded 2026-04-19 from `clawcode-human`. During the same startup flow that was supposed to move from update/setup into actual task work, `omx` printed a promotional line (`Support the project: gh repo star Yeachan-Heo/oh-my-codex`) directly in the operational transcript. This is not a correctness bug by itself, but it is a clawability gap: startup/task surfaces are where operators and downstream claws are trying to detect readiness, blockers, version provenance, and prompt receipt. Injecting marketing copy into that channel increases noise exactly where the signal budget is most precious. **Required fix shape:** (a) separate promotional/community messaging from operational startup/task transcripts, or gate it behind a quiet/noninteractive mode default for task launches; (b) mark any remaining non-operational lines with explicit metadata so downstream parsers can ignore them; (c) add a policy switch for quiet task-start surfaces vs interactive human-friendly onboarding; (d) add regression coverage proving task-start transcripts contain only operationally relevant lines in automation/worktree contexts. **Why this matters:** if the same channel carries both readiness truth and promo copy, claws have to waste effort distinguishing signal from fluff right when they should be classifying blockers and executing work. Source: live dogfood session `clawcode-human` on 2026-04-19.
213. **Startup can silently enter a more destructive maintenance posture (`Force mode`) before task work begins** — dogfooded 2026-04-19 from `clawcode-human`. The updater/setup transcript included `Force mode: enabled additional destructive maintenance (for example stale deprecated skill cleanup).` in the middle of task startup. Even if the maintenance is legitimate, this is a clawability gap because the runtime is declaring that it has switched into a more destructive cleanup posture before the operator’s requested task has started, yet that posture change is not fenced as a separate trust boundary with explicit operator intent, policy context, or post-change state. **Required fix shape:** (a) treat force/destructive maintenance mode as a first-class startup state transition with explicit provenance and reason, not an inline informational line; (b) require explicit policy/consent in task-launch contexts before enabling destructive maintenance, especially when the user goal was unrelated to maintenance; (c) expose what was actually cleaned/removed under force mode in structured post-run state so the operator can audit side effects; (d) add regression coverage proving ordinary task startup cannot silently widen maintenance/destructive scope without a corresponding policy signal. **Why this matters:** startup should not quietly broaden its mutation/destructive radius under the same transcript used for task execution; when trust posture changes, that change needs to be explicit, auditable, and easy to distinguish from normal startup noise. Source: live dogfood session `clawcode-human` on 2026-04-19.
214. **Task-start transcript leaks internal implementation/config choreography (`HUD config`, `[tui]` ownership, section-left-untouched notes) instead of surfacing only operator-relevant state** — dogfooded 2026-04-19 from `clawcode-human`. The startup/update flow printed lines like `HUD config created (preset: focused).` and `Codex CLI >= 0.107.0 manages [tui]; OMX left that section untouched.` Those may be useful during installer development, but on a task-start surface they are low-level implementation chatter: they expose config ownership details and internal orchestration mechanics that are not the operator’s actual question (`can work start yet? what changed? what is blocked?`). **Required fix shape:** (a) separate installer/debug implementation detail logs from the operator-facing startup/task transcript; (b) summarize them into a higher-level state only when they materially affect readiness (for example `ui_config_deferred_to_host_cli`), otherwise suppress them in normal task launches; (c) provide a verbose/debug mode where maintainers can still inspect the raw choreography intentionally; (d) add regression coverage proving default task-start transcripts carry readiness/provenance/blocker facts, not installer internals. **Why this matters:** when internal config chatter and operational truth share the same transcript, claws have to reverse-engineer which lines matter; startup should communicate state, not make maintainers parse implementation archaeology every run. Source: live dogfood session `clawcode-human` on 2026-04-19.
215. **Setup-scope selection defaults to user/global mutation during task startup, creating project-vs-global provenance ambiguity** — dogfooded 2026-04-19 from `clawcode-human`. The updater/setup flow prompted `Select setup scope:` and defaulted to `1) user (default)`, then continued with `Using setup scope: user` and `User scope leaves project AGENTS.md unchanged.` In a task-launch context inside a specific project worktree, this is a clawability gap: the default mutation target is the operator’s global `~/.codex` environment rather than the current project, so the startup path can change cross-project state before the task even begins. That makes it ambiguous whether later behavior comes from project-local config, user-global config, or some mixed overlay. **Required fix shape:** (a) make scope choice explicit and policy-driven in task/worktree launches instead of defaulting silently to user/global scope; (b) expose the active config/provenance stack clearly after setup (`project`, `user`, or `layered`) so later behavior can be attributed correctly; (c) allow automation/worktree mode to prefer or require project-local scope by default; (d) add regression coverage proving a bare Enter at setup-scope prompt does not unexpectedly widen mutation scope beyond the current project unless policy explicitly allows it. **Why this matters:** when startup mutates global state from inside a project task flow, reproducibility and blame assignment get muddy fast; scope is part of runtime truth and needs to be explicit, not an installer default hidden in startup chatter. Source: live dogfood session `clawcode-human` on 2026-04-19.
216. **Installer refresh-count dumps (`updated=`, `unchanged=`, `skipped=`...) are mixed into task-start transcript even when the operator only needs readiness truth** — dogfooded 2026-04-19 from `clawcode-human`. The startup flow printed a full `Setup refresh summary:` block with counters for prompts, skills, native agents, AGENTS.md, and config. Those counters may be useful for installer debugging, but in a task-launch transcript they are mostly bookkeeping noise: they consume operator attention without answering the task-critical questions (`did startup finish? what mutated? is restart pending? can work begin?`). **Required fix shape:** (a) move raw refresh-count summaries behind verbose/debug output or a separate installer report surface; (b) collapse default task-start output to a higher-level mutation summary only when something materially changed; (c) mark detailed installer accounting as non-operational metadata when it must remain available; (d) add regression coverage proving default task-start transcripts do not include raw installer counter dumps in automation/worktree contexts. **Why this matters:** startup transcripts should optimize for execution truth, not make claws parse installer bookkeeping while they are trying to classify blockers and begin work. Source: live dogfood session `clawcode-human` on 2026-04-19.
217. **Post-setup onboarding checklists (`Next steps:`) are injected into an already-active task-launch flow, re-framing the operator as a first-time user** — dogfooded 2026-04-19 from `clawcode-human`. After the updater/setup churn, the transcript printed a `Next steps:` block (`Start Codex CLI in your project directory`, `Browse skills with /skills`, `The AGENTS.md orchestration brain is loaded automatically`, etc.) immediately before the actual task prompt. In a live project-task session this is a clawability gap: the tool already knows it is inside a project directory and about to execute a concrete prompt, yet it still emits a generic first-run onboarding checklist that competes with the real work context. **Required fix shape:** (a) suppress or relocate first-run/onboarding guidance when the launch context is an active task/worktree session rather than a fresh human install flow; (b) surface onboarding guidance only when the runtime has evidence the user actually needs it; (c) keep detailed onboarding available via explicit help/doctor/docs surfaces instead of the main task-start transcript; (d) add regression coverage proving task-launch transcripts do not append generic `Next steps` blocks once the system has already crossed into execution mode. **Why this matters:** startup truth should narrow toward the requested task, not widen back out into beginner-mode guidance after the operator has already initiated concrete work. Source: live dogfood session `clawcode-human` on 2026-04-19.
218. **Floating UX tips (`Tip: New Build faster with Codex.`) intrude into the task-start truth surface even when the session is about to execute real work** — dogfooded 2026-04-19 from `clawcode-human`. Right after the startup banner and before the actual task prompt took over, the surface displayed `Tip: New Build faster with Codex.` This kind of ambient tip may be harmless in a purely interactive onboarding context, but in a task-launch transcript it is another piece of non-operational noise competing with the real signals: readiness, prompt receipt, blocked state, restart pending, and execution provenance. **Required fix shape:** (a) suppress floating tips by default in task/worktree/automation launch contexts; (b) if tips remain in interactive mode, label them as ignorable non-operational UI hints outside the main transcript channel; (c) provide an explicit `tips=on/off/auto` policy so operators can keep startup surfaces quiet when they need clean telemetry; (d) add regression coverage proving task-start transcripts do not include generic tips once the system has enough context to know it is in execution mode. **Why this matters:** claws need startup transcripts to be high-signal; ambient tips are cheap for humans to ignore but expensive for automation and postmortem parsing because they widen the same channel that carries actual state transitions. Source: live dogfood session `clawcode-human` on 2026-04-19.
219. **The full startup banner still occupies prime task-start transcript space even in an execution-bound session** — dogfooded 2026-04-19 from `clawcode-human`. Before any real work state was surfaced, the session rendered the large `OpenAI Codex (v0.120.0)` banner block with model and directory chrome. A banner is fine for an interactive REPL landing page, but in a task-launch/worktree context it is another large piece of non-operational framing that pushes actual readiness/provenance/blocker signals further down the transcript. This is distinct from the old piped-stdin bug (#48): here the issue is not wrong mode selection, but that once execution mode is already known, the banner still claims the most visible part of the startup surface. **Required fix shape:** (a) suppress or collapse the full banner in task/worktree/automation launches once the system knows it is entering execution immediately; (b) if some context is still useful, reduce it to one compact machine-readable/header line rather than a decorative block; (c) keep the full banner for explicit interactive landing contexts only; (d) add regression coverage proving execution-bound launches surface readiness/provenance first, not the decorative REPL chrome. **Why this matters:** startup transcript real estate is scarce; when the banner consumes the top of the screen, claws and operators pay a tax just to get to the lines that actually determine whether work can proceed. Source: live dogfood session `clawcode-human` on 2026-04-19.
220. **Model/directory context is only exposed as decorative banner chrome instead of a stable structured startup state surface** — dogfooded 2026-04-19 from `clawcode-human`. The session showed useful facts like `model: gpt-5.4 high` and `directory: /mnt/offloading/Workspace/claw-code`, but only inside the decorative startup banner block. That means the context is visually present for a human yet not surfaced as a clearly structured, low-noise state line/event that claws can reliably consume once banners are suppressed or compacted. **Required fix shape:** (a) expose active model, cwd/project root, and similar startup context as a compact structured state surface independent of the decorative banner; (b) keep the data available even when banners are hidden in task/worktree/automation mode; (c) ensure downstream status/lane events can consume the same fields without scraping presentation text; (d) add regression coverage proving model/cwd context survives banner suppression and remains visible in a machine-usable form. **Why this matters:** some startup context is genuinely important, but if it only exists as banner chrome then operators must choose between noisy presentation and losing state; the truth should live in structured state, not decorative formatting. Source: live dogfood session `clawcode-human` on 2026-04-19.
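
A minimal sketch of such a banner-independent state line, in Python. `StartupContext`, `render_context_line`, and the `startup.context` prefix are hypothetical names chosen for illustration; the model and directory values are the ones quoted from the banner above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StartupContext:
    """Hypothetical structured form of the facts the banner currently shows."""
    model: str
    cwd: str


def render_context_line(ctx: StartupContext) -> str:
    # One compact, machine-readable line that can be emitted with or without the banner.
    return "startup.context " + json.dumps(asdict(ctx), sort_keys=True)


ctx = StartupContext(model="gpt-5.4 high", cwd="/mnt/offloading/Workspace/claw-code")
line = render_context_line(ctx)
```

Because the line is built from the dataclass rather than from banner text, a regression test can suppress the banner and still assert on the same fields.
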
221. **Setup progress numbering uses ad-hoc fractional steps (`[5.5/8]`), which blurs startup phase truth instead of clarifying it** — dogfooded 2026-04-19 from `clawcode-human`. The updater/setup transcript labeled one phase as `[5.5/8] Verifying Team CLI API interop...`, which reads like an implementation-side patch to the step list rather than a stable user-facing phase model. It is a small thing, but it is a real clawability gap: when startup phase numbering itself looks improvised, operators and downstream claws cannot tell whether phases are canonical, inserted dynamically, optional, or comparable across runs. **Required fix shape:** (a) expose startup/setup phases as stable named states instead of ad-hoc fractional numbering; (b) if dynamic substeps are needed, nest them structurally under a parent phase instead of mutating the visible top-level ordinal; (c) make machine-readable startup telemetry use canonical phase ids rather than presentation-only counters; (d) add regression coverage proving startup phase sequencing remains stable even when intermediate validation steps are added. **Why this matters:** phase numbering should reduce ambiguity, not advertise that the startup model is being patched live; claws need stable phase identity for comparison, dedupe, and blocker attribution across runs. Source: live dogfood session `clawcode-human` on 2026-04-19.
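
One way to picture fix shapes (a) and (b) is a phase table in which substeps point at a parent instead of taking a fractional ordinal. This is a sketch only: the phase ids below are invented for illustration, since the real installer's phase list is not part of this entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    phase_id: str                 # canonical id, stable across runs
    parent: Optional[str] = None  # set for substeps nested under a phase


# Hypothetical phase ids (assumptions, not the installer's actual list).
PHASES = [
    Phase("configure_hooks"),
    Phase("verify_interop"),
    Phase("verify_team_cli_api", parent="verify_interop"),
    Phase("refresh_config"),
]


def top_level_labels(phases):
    """Number only top-level phases, so adding a substep never shifts ordinals."""
    top = [p for p in phases if p.parent is None]
    return [f"[{i}/{len(top)}] {p.phase_id}" for i, p in enumerate(top, start=1)]
```

With this shape the interop check stays a child of `verify_interop`, and the visible counter remains a whole number.
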
222. **Task-start transcript still tells the operator to `Run "omx doctor" to verify installation` even after the session has already crossed into active execution flow** — dogfooded 2026-04-19 from `clawcode-human`. The updater/setup path printed `Setup complete! Run "omx doctor" to verify installation.` immediately before continuing into the live project task prompt. In a first-run install flow that guidance is fine; in an already-active task/worktree launch it is a diversionary fork that reintroduces setup validation as if the operator were still onboarding instead of already trying to execute concrete work. **Required fix shape:** (a) suppress doctor/verification nudges once the runtime knows it is in an execution-bound task launch rather than a fresh install session; (b) if verification remains relevant, encode it as a structured optional recommendation separate from the main transcript, not a blocking-looking imperative sentence; (c) keep `doctor` guidance available on explicit help/status/install surfaces; (d) add regression coverage proving task-launch transcripts do not instruct users to re-verify installation mid-launch unless a real installation-health blocker is present. **Why this matters:** task-start truth should converge on the requested work; reintroducing `run doctor` guidance at the last moment makes the runtime look uncertain about whether startup is complete and distracts both humans and claws from execution. Source: live dogfood session `clawcode-human` on 2026-04-19.
223. **Capability-detection chatter (`omx team api command detected`, `CLI-first interop ready`) leaks into task-start transcript instead of being summarized as stable readiness state** — dogfooded 2026-04-19 from `clawcode-human`. During setup the transcript printed lines like `omx team api command detected (CLI-first interop ready)`. That may be useful during installer debugging, but in a task-launch transcript it is low-level capability-probing chatter: it tells the operator how the installer discovered a capability instead of simply surfacing the resulting readiness fact, if that fact even matters to the current task. **Required fix shape:** (a) hide raw capability-detection chatter from the default task-start transcript; (b) if the result matters, summarize it as a stable named readiness capability or degraded state rather than a probe log; (c) keep raw probe details in verbose/debug output only; (d) add regression coverage proving startup surfaces do not emit ephemeral detection strings in execution-bound launches. **Why this matters:** claws need canonical state, not probe narration; when startup transcripts describe how readiness was detected rather than the readiness outcome itself, downstream consumers have to reverse-engineer transient strings instead of reading stable state. Source: live dogfood session `clawcode-human` on 2026-04-19.
224. **Backup side effects are reported only as installer bookkeeping (`backed_up=...`) inside startup chatter instead of as an explicit auditable mutation surface** — dogfooded 2026-04-19 from `clawcode-human`. The setup refresh summary included counts like `config: updated=1, unchanged=1, backed_up=1`, which means startup created backup artifacts or backup state as part of the run. That is a real side effect, but it is only exposed as a counter inside noisy installer bookkeeping. In a task-launch context this is a clawability gap: backups are mutation/audit facts, not just installer trivia, and they should be easy to attribute and inspect without scraping summary counts. **Required fix shape:** (a) surface backup creation as an explicit structured mutation event (what was backed up, where, why) rather than only a counter; (b) keep backup/audit details in a dedicated mutation report separate from the main task-start transcript; (c) allow operators to inspect or suppress routine backup chatter without losing auditability; (d) add regression coverage proving backup side effects remain attributable even when installer counter dumps are hidden. **Why this matters:** when startup mutates disk state, the audit trail should be crisp and intentional; hiding backups inside generic `updated/unchanged/backed_up` counters makes real side effects look like disposable noise. Source: live dogfood session `clawcode-human` on 2026-04-19.
225. **Installer mutation summaries are aggregate-only (`updated=`, `skipped=`, `removed=` counts) and hide which concrete artifacts changed** — dogfooded 2026-04-19 from `clawcode-human`. The `Setup refresh summary` reported counters for prompts, skills, native agents, AGENTS.md, and config, but not the identities of the files/items that were actually updated, skipped, backed up, or removed. That creates an item-level opacity gap: even when the operator accepts that startup did maintenance, they still cannot tell what concretely changed without diffing the filesystem or rerunning in a more verbose mode. **Required fix shape:** (a) expose a structured per-item mutation report (or stable pointer to one) alongside the aggregate counts; (b) let the default task-start transcript stay quiet while still preserving an auditable item list off the main path; (c) distinguish no-op categories from real mutated identities so downstream claws can tell whether a count reflects actual risk; (d) add regression coverage proving installer summaries remain attributable at the item level even when only compact high-level output is shown by default. **Why this matters:** counts alone are not enough for trust — when startup says it changed "some" prompts/skills/config, claws need a stable way to know exactly which artifacts moved without scraping or manual archaeology. Source: live dogfood session `clawcode-human` on 2026-04-19.
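
A rough sketch of a per-item report from which the aggregate counters are derived, rather than the reverse. `ItemOutcome`, `aggregate`, and `mutated` are hypothetical names, and the paths are placeholders, not artifacts from the observed run.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemOutcome:
    category: str   # e.g. "prompts", "skills", "config"
    path: str       # identity of the artifact
    outcome: str    # "updated", "unchanged", "skipped", or "removed"


def aggregate(items):
    """Derive the familiar counters from the item list."""
    return Counter((i.category, i.outcome) for i in items)


def mutated(items):
    """Only the artifacts that actually changed."""
    return [i.path for i in items if i.outcome in {"updated", "removed"}]


# Illustrative entries only (assumed paths).
report = [
    ItemOutcome("config", "config.toml", "updated"),
    ItemOutcome("config", "hooks.toml", "unchanged"),
    ItemOutcome("skills", "skills/example.md", "skipped"),
]
```

The default transcript could print nothing beyond a pointer to this report, while `mutated(report)` gives an auditor the concrete identities.
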
226. **Installer summary status labels (`unchanged`, `skipped`, `removed`, `updated`) are not semantically crisp enough for downstream interpretation** — dogfooded 2026-04-19 from `clawcode-human`. The startup transcript emitted category counters like `updated=0, unchanged=20, skipped=13, removed=0`, but the semantics of those buckets are not self-evident in a machine-usable way: does `skipped` mean policy-blocked, out-of-scope, user-owned, version-pinned, or transient failure? Does `unchanged` mean verified identical, or merely not touched? That ambiguity makes the counts hard to trust even before item-level detail is considered. **Required fix shape:** (a) define stable semantics for each installer outcome bucket and expose them in machine-readable form; (b) avoid overloading `skipped`/`unchanged` for multiple reasons — use typed subreasons when needed; (c) ensure compact summaries can still distinguish harmless no-op from policy suppression or deferred action; (d) add regression coverage proving outcome labels remain stable and unambiguous across installer changes. **Why this matters:** if the status words themselves are fuzzy, aggregate counts become misleading telemetry — claws cannot tell whether startup was clean, partially suppressed, or silently deferred without reverse-engineering installer internals. Source: live dogfood session `clawcode-human` on 2026-04-19.
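
The typed-subreason idea in (b) might look like the following sketch. The enum members mirror the possibilities listed in the entry above; the `is_harmless` policy is an assumption made for illustration, not a decided rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    UPDATED = "updated"
    UNCHANGED = "unchanged"
    SKIPPED = "skipped"
    REMOVED = "removed"


class SkipReason(Enum):
    POLICY_BLOCKED = "policy_blocked"
    OUT_OF_SCOPE = "out_of_scope"
    USER_OWNED = "user_owned"
    VERSION_PINNED = "version_pinned"
    TRANSIENT_FAILURE = "transient_failure"


@dataclass(frozen=True)
class ItemResult:
    outcome: Outcome
    skip_reason: Optional[SkipReason] = None

    def __post_init__(self):
        # A skip must say why; any other outcome must not carry a skip reason.
        if (self.outcome is Outcome.SKIPPED) != (self.skip_reason is not None):
            raise ValueError("skip_reason is required for skipped items only")


def is_harmless(result: ItemResult) -> bool:
    """True for no-ops that need no follow-up (an assumed, illustrative policy)."""
    if result.outcome is Outcome.UNCHANGED:
        return True
    return result.skip_reason in {
        SkipReason.OUT_OF_SCOPE,
        SkipReason.USER_OWNED,
        SkipReason.VERSION_PINNED,
    }
```

Rejecting a bare `SKIPPED` at construction time is what keeps the bucket from being overloaded again later.
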
227. **Task startup degrades into an interactive installer questionnaire (update? scope?) instead of a deterministic launch contract** — dogfooded 2026-04-19 from `clawcode-human`. Before any project work began, the launch path required answering multiple setup questions (`Update now? [Y/n]`, `Select setup scope: ... Scope [1-2]`) and only then continued into updater/setup churn and the eventual task prompt. This is a distinct clawability gap from the individual prompt defaults: even if each default were safer, the overall startup contract is still questionnaire-driven rather than deterministic. A task/worktree launch should be able to evaluate policy and either proceed or surface a typed blocked state, not stop for a mini installer interview. **Required fix shape:** (a) replace startup questionnaires with explicit policy-driven decisions and typed states (`update_required`, `scope_resolution_required`, etc.); (b) reserve interactive questioning for explicit install/setup commands, not ordinary task-launch paths; (c) provide a noninteractive/automation-safe mode where launch decisions are resolved from config/policy alone; (d) add regression coverage proving execution-bound launches either start deterministically or fail with structured blockers instead of pausing for ad-hoc Q&A. **Why this matters:** questionnaires destroy launch determinism; claws cannot reliably classify or replay startup when the runtime keeps asking humans to steer installer choices in the middle of task execution. Source: live dogfood session `clawcode-human` on 2026-04-19.
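
As a sketch of (a) and (c), a launch resolver could read policy only and return either `ready` or typed blockers, never a prompt. `LaunchPolicy` and `resolve_launch` are hypothetical; the blocker names are the ones proposed in the entry above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LaunchPolicy:
    auto_update: Optional[bool] = None   # None means "not configured"
    setup_scope: Optional[str] = None    # None means "not configured"


def resolve_launch(policy: LaunchPolicy, update_available: bool) -> Tuple[str, List[str]]:
    """Return ("ready", []) or ("blocked", [typed reasons]) without prompting."""
    blockers = []
    if update_available and policy.auto_update is None:
        blockers.append("update_required")
    if policy.setup_scope is None:
        blockers.append("scope_resolution_required")
    return ("blocked", blockers) if blockers else ("ready", [])
```

An unconfigured launch then fails fast with structured reasons, and the same inputs always produce the same result, which is the property a replayable launch contract needs.
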
228. **Startup success confirmations collapse into repeated generic `Done.` lines with weak object identity** — dogfooded 2026-04-19 from `clawcode-human`. Across the setup flow, multiple steps ended with bare confirmations like `Done.` after labels such as `Creating directories`, `Configuring notification hook`, and similar installer actions. That is a small but real event/log opacity gap: once the transcript gets longer, a claw or human skimming later cannot tell what exact artifact or side effect each `Done.` line is attesting to without walking back through the surrounding prose. **Required fix shape:** (a) emit success confirmations with stable object identity (`directories_created`, `notification_hook_configured`, etc.) instead of bare `Done.`; (b) keep human-friendly summaries if desired, but pair them with structured outcome ids; (c) make compact task-start transcripts collapse repetitive successful maintenance lines unless they materially affect readiness; (d) add regression coverage proving startup confirmations remain attributable even after transcript compaction or banner suppression. **Why this matters:** opaque success acknowledgments are the mirror image of opaque failures — if the runtime cannot say what specifically succeeded, later audits and parsers have to reconstruct state from surrounding noise instead of reading a stable event surface. Source: live dogfood session `clawcode-human` on 2026-04-19.
229. **`Setup complete!` is emitted as a false-completion signal even while restart-required / execution-readiness ambiguity still exists** — dogfooded 2026-04-19 from `clawcode-human`. The startup flow printed `Setup complete!` even though the same transcript also said `Updated to v0.13.0. Restart to use new code.` and then continued into a noisy task-launch path with unclear runtime provenance. That makes `Setup complete!` a misleading terminal state label: it reads like the environment is fully ready and settled when in reality restart is still pending and execution truth is still muddy. **Required fix shape:** (a) reserve `complete`/`ready` language for genuinely execution-ready states only; (b) when restart or policy resolution is still pending, emit a degraded or transitional state instead (`setup_applied_restart_pending`, `setup_applied_not_ready`, etc.); (c) make human-facing copy and machine-facing state agree on whether the launch is actually ready for work; (d) add regression coverage proving no completion banner is shown while mandatory follow-up state (restart, consent, scope resolution) remains unresolved. **Why this matters:** false green completion signals poison the whole startup surface — once the runtime says `complete` too early, every later blocker or ambiguity looks like a contradiction instead of a known pending state. Source: live dogfood session `clawcode-human` on 2026-04-19.
230. **Post-setup guidance can directly contradict observed reality (`Start Codex CLI in your project directory`) even though the session is already inside Codex in that directory** — dogfooded 2026-04-19 from `clawcode-human`. After startup had already entered the Codex UI and clearly showed `directory: /mnt/offloading/Workspace/claw-code`, the `Next steps:` block still instructed `Start Codex CLI in your project directory`. This is sharper than generic onboarding noise: it is self-contradicting guidance emitted in the same transcript that already proves the instruction has been satisfied. **Required fix shape:** (a) suppress any next-step/help guidance that is contradicted by current runtime state; (b) make onboarding copy state-aware so already-satisfied steps are removed or marked complete instead of repeated as advice; (c) ensure task-launch transcripts prefer observed facts over canned checklists; (d) add regression coverage proving startup help text does not instruct the user to do something the runtime already knows is true. **Why this matters:** contradictory guidance corrodes trust faster than generic noise — once the transcript tells the user to do something they are visibly already doing, every other startup instruction becomes suspect too. Source: live dogfood session `clawcode-human` on 2026-04-19.
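
State-aware onboarding copy, as in (b), can be sketched by pairing each step with a completion check. The state keys and checks below are assumptions for illustration; only the two step texts come from the observed `Next steps:` block.

```python
def pending_steps(steps, state):
    """Return only the step texts whose completion check fails for `state`."""
    return [text for text, is_done in steps if not is_done(state)]


# Hypothetical runtime state; the field names are assumptions.
state = {"inside_cli": True, "cwd_is_project": True, "skills_browsed": False}

steps = [
    ("Start Codex CLI in your project directory",
     lambda s: s["inside_cli"] and s["cwd_is_project"]),
    ("Browse skills with /skills",
     lambda s: s["skills_browsed"]),
]
```

Here the first step is dropped because the runtime already knows it is satisfied, so the transcript can no longer contradict itself on that point.
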
231. **Task-start transcript uses internal/anthropomorphic claims (`The AGENTS.md orchestration brain is loaded automatically`) instead of verifiable readiness facts** — dogfooded 2026-04-19 from `clawcode-human`. The `Next steps:` block included `The AGENTS.md orchestration brain is loaded automatically`, which is not a crisp operational fact but an internal/marketing-ish claim about the system’s conceptual model. In a task-launch transcript this is a clawability gap: the line sounds important, but it does not say what was actually loaded, how to verify it, or whether it affects current readiness. **Required fix shape:** (a) replace anthropomorphic/internal claims in startup/task surfaces with verifiable state facts (`AGENTS.md loaded: yes/no`, `policy file path`, `load source`, etc.) when such state matters; (b) keep conceptual/product-language copy out of operational transcripts or confine it to docs/onboarding surfaces; (c) make every startup claim testable against observable runtime state; (d) add regression coverage proving task-launch transcripts surface factual state instead of unverifiable product prose. **Why this matters:** claws can only reason over checkable truth; when startup surfaces speak in metaphor or internal branding, downstream consumers cannot distinguish “important state” from “colorful copy,” and auditability collapses. Source: live dogfood session `clawcode-human` on 2026-04-19.
232. **Startup lacks a canonical final verdict line/state (`READY`, `BLOCKED`, `RESTART_REQUIRED`, etc.), forcing claws to infer readiness from noisy transcript fragments** — dogfooded 2026-04-19 from `clawcode-human`. After update prompts, scope questions, setup steps, summaries, tips, and onboarding chatter, the transcript never emitted one authoritative machine-usable outcome that settled the startup state. Instead, the operator had to infer from scattered lines like `Setup complete!`, `Restart to use new code.`, and subsequent prompt availability. This is a core event/log opacity gap: even if every individual line were cleaner, claws still need one canonical startup verdict to know whether the session is truly ready, degraded, blocked, or restart-pending. **Required fix shape:** (a) emit a single explicit startup outcome state at the end of launch (`ready`, `blocked`, `restart_required`, `setup_degraded`, etc.); (b) make that verdict authoritative over incidental transcript prose and reusable in lane/status events; (c) attach the minimal structured reasons that led to the verdict so downstream consumers do not have to scrape prior chatter; (d) add regression coverage proving every execution-bound launch terminates its startup phase with exactly one canonical verdict. **Why this matters:** without a final authoritative verdict, startup remains chat archaeology — claws cannot reliably decide whether to proceed, wait, or remediate because readiness lives only in the reader’s interpretation of noisy text. Source: live dogfood session `clawcode-human` on 2026-04-19.
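
A minimal sketch of how one verdict could be reduced from the signals collected during startup. The state names are those suggested in (a); the precedence order and the `Verdict` / `final_verdict` names are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

# Most severe first (assumed ordering). "ready" is the fallback, not a signal.
PRECEDENCE = ["blocked", "restart_required", "setup_degraded"]


@dataclass(frozen=True)
class Verdict:
    state: str
    reasons: Tuple[str, ...]


def final_verdict(signals) -> Verdict:
    """Reduce (state, reason) pairs to exactly one authoritative verdict."""
    by_state = {}
    for state, reason in signals:
        by_state.setdefault(state, []).append(reason)
    for state in PRECEDENCE:
        if state in by_state:
            return Verdict(state, tuple(by_state[state]))
    return Verdict("ready", ())
```

For the observed session, a `restart_required` signal raised by the updater would win over any later `Setup complete!` copy, and its reasons would travel with the verdict into lane/status events.
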
233. **Startup and task execution share one undifferentiated transcript stream; there is no explicit handoff boundary from setup/maintenance into real work** — dogfooded 2026-04-19 from `clawcode-human`. The same surface flowed from updater prompts, setup-scope questions, installer progress, summaries, tips, and onboarding text directly into the actual task prompt with no clean phase break that said “startup is over; execution has begun.” This is distinct from #232’s missing final verdict: even if a verdict existed, claws still need a visible handoff boundary so later lines can be interpreted as task execution rather than residual setup chatter. **Required fix shape:** (a) emit an explicit phase transition when control passes from startup/setup into execution (`startup_finished`, `execution_begin`, or equivalent); (b) keep startup/maintenance events logically grouped and separate from task-turn events in lane history; (c) make the handoff boundary machine-readable so downstream consumers can split logs without heuristic scraping; (d) add regression coverage proving execution-bound launches expose one clear startup→execution boundary even when startup performs updates or setup work first. **Why this matters:** without a crisp handoff, every later line is ambiguous — claws cannot tell whether they are reading installer residue or real task progress, so monitoring, replay, and blame assignment all stay fuzzy. Source: live dogfood session `clawcode-human` on 2026-04-19.
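
With a machine-readable boundary in place, a consumer could split lane history without heuristics. The sketch below assumes events are dicts with a `type` key and that the marker is named `execution_begin`, one of the names floated in (a).

```python
def split_at_handoff(events, marker="execution_begin"):
    """Split an ordered event list into (startup, execution) at the marker.

    Raises ValueError when the boundary is missing or duplicated, so
    consumers never fall back to guessing.
    """
    positions = [i for i, e in enumerate(events) if e["type"] == marker]
    if len(positions) != 1:
        raise ValueError(f"expected exactly one {marker!r}, found {len(positions)}")
    cut = positions[0]
    return events[:cut], events[cut + 1:]
```

Failing loudly on zero or multiple markers doubles as the regression check in (d): every execution-bound launch must expose exactly one boundary.
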
234. **Startup phases expose almost no elapsed-time signal, so operators cannot tell which pre-task step actually consumed launch latency** — dogfooded 2026-04-19 from `clawcode-human`. The launch path spent real time in update prompting, setup scope selection, setup refresh, interop checks, config work, and onboarding chatter before real work began, but the transcript gave almost no per-phase timing or duration summary. That makes startup friction hard to localize: claws can see that startup felt long, but not whether the time went to update/install, config rewrite, capability probing, restart-pending drift, or UI chatter. **Required fix shape:** (a) attach elapsed timing to major startup phases and the final startup verdict; (b) expose a compact duration breakdown for update/setup/probe/handoff phases in machine-readable form; (c) keep detailed timings available even when the visible transcript is compacted; (d) add regression coverage proving execution-bound launches can report where pre-task latency was spent without log scraping. **Why this matters:** if startup latency is opaque, every slowdown becomes anecdotal. Claws need timing attribution to decide whether to suppress noise, precompute setup, change policy defaults, or fix a real blocker. Source: live dogfood session `clawcode-human` on 2026-04-19.
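
Per-phase timing can be sketched with a small context manager around each phase. `PhaseTimer` is a hypothetical helper; the clock is injectable so tests do not depend on wall time.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates elapsed seconds per named startup phase."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.durations = {}

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the phase raised.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        """Name of the phase that consumed the most time, or None."""
        if not self.durations:
            return None
        return max(self.durations, key=self.durations.get)
```

Usage would be `with timer.phase("update"): ...` around each step; `timer.durations` is then the compact breakdown that can be attached to the final startup verdict.
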
235. **Startup decisions have no policy-source attribution, so prompts and mutations appear arbitrary (`why am I being asked to update/scope-switch/force-maintain?`)** — dogfooded 2026-04-19 from `clawcode-human`. The launch path asked about updates, defaulted to user scope, entered force mode, and emitted various setup actions, but the transcript never said which config, policy, default rule, or caller context caused those decisions. The operator can see *what* happened, but not *why this branch was chosen*. That creates a policy-opacity gap on top of the noise: even if the prompts were fewer, claws still could not audit whether a choice came from explicit config, a default fallback, current repo context, or installer hardcode. **Required fix shape:** (a) attach policy-source metadata to startup decisions (`source=config`, `source=default`, `source=interactive_override`, `source=repo_policy`, etc.); (b) surface compact reason/source tags for major mutations and prompts without dumping raw config internals; (c) make the final startup verdict include the key policy inputs that shaped launch; (d) add regression coverage proving update/scope/force-mode decisions remain attributable after transcript compaction. **Why this matters:** startup trust is not just about the visible action — it is about whether claws can trace that action back to an intentional policy source instead of treating it like arbitrary runtime whim. Source: live dogfood session `clawcode-human` on 2026-04-19.
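
Source attribution falls out naturally if settings are resolved through ordered layers and the winning layer's name is returned with the value. The source tags come from (a); the layer precedence and the `resolve_with_source` name are assumptions.

```python
def resolve_with_source(key, interactive=None, config=None, repo_policy=None, defaults=None):
    """Return (value, source) for `key`, checking layers in an assumed precedence."""
    layers = [
        ("interactive_override", interactive),
        ("config", config),
        ("repo_policy", repo_policy),
        ("default", defaults),
    ]
    for source, layer in layers:
        if layer and key in layer:
            return layer[key], source
    raise KeyError(key)
```

A decision such as setup scope could then be logged as `scope=user source=default`, which answers "why this branch" without dumping raw config.
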
236. **Setup refresh has no drift/trigger explanation, so repeated pre-task maintenance looks unconditional even when it may be idempotent or unnecessary** — dogfooded 2026-04-19 from `clawcode-human`. The launch path ran a broad setup refresh and printed counts (`updated`, `unchanged`, `skipped`, `backed_up`), but never explained why this refresh was needed on this run: stale install detected, version mismatch, missing files, policy-enforced reapply, or just unconditional startup behavior. That leaves a critical ambiguity: the operator can see maintenance happened, but cannot tell whether it was justified by detected drift or simply rerun every time. **Required fix shape:** (a) emit a compact trigger reason for startup maintenance (`version_drift`, `missing_artifacts`, `policy_reapply`, `first_run`, `forced_refresh`, etc.); (b) include whether the refresh was necessary, opportunistic, or unconditional; (c) surface the trigger reason in the final startup verdict and structured mutation report; (d) add regression coverage proving repeated launches can distinguish "no drift, no refresh needed" from "refresh intentionally rerun because X." **Why this matters:** without drift/trigger attribution, startup maintenance feels arbitrary and expensive — claws cannot decide whether to cache, suppress, precompute, or eliminate the work because they do not know why it fired. Source: live dogfood session `clawcode-human` on 2026-04-19.
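
A small sketch of computing the trigger reason before any refresh runs. The reason strings are the ones listed in (a); the function name, its inputs, and the check order are assumptions.

```python
from typing import List, Optional


def refresh_trigger(
    installed: str,
    expected: str,
    missing: List[str],
    first_run: bool = False,
    forced: bool = False,
) -> Optional[str]:
    """Return why a refresh should run, or None when no drift is detected."""
    if forced:
        return "forced_refresh"
    if first_run:
        return "first_run"
    if missing:
        return "missing_artifacts"
    if installed != expected:
        return "version_drift"
    return None
```

A `None` result is the "no drift, no refresh needed" case from (d); anything else is a reason that can be carried into the verdict and mutation report.
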
237. **Repeated startup maintenance exposes no idempotence/fast-path signal, so claws cannot tell whether the runtime short-circuited safely or re-executed the whole setup pipeline** — dogfooded 2026-04-19 from `clawcode-human`. The setup flow reported lots of `unchanged` counts, but the transcript never made clear whether that meant a true cheap no-op fast path, a full scan/rewrite pass that happened to find no diffs, or a partially skipped installer run. This is distinct from #236’s missing trigger reason: even if a refresh was justified, the operator still cannot tell whether repeated launches are paying the full maintenance cost or benefiting from a stable idempotent shortcut. **Required fix shape:** (a) expose whether startup maintenance took a `fast_path`, `full_scan_noop`, `partial_reapply`, or `mutating_refresh` route; (b) include compact machine-readable idempotence metadata in startup verdicts and maintenance reports; (c) separate “no changes needed” from “work rerun but produced no diffs” so downstream systems can reason about startup cost; (d) add regression coverage proving repeated launches report a stable idempotence mode rather than forcing consumers to infer it from counters. **Why this matters:** idempotence is part of startup truth — without it, claws cannot optimize repeated launches or explain why startup still feels heavy even when nothing changed on disk. Source: live dogfood session `clawcode-human` on 2026-04-19.
238. **Startup prompts do not preserve answer provenance (`explicit user choice` vs `accepted default`), so later audit cannot tell who actually chose update/scope branches** — dogfooded 2026-04-19 from `clawcode-human`. The launch flow showed questionnaire-style prompts such as `Update now? [Y/n]` and `Scope [1-2] (default: 1):`, but the resulting transcript only reflected the chosen path (`Using setup scope: user`, updater executed) without clearly recording whether those outcomes came from explicit operator input, default acceptance, automation, or some other implicit branch. That is a real audit gap: even if startup decisions become policy-driven later, the current surface cannot reconstruct whether a risky branch was intentionally chosen or simply happened because Enter accepted the default. **Required fix shape:** (a) record answer provenance for startup decisions (`explicit_input`, `default_accepted`, `policy_auto`, `preconfigured`) in machine-readable form; (b) surface compact provenance tags for consequential branches like update/scope/force mode; (c) thread answer provenance into the final startup verdict and audit trail; (d) add regression coverage proving startup decisions remain attributable after transcript compaction and banner suppression. **Why this matters:** when a launch mutates the environment, it is not enough to know what branch happened — claws need to know whether a human actually chose it or whether the system silently fell through to a default. Source: live dogfood session `clawcode-human` on 2026-04-19.
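
Answer provenance could be captured at the moment a prompt is resolved. In this sketch `record_answer` is a hypothetical helper, `raw` is the text read from the operator (empty when Enter was pressed), and the provenance tags are those from (a).

```python
def record_answer(prompt_id, default, raw="", policy_value=None):
    """Classify how a startup prompt was resolved and return an audit record."""
    if policy_value is not None:
        # Policy decided; the prompt was never the source of truth.
        value, provenance = policy_value, "policy_auto"
    elif raw.strip() == "":
        value, provenance = default, "default_accepted"
    else:
        value, provenance = raw.strip(), "explicit_input"
    return {"prompt": prompt_id, "value": value, "provenance": provenance}
```

For `Scope [1-2] (default: 1):`, pressing Enter and typing `1` both yield the same value, but the records differ in `provenance`, which is the distinction the audit trail currently loses.
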
239. **Startup transcript has no severity/importance layering, so blockers, mutations, info, and tips all compete at the same visual priority** — dogfooded 2026-04-19 from `clawcode-human`. In the same startup surface, lines about restart-required state, updater actions, setup mutations, promo copy, onboarding guidance, tips, and installer bookkeeping all appeared as ordinary transcript entries with no stable severity cues. That means the operator has to manually decide which lines are blockers, which are side-effect audit facts, and which are safely ignorable. **Required fix shape:** (a) assign stable severity/importance classes to startup events (`blocker`, `mutation`, `readiness`, `info`, `hint`, etc.); (b) make the final startup verdict and compact transcript prioritize blocker/readiness signals above all other classes; (c) let downstream consumers filter or collapse lower-severity startup chatter without losing auditability; (d) add regression coverage proving startup surfaces preserve severity ordering even when verbose output is enabled. **Why this matters:** even perfect wording is not enough if every line has equal visual weight — claws need severity structure so the startup surface can be parsed by priority instead of by brute-force reading order. Source: live dogfood session `clawcode-human` on 2026-04-19.
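
Severity layering might be sketched as an ordered enum plus a filter. The class names follow (a); their relative order and the `compact_view` helper are assumptions.

```python
from enum import IntEnum


class Severity(IntEnum):
    HINT = 0
    INFO = 1
    MUTATION = 2
    READINESS = 3
    BLOCKER = 4


def compact_view(events, minimum=Severity.MUTATION):
    """Drop low-severity events and list the rest most-severe first.

    `events` is a list of (Severity, text) tuples. The sort is stable, so
    events of equal severity keep their original order.
    """
    kept = [e for e in events if e[0] >= minimum]
    return sorted(kept, key=lambda e: e[0], reverse=True)
```

The full list stays available for audit; only the default view is filtered, which matches (c).
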
240. **Startup mixes persistent mutations and ephemeral observations in the same plain-text channel, so operators cannot quickly tell what changed on disk/config versus what was merely detected** — dogfooded 2026-04-19 from `clawcode-human`. The transcript interleaved observations like capability detection, version notices, and tips with persistent side effects like config refreshes, backups, hook setup, and possible global-scope mutation, but rendered them all as ordinary prose lines. That makes audit and recovery harder: a claw reading back later cannot immediately separate "this was observed" from "this changed machine state." **Required fix shape:** (a) classify startup events by persistence class (`observation`, `decision`, `mutation`, `audit_artifact`) in addition to severity; (b) provide a compact mutation-only view or structured ledger for the startup run; (c) keep ephemeral observations available without letting them obscure which events actually changed durable state; (d) add regression coverage proving startup surfaces preserve the distinction between detected facts and persisted side effects. **Why this matters:** when startup changes the machine, claws need a fast path to the durable side effects. Without a persistence distinction, every audit becomes transcript archaeology instead of a clean state-change review. Source: live dogfood session `clawcode-human` on 2026-04-19.
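
The mutation-only view from fix shape (b) of #240 could look roughly like the following. The persistence classes are the ones named in the entry; `PersistenceTaggedEvent` and `mutation_ledger` are hypothetical names, and whether `audit_artifact` entries (such as backups) belong in the ledger is left as an open design choice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Persistence(str, Enum):
    OBSERVATION = "observation"
    DECISION = "decision"
    MUTATION = "mutation"
    AUDIT_ARTIFACT = "audit_artifact"


@dataclass(frozen=True)
class PersistenceTaggedEvent:
    persistence: Persistence
    summary: str
    target: Optional[str] = None  # what was touched, if anything (assumed field)


def mutation_ledger(events: List[PersistenceTaggedEvent]) -> List[PersistenceTaggedEvent]:
    """Return only the events that changed durable state, in emit order."""
    return [e for e in events if e.persistence is Persistence.MUTATION]
```
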
241. **Startup emits many lines but no stable startup-attempt/run id, so downstream claws cannot reliably group which prompts, mutations, and verdict belong to the same launch** — dogfooded 2026-04-19 from `clawcode-human`. The startup flow included update prompting, scope selection, setup steps, summaries, restart-required messaging, onboarding spillover, and then task execution, but none of those lines carried a shared startup correlation id. That makes analysis brittle once multiple launches or retries exist nearby: parsers have to infer grouping by proximity instead of knowing "these 23 lines belong to startup attempt X." **Required fix shape:** (a) assign a stable startup run id/correlation id at launch begin; (b) attach it to startup prompts, mutations, summaries, verdicts, and the startup→execution handoff; (c) preserve the id in compact transcript mode and structured lane/status events; (d) add regression coverage proving concurrent/retried launches remain separable without heuristic log scraping. **Why this matters:** without correlation identity, even improved startup events stay hard to stitch together across retries, compaction, and neighboring sessions. A canonical run id turns noisy startup text into a coherent attributable execution record. Source: live dogfood session `clawcode-human` on 2026-04-19.
242. **Startup events have no stable sequence index inside a run, so downstream claws cannot reconstruct exact event order without trusting transcript layout** — dogfooded 2026-04-19 from `clawcode-human`. Even within one startup attempt, the flow mixed prompts, setup phases, summaries, restart-required signals, onboarding spillover, and the execution handoff without any monotonic event numbering or ordered machine-readable sequence marker. This is adjacent to #241 but distinct: a run id can tell you *which* launch a line belongs to, but not the exact canonical order of steps once output is compacted, reflowed, partially hidden, or merged into other status surfaces. **Required fix shape:** (a) assign a monotonic startup event sequence index within each startup run; (b) carry that sequence through structured startup events, summaries, and the final verdict/handoff; (c) preserve sequence identity when rendering compact human transcripts so downstream consumers can recover true order without scraping visual layout; (d) add regression coverage proving startup ordering remains reconstructable across retries, compaction, and alternate renderers. **Why this matters:** grouping without ordering is only half the audit trail. Claws need canonical event order to tell whether a blocker preceded a mutation, whether a verdict came before or after restart-required, and whether setup really finished before execution began. Source: live dogfood session `clawcode-human` on 2026-04-19.
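
Entries #241 and #242 together ask for a correlation id plus a per-run sequence index. A rough sketch of how the two could combine, assuming a hypothetical `StartupRun` emitter (the id format and field names are illustrative, not an existing API):

```python
import itertools
import uuid
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class SequencedEvent:
    run_id: str   # which launch this event belongs to (#241)
    seq: int      # monotonic position inside that launch (#242)
    kind: str
    message: str


class StartupRun:
    """Hypothetical emitter that stamps every event with run id and sequence."""

    def __init__(self, run_id: Optional[str] = None) -> None:
        self.run_id = run_id or uuid.uuid4().hex
        self._counter = itertools.count(1)

    def emit(self, kind: str, message: str) -> SequencedEvent:
        return SequencedEvent(self.run_id, next(self._counter), kind, message)


def reconstruct(events: Iterable[SequencedEvent], run_id: str) -> List[SequencedEvent]:
    """Recover one run's canonical order without relying on transcript layout."""
    return sorted((e for e in events if e.run_id == run_id), key=lambda e: e.seq)
```

With both fields present, interleaved or reordered output from concurrent launches can be separated and re-sorted without proximity heuristics.
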
243. **Startup prompts ask for consent without previewing the concrete mutation plan, so `yes/no` decisions are under-informed** — dogfooded 2026-04-19 from `clawcode-human`. The launch path asked questions like `Update now? [Y/n]` and then proceeded into global install, setup refresh, config rewrites/backups, notification/HUD changes, possible force-mode maintenance, and restart-required state — but the prompt itself did not preview that concrete mutation set before asking for consent. This is a distinct clawability gap from policy/source attribution: even if the decision source were known, the operator still was not shown a compact “what will change if you say yes” plan before choosing. **Required fix shape:** (a) provide a concise mutation preview before consequential startup prompts (`will update package`, `may rewrite config`, `may create backups`, `restart required`, scope target, etc.); (b) make the preview machine-readable so automation and logs can capture the intended mutation set before execution; (c) allow policy-driven noninteractive mode to log the same preview as a preflight plan instead of asking interactively; (d) add regression coverage proving startup consent points expose their concrete planned side effects before mutation begins. **Why this matters:** consent without a change preview is barely better than blind defaulting — claws need to know not just that a branch exists, but what durable consequences that branch will have before they approve or auto-resolve it. Source: live dogfood session `clawcode-human` on 2026-04-19.
244. **Startup has no dry-run / inspect-only path for mutation-heavy setup decisions, so the only way to learn what would happen is to start mutating** — dogfooded 2026-04-19 from `clawcode-human`. The launch path combined update prompting, scope selection, setup refresh, config rewrite/backups, force-mode maintenance, and restart-required drift, but there was no obvious dry-run or inspect-only startup contract that would let an operator ask “what would this launch do?” without already entering the mutation flow. This is adjacent to #243’s missing mutation preview, but broader: even a good inline preview still leaves no reusable no-side-effect mode for automation, audits, or preflight debugging. **Required fix shape:** (a) add a startup dry-run / inspect-only mode that evaluates policy, detects drift, computes the mutation plan, and emits the same canonical startup verdict without applying changes; (b) make that dry-run output machine-readable and structurally identical enough to compare with a real run; (c) ensure task/worktree automation can call the inspect path before deciding whether to allow mutation; (d) add regression coverage proving startup planning can be observed without side effects and that real execution matches the planned mutation set. **Why this matters:** when startup can rewrite global/user/project state, “show me the plan without touching anything” is a core clawability contract, not a luxury. Without it, every audit begins after the machine has already been changed. Source: live dogfood session `clawcode-human` on 2026-04-19.
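
The mutation preview (#243) and the dry-run path (#244) can share one planning step. The sketch below is an illustration only: `plan_startup`, `run_startup`, and the specific action names are assumptions loosely based on the side effects listed in these entries, not the real startup flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PlannedMutation:
    action: str
    target: str


def plan_startup(update_available: bool, scope: str) -> List[PlannedMutation]:
    """Compute the mutation plan without touching anything."""
    plan: List[PlannedMutation] = []
    if update_available:
        plan.append(PlannedMutation("update_package", "global install"))
    plan.append(PlannedMutation("create_backup", f"{scope} config"))
    plan.append(PlannedMutation("rewrite_config", f"{scope} config"))
    if update_available:
        plan.append(PlannedMutation("restart_required", "current session"))
    return plan


def run_startup(
    plan: List[PlannedMutation],
    apply: Callable[[PlannedMutation], None],
    dry_run: bool = True,
) -> Dict[str, object]:
    """Emit the same verdict shape for dry runs and real runs."""
    applied: List[str] = []
    if not dry_run:
        for step in plan:
            apply(step)
            applied.append(step.action)
    return {
        "mode": "dry_run" if dry_run else "apply",
        "planned": [step.action for step in plan],
        "applied": applied,
    }
```

Because both modes return the same structure, a regression test can assert that a real run's `applied` list matches the `planned` list from an earlier dry run.
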
245. **`oc-work send` can fail as a silent control-plane misfire (usage dump / missing required context) instead of a typed delivery error with correction guidance** — dogfooded 2026-04-20 from the live #claw-code coordination lane while Jobdori tried to steer sisyphus on ROADMAP #127. The first `oc-work send` attempt printed underlying script usage (`Usage: send-prompt.sh ...`) because `--session` was missing, but from the outer operator view that looked like a vague tool hiccup rather than a precise control-plane delivery failure. The command only succeeded after manually discovering the active session id and reissuing with `--session ses_25725e95fffe882FpmeZNL1HdA`. **Required fix shape:** (a) promote missing required control-plane context (like target session id) into a typed `delivery_blocked_missing_session` / `invalid_send_target` error instead of raw usage echo from an inner script; (b) when a send command can infer or list likely active session ids, surface that guidance directly in the error; (c) ensure failed sends emit an explicit `not delivered` outcome so operators do not confuse usage text with successful steering; (d) add regression coverage proving `oc-work send` failures preserve operator intent, classify the missing arg correctly, and never masquerade as opaque shell noise. **Why this matters:** control-plane misfires are worse than ordinary tool failures because they create false confidence that steering happened when it did not. For multi-agent clawhip/agentika loops, send-path auditability has to be crisp. Source: live Jobdori / agentika steering thread in #claw-code on 2026-04-20.
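
A sketch of the typed send outcome described in #245. The error codes `delivery_blocked_missing_session` and `invalid_send_target` come from the entry; the `SendOutcome` type, the `send_prompt` wrapper, and the `deliver` callback are hypothetical stand-ins for whatever actually wraps `send-prompt.sh`.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SendOutcome:
    delivered: bool
    error: Optional[str] = None
    hint: Optional[str] = None


def send_prompt(
    prompt: str,
    session: Optional[str],
    active_sessions: Sequence[str],
    deliver: Callable[[str, str], None],
) -> SendOutcome:
    """Validate the target before delivery and return an explicit outcome."""
    candidates = ", ".join(active_sessions) or "none found"
    if not session:
        return SendOutcome(
            False,
            "delivery_blocked_missing_session",
            f"not delivered; pass --session (active: {candidates})",
        )
    if session not in active_sessions:
        return SendOutcome(
            False,
            "invalid_send_target",
            f"not delivered; unknown session {session!r} (active: {candidates})",
        )
    deliver(session, prompt)
    return SendOutcome(True)
```

The point of the shape is that a missing `--session` never reaches the inner script, so the operator sees a classified error and candidate session ids instead of a usage dump.
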
246. **Dogfood reminder cron can self-fail by timing out during active cycles, so the nudge loop itself is not trustworthy as an observability surface** — dogfooded 2026-04-21 in `#clawcode-building-in-public` after multiple consecutive alerts: `Cron job "clawcode-dogfood-cycle-reminder" failed: cron: job execution timed out` at 14:14, 14:24, 14:34, 14:44, 15:13, and 15:23 KST while the same dogfood cycle was actively producing reports and fixes. This is not just scheduler noise — it is a clawability gap in the reminder/control loop itself. A downstream claw seeing both repeated dogfood nudges and repeated cron timeouts cannot tell whether the reminder actually delivered, partially delivered, duplicated, or died after side effects. **Required fix shape:** (a) classify reminder execution outcome explicitly (`delivered`, `timed_out_after_send`, `timed_out_before_send`, `suppressed_as_duplicate`, `skipped_due_to_active_cycle`) instead of a single generic timeout; (b) attach the target message/report cycle id and whether a Discord post was already emitted before timeout; (c) add a fast-path/no-op path when the cycle state is unchanged or an active report is already in flight so the reminder job can exit cleanly instead of hanging; (d) add regression coverage proving repeated unchanged-state cycles do not stack timeouts or duplicate nudges. **Why this matters:** if the reminder loop itself is ambiguous, claws waste time responding to scheduler artifacts instead of real product state, and the dogfood surface stops being a reliable source of truth. Source: live clawhip/Jobdori dogfood cycle on 2026-04-21 with repeated timeout alerts in `#clawcode-building-in-public`.
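
The outcome classes in fix shape (a) of #246 could be derived from a few facts the reminder job already knows. This is a sketch under assumptions: the four boolean inputs and the precedence between them are guesses at a reasonable policy, not the cron job's actual logic.

```python
from enum import Enum


class ReminderOutcome(str, Enum):
    DELIVERED = "delivered"
    TIMED_OUT_AFTER_SEND = "timed_out_after_send"
    TIMED_OUT_BEFORE_SEND = "timed_out_before_send"
    SUPPRESSED_AS_DUPLICATE = "suppressed_as_duplicate"
    SKIPPED_DUE_TO_ACTIVE_CYCLE = "skipped_due_to_active_cycle"


def classify_reminder(
    *,
    state_changed: bool,
    report_in_flight: bool,
    post_emitted: bool,
    timed_out: bool,
) -> ReminderOutcome:
    """Map one reminder run to an explicit outcome (precedence is assumed)."""
    # Fast paths: exit cleanly before doing any slow work.
    if report_in_flight:
        return ReminderOutcome.SKIPPED_DUE_TO_ACTIVE_CYCLE
    if not state_changed:
        return ReminderOutcome.SUPPRESSED_AS_DUPLICATE
    if timed_out:
        if post_emitted:
            return ReminderOutcome.TIMED_OUT_AFTER_SEND
        return ReminderOutcome.TIMED_OUT_BEFORE_SEND
    if not post_emitted:
        raise ValueError("no timeout and no post: outcome is not classifiable")
    return ReminderOutcome.DELIVERED
```

Checking the fast paths first corresponds to fix shape (c): an unchanged cycle or an in-flight report exits without reaching the code that can time out.
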
247. **MCP memory permission prompts can recur after a transport failure, leaving an active worker blocked in a second consent loop instead of a typed degraded state** — dogfooded 2026-04-27 from live session `clawcode-human` while responding to the claw-code dogfood nudge. The session first asked permission for `omx_memory.project_memory_read`; after approval, the call failed with `Transport closed`, then the runtime immediately attempted `omx_memory.notepad_read` and blocked again on a fresh allow prompt. From the outside this looks like an automation-hostile MCP lifecycle gap: the worker is neither cleanly ready nor cleanly failed, and downstream claws must scrape the pane to learn that memory MCP is both consent-gated and transport-degraded. **Required fix shape:** (a) after an MCP transport closes, emit a typed degraded state such as `mcp_transport_closed` with server/tool identity; (b) suppress or batch follow-up permission prompts for the same failed MCP server until transport recovery is proven; (c) expose whether the task can continue without that MCP tool or is blocked on memory; (d) add regression coverage for `permission granted -> transport closed -> follow-up tool attempt` so it becomes one structured blocker instead of repeated interactive consent loops. **Why this matters:** MCP memory should either be available, explicitly degraded, or explicitly blocked; repeated permission prompts after a closed transport make prompt delivery and readiness ambiguous. Source: live `clawcode-human` pane on 2026-04-27 04:3x UTC.
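
A sketch of the `permission granted -> transport closed -> follow-up tool attempt` path from #247. The `mcp_transport_closed` state is named in the entry and `tool_permission_required` appears in #200; the `McpGate` class, its method names, and the payload fields are hypothetical.

```python
from typing import Dict


class McpGate:
    """Hypothetical tracker that suppresses prompts for degraded MCP servers."""

    def __init__(self) -> None:
        self._closed: Dict[str, str] = {}  # server -> tool that hit the closed transport

    def record_transport_closed(self, server: str, tool: str) -> Dict[str, object]:
        self._closed[server] = tool
        return {"state": "mcp_transport_closed", "server": server, "tool": tool}

    def record_recovered(self, server: str) -> None:
        self._closed.pop(server, None)

    def next_action(self, server: str, tool: str, required: bool) -> Dict[str, object]:
        """Decide whether a tool attempt should prompt or report degradation."""
        if server in self._closed:
            # Do not ask for consent again until transport recovery is proven.
            return {
                "state": "mcp_transport_closed",
                "server": server,
                "tool": tool,
                "prompt": False,
                "task_blocked": required,
            }
        return {
            "state": "tool_permission_required",
            "server": server,
            "tool": tool,
            "prompt": True,
            "task_blocked": True,
        }
```

In the dogfooded sequence, the follow-up `notepad_read` attempt would then surface as one structured `mcp_transport_closed` blocker, with `task_blocked` telling downstream claws whether work can continue without memory.
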
**Source.** Jobdori dogfood sweep 2026-04-22 08:56 KST — grepped `src/runtime.py` and `src/query_engine.py` for any timeout/deadline/wall-clock mechanism; found none.