file: #161 — run_turn_loop has no wall-clock timeout, stalled turn blocks indefinitely

2026-06-16 06:36:50 +08:00 · 2026-04-22 08:57:38 +09:00 · 2026-04-22 08:57:38 +09:00 · 424d5aff74
commit 424d5aff74
parent f65b2b4f0e
1 changed files with 11 additions and 9 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@ -6158,22 +6158,24 @@ load_session('nonexistent')  # raises FileNotFoundError with no structured error
 **Blocker.** None.
 **Source.** Jobdori dogfood sweep 2026-04-22 08:46 KST — inspected `src/session_store.py` public API, confirmed only `save_session` + `load_session` present, no list/delete/exists surface.
 200. **Interactive MCP/tool permission prompts are invisible blockers** — **done (verified 2026-04-27):** worker boot observation now detects interactive tool permission gates such as `Allow the omx_memory MCP server to run tool "project_memory_read"?` before generic readiness/idle handling, records `tool_permission_required` status, emits a structured `ToolPermissionPrompt` payload with server/tool identity, prompt age, allow-scope capability, and prompt preview, marks readiness snapshots as blocked, and carries `tool_permission_prompt_detected` through startup timeout evidence so the classifier returns `tool_permission_required` instead of a vague stale/idle/ready outcome. Regression coverage locks both the structured prompt-gate event metadata and startup-timeout classification paths. **Original filing below.**
 Original filing (2026-04-18): the session emitted `SessionStart hook (completed)` and `UserPromptSubmit hook (completed)`, then stalled on an interactive MCP permission gate (`Allow the omx_memory MCP server to run tool "project_memory_read"?`). From the outside this looks like a ready-but-quiet lane even though the real state is `blocked waiting for permission`. **Required fix shape:** (a) detect interactive MCP/tool permission prompts as a first-class blocked state instead of generic idle; (b) emit a typed event such as `blocked.mcp_permission` / `blocked.tool_permission` with tool/server name, prompt age, and whether the gate is session-only vs always-allow capable; (c) include this gate in startup/no-evidence evidence bundles and lane status surfaces so clawhip can say "blocked at MCP permission prompt" without pane scraping; (d) add a regression proving a prompt-gated session does not get misclassified as stale/idle/ready. **Why this matters:** prompt acceptance and startup telemetry are still incomplete if an interactive MCP gate can eat the first real action after hooks report success. Source: live dogfood session `clawcode-human` on 2026-04-18.
-201. **`extract --model-payload` is not inspectable enough for deterministic dogfood: forced mode selection missing, and hybrid/no-snippet cases are opaque** — dogfooded 2026-04-19 from `dogfood-1776184671` against three real-repo files. `node dist/cli/index.js extract <file> --model-payload` succeeded and auto-selected `raw`, `raw`, and `hybrid`, but there is currently no CLI surface to force `raw` / `compressed` / `hybrid` for A/B comparison: `--mode raw` and `--mode compressed` both fail immediately with `Error: Unexpected extract argument: --mode`. That turns payload-shaping validation into guesswork because operators cannot ask the extractor to render the same file through each mode and compare the exact output. The opacity is worse in the observed hybrid case: the Formbricks checkbox file produced a hybrid payload with no snippets, leaving no visible explanation for why the extractor chose hybrid, what evidence it kept vs dropped, or whether the result is correct vs a silent fallback. **Required fix shape:** (a) add an explicit debug/inspection flag that forces extraction mode (`--mode raw|compressed|hybrid` or equivalent) without changing default auto-selection; (b) print/report the chosen mode and the decision reason in a machine-readable field when `--model-payload` is used; (c) when hybrid emits zero snippets, surface an explicit reason/count summary instead of making "no snippets" indistinguishable from silent loss; (d) add regression coverage on at least one real-world hybrid fixture so mode choice and snippet accounting stay stable. **Why this matters:** direct claw-code dogfood needs deterministic payload comparison to debug startup/context quality; without forced-mode inspection and snippet accounting, operators can see the outcome but not the extraction decision that produced it. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
+## Pinpoint #161. `run_turn_loop` has no wall-clock timeout — a stalled turn blocks indefinitely
-202. **`extract --model-payload` emits `filePath` values that can walk outside the current repo root for external targets** — dogfooded 2026-04-19 from `dogfood-1776184671` while extracting files from sibling repos under `/home/bellman/Workspace/fooks-test-repos/...` with cwd anchored at the claw-code repo. In all three successful payloads (`raw`, `raw`, `hybrid`), the reported `filePath` became a relative path like `../../fooks-test-repos/...` that escapes the current repo root. Technically the path is still correct, but operationally it is a clawability gap: downstream consumers cannot tell whether this means "user intentionally extracted an external file", "path normalization leaked out of scope", or "the payload now references content outside the trusted working tree." That ambiguity is especially bad for model payloads because the `filePath` field looks like grounded provenance while actually encoding a cross-root escape. **Required fix shape:** (a) define a stable provenance contract for extracted targets outside cwd/repo root — for example an explicit `pathScope` / `targetRoot` field or an absolute-vs-relative policy instead of silently emitting `../..` escapes; (b) if relative paths are retained, add a machine-readable flag that the target is outside the current workspace/root; (c) document and test the normalization rule for sibling-repo extraction so downstream tooling does not mistake cross-root references for in-repo files; (d) add regression coverage for one in-repo fixture and one external-target fixture. **Why this matters:** model payload provenance should reduce ambiguity, not create a silent scope escape that later consumers have to reverse-engineer. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
+**Gap.** `PortRuntime.run_turn_loop` (`src/runtime.py:154`) bounds execution only by `max_turns` (a turn count). There is no wall-clock deadline or per-turn timeout. If a single `engine.submit_message` call stalls (e.g., waiting on a slow or hung external provider, a network timeout, or an infinite LLM stream), the entire turn loop hangs with no structured signal, no cancellation path, and no timeout error returned to the caller.
-203. **Successful dogfood runs can still end in a misleading TUI/pane failure banner (`skills/list failed in TUI`, `can't find pane`)** — dogfooded 2026-04-19 from `dogfood-1776184671`. The session completed real work and produced a coherent result summary, but immediately afterward the surface emitted `Error: skills/list failed in TUI` and `can't find pane: %4766`. That creates a truth-ordering bug: the user just watched a successful run, then the final visible state looks like a transport/UI failure with no indication whether the underlying task failed, the pane disappeared after completion, or an unrelated post-run TUI refresh crashed. **Required fix shape:** (a) separate task result state from post-run TUI/skills refresh failures so a completed run cannot be visually overwritten by a secondary pane-lookup error; (b) classify missing-pane-after-completion as a typed transport/UI degradation with phase context (`post_result_refresh`, `skills_list_refresh`, etc.) instead of a generic terminal error; (c) preserve and surface the last successful task outcome even if the TUI follow-up step fails; (d) add regression coverage for the path where a pane disappears after result rendering so the session is reported as `completed_with_ui_warning` rather than plain failure. **Why this matters:** claw-code needs the final visible truth to match the actual execution truth; otherwise successful dogfood looks flaky and operators cannot tell whether to trust the result they just got. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
+**Repro (conceptual).** Wrap `engine.submit_message` with an artificial `time.sleep(9999)` and call `run_turn_loop` — it blocks forever. There is no `asyncio.wait_for`, `signal.alarm`, `concurrent.futures.TimeoutError`, or equivalent in the call path. `grep -n 'timeout\|deadline\|elapsed\|wall' src/runtime.py src/query_engine.py` returns zero results.
-204. **Interactive work can start with updater/setup churn before the actual user task, blurring startup truth and first-action latency** — dogfooded 2026-04-19 from `clawcode-human`. Launching `omx` inside the claw-code worktree did not begin with the requested ROADMAP task; it first diverted through an update prompt (`Update available: v0.12.6 → v0.13.0. Update now? [Y/n]`), global install, full setup refresh, config rewrite/backups, notification/HUD setup, and a `Restart to use new code` notice before returning to the actual prompt. None of that was the operator’s requested work, but it consumed the critical startup window and mixed setup chatter with task-relevant execution. This creates a clawability gap: downstream observers cannot cleanly distinguish `startup succeeded and work began` from `startup mutated the environment and maybe changed the toolchain before work began`, and first-action latency gets polluted by maintenance side effects. **Required fix shape:** (a) make updater/setup detours a first-class startup phase with explicit classification (`startup.update_gate`, `startup.setup_refresh`) instead of letting them masquerade as normal task progress; (b) allow noninteractive or automation-oriented launches to suppress or defer update/setup churn until after the first user task/result boundary; (c) preserve a clean timestamped boundary between maintenance work and task work in lane events/status surfaces; (d) add regression coverage proving a prompt can start without forced updater/setup interposition when policy says "do work now." **Why this matters:** startup truth should reflect the user’s requested work, not hide it behind self-mutation and config churn that change latency, logs, and reproducibility before the first real action. Source: live dogfood session `clawcode-human` on 2026-04-19.
+**Impact.** A claw calling `run_turn_loop` in a CI pipeline or orchestration harness has no reliable way to enforce a deadline. The loop will hang until the OS kills the process or a human intervenes. The caller cannot distinguish "still running" from "hung" without an external watchdog.
-205. **Direct CLI dogfood is not self-starting when build artifacts are absent (`dist/cli/index.js` missing)** — dogfooded 2026-04-19 from `dogfood-1776184671`. The intended direct check was to run `node dist/cli/index.js extract ...`, but the first attempt hit a missing built artifact and the lane had to detour through `npm ci && npm run build` before any product behavior could be exercised. That means a "run the CLI directly in a fresh worktree" path is not actually one-step dogfoodable: the operator has to know the build prerequisite, spend time satisfying it, and then mentally separate build-system failures from product-surface failures. **Required fix shape:** (a) provide a supported direct-run entrypoint that either works from source without prebuilt `dist/` artifacts or emits a product-owned guidance error that names the exact one-shot bootstrap command; (b) surface build-artifact-missing as a typed startup/dependency prerequisite state rather than a raw module/file failure; (c) document and test the fresh-worktree direct-dogfood path so `extract --help` / `extract ... --model-payload` can be exercised without archaeology; (d) if build-on-demand is the intended contract, make it explicit and deterministic instead of requiring the operator to guess `npm ci && npm run build`. **Why this matters:** direct dogfood should fail on product behavior, not on hidden local build prerequisites that blur whether the tool is broken or merely unprepared. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
+**Fix shape (~15 lines).**
 1. Add an optional `timeout_seconds: float | None = None` parameter to `run_turn_loop`.
 2. Use `concurrent.futures.ThreadPoolExecutor` + `Future.result(timeout=...)` (or `asyncio.wait_for` if the engine becomes async) to wrap each `submit_message` call.
 3. On timeout, append a sentinel `TurnResult` with `stop_reason='timeout'` and break the loop.
 4. Document the timeout contract: total wall-clock budget across all turns, not per-turn.
-206. **`extract --help` is not a safe/local help surface: after bootstrap it can still crash into a Node stack instead of rendering usage** — dogfooded 2026-04-19 from `dogfood-1776184671`. Even after repairing the missing-build-artifact prerequisite with `npm ci && npm run build`, the next expected low-risk probe `node dist/cli/index.js extract --help` did not cleanly print command help; it dropped into a Node failure at `dist/cli/index.js:52` and emitted a stack trace under `Node.js v25.1.0`. That means the help path itself is not trustworthy as a preflight surface: operators cannot rely on `--help` to discover flags or confirm command shape before doing real work, and they have to treat a basic introspection command like a potentially crashing code path. **Required fix shape:** (a) make `extract --help` and sibling help surfaces intercept locally before any heavier runtime path that can throw; (b) if a subcommand cannot render help because build/runtime prerequisites are missing, return a product-owned guidance error instead of a raw Node stack; (c) add regression coverage that `extract --help` succeeds in both a prepared worktree and a minimally bootstrapped one; (d) preserve the contract that help/usage discovery is the safest command family, not another execution path that can explode. **Why this matters:** help commands are supposed to reduce uncertainty; if they crash, dogfooders lose the cleanest way to learn the surface and every later failure gets harder to classify. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
+**Acceptance.** `run_turn_loop(prompt, timeout_seconds=10)` raises `TimeoutError` (or returns a `TurnResult` with `stop_reason='timeout'`) within 10 seconds even if the underlying LLM call stalls indefinitely. `timeout_seconds=None` (default) preserves existing behaviour.
-207. **Build/setup failures are being misclassified as generic missing-path shell errors in post-tool feedback** — dogfooded 2026-04-19 from `dogfood-1776184671`. When the lane attempted `node dist/cli/index.js extract --help` with no built artifact, the `PostToolUse` hook summarized it as ``Bash reported `command not found`, `permission denied`, or a missing file/path``, and later `npm run build` failed with actual TypeScript diagnostics (`TS2307: Cannot find module 'typescript'`, plus additional compile errors). Those are distinct failure classes — missing built artifact, missing dependency, and compile/typecheck red — but the feedback surface collapses them into the same mushy shell-triage bucket. That makes recovery slower because the operator has to reread raw pane output to learn whether the right next move is `npm ci`, fixing package deps, fixing TS errors, or checking file paths. **Required fix shape:** (a) classify post-tool failures with narrower machine-readable buckets such as `artifact_missing`, `dependency_missing`, `compile_error`, and reserve `missing_path` / `command_not_found` for the literal cases; (b) include the strongest observed diagnostic snippet (for example `TS2307 typescript missing`) in the structured feedback instead of only the broad shell rubric; (c) add regression coverage proving TypeScript/compiler failures are not surfaced as generic missing-path errors; (d) thread that typed classification into lane summaries so downstream claws can recommend the right recovery without pane archaeology. **Why this matters:** clawability depends on the fix suggestion matching the real failure class; broad shell-error mush turns easy recoveries into manual forensic work. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.
+**Blocker.** None.
 208. **The JavaScript `extract` dogfood path has no dedicated preflight/doctor surface for its own prerequisites** — dogfooded 2026-04-19 from `dogfood-1776184671`. The repo already has strong Rust-side `claw doctor` / preflight coverage, but the direct JS CLI path I was actually dogfooding (`node dist/cli/index.js extract ...`) gave no equivalent early warning about its own prerequisites: missing `dist/cli/index.js`, missing `node_modules/typescript`, and the difference between "needs bootstrap" vs "real compile error" all had to be discovered by failing real commands in sequence. That means the lowest-friction way to validate the JS extract surface is still failure-driven archaeology rather than one explicit readiness check. **Required fix shape:** (a) add a lightweight JS-side preflight/doctor command or bootstrap check for the extract CLI path that reports artifact presence, dependency readiness, and build status before execution; (b) make that check machine-readable so lanes can say `js_extract_prereq_blocked` (or equivalent) instead of learning via stack traces; (c) document the direct dogfood path so operators know whether the supported sequence is `doctor -> help -> extract` or something else; (d) add regression coverage for a fresh worktree, a deps-missing worktree, and a ready worktree. **Why this matters:** preflight should collapse obvious prerequisite failures into one cheap truth surface instead of forcing dogfooders to burn turns discovering them one crash at a time. Source: live dogfood session `dogfood-1776184671` on 2026-04-19.