ROADMAP #112: concurrent /compact and /clear race with raw 'No such file or directory (os error 2)' on session file

Dogfooded 2026-04-18 on main HEAD a049bd2 from /tmp/cdII.

5 concurrent /compact on same session → 4 succeed, 1 races with
raw ENOENT. Same pattern with concurrent /clear --confirm.

Trace:
  session.rs:204-212 save_to_path:
    rotate_session_file_if_needed(path)?
    write_atomic(path, &snapshot)?
    cleanup_rotated_logs(path)?
  Three steps. No lock around sequence.

  session.rs:1085-1094 rotate_session_file_if_needed:
    metadata(path) → rename(path, rot_path)
  Classic TOCTOU. Race window between check and rename.

  session.rs:1063-1071 write_atomic:
    writes .tmp-{ts}-{counter}, renames to path
  Atomic per rename, not per multi-step sequence.

  cleanup_rotated_logs deletes .rot-{ts} files older than 3 most
  recent. Can race against another process reading that rot file.

  No flock, no advisory lock file, no fcntl.
  grep 'flock|FileLock|advisory' session.rs → zero matches.

  SessionError::Io Display forwards os::Error Display:
    'No such file or directory (os error 2)'
  No domain translation to 'session file vanished during save'
  or 'concurrent modification detected, retry safe'.

Fix shape (~90 lines + test):
- advisory lock: .claw/sessions/<bucket>/<session>.jsonl.lock
  exclusive flock for duration of save_to_path (fs2 crate)
- domain error variants:
    SessionError::ConcurrentModification {path, operation}
    SessionError::SessionFileVanished {path}
- error-to-JSON mapping:
    {error_kind: 'concurrent_modification', retry_safe: true}
- retry-policy hints on idempotent ops (/compact, /clear)
- regression test: spawn 10 concurrent /compact, assert all
  success OR structured ConcurrentModification (no raw os_error)

Affected operations:
- /compact (session save_to_path after compaction)
- /clear --confirm (save_to_path after new session)
- /export (may hit rotation boundary)
- Turn-persist (append_persisted_message can race rotation)

Not inherently a bug if sessions are single-writer, but
workspace-bucket scoping at session_control.rs:31-32 assumes
one claw per workspace. Parallel ulw lanes, CI matrix runners,
orchestration loops all violate that assumption.

Joins truth-audit (error lies by omission about what happened).
New micro-cluster 'session handling' with #93. Adjacent to
#104 on session-file-handling axis.

Natural bundle: #93 + #112 (session semantic correctness +
concurrency error clarity).

Filed in response to Clawhip pinpoint nudge 1494880177099116586
in #clawcode-building-in-public.
This commit is contained in:
YeonGyu-Kim 2026-04-18 11:03:12 +09:00
parent a049bd29b1
commit 8b25daf915

View File

@ -3255,3 +3255,83 @@ ear], /color [scheme], /effort [low|medium|high], /fast, /summary, /tag [label],
**Blocker.** None. The choice (implement vs remove) is the only architectural decision. Runtime has enough scaffolding that implementing is ~60 lines. Removing is ~3 lines.
**Source.** Jobdori dogfood 2026-04-18 against `/tmp/cdHH` on main HEAD `b2366d1` in response to Clawhip pinpoint nudge at `1494872623782301817`. Joins **silent-flag / documented-but-unenforced** (#96#101, #104, #108) on the command-dispatch-semantics axis — eighth instance of "documented behavior differs from actual." Joins **unplumbed-subsystem / CLI-advertised-but-unreachable** (#78, #96, #100, #102, #103, #107, #109) as the eighth surface where the spec advertises a capability the implementation doesn't deliver. Joins **truth-audit / diagnostic-integrity** (#80#87, #89, #100, #102, #103, #105, #107, #109, #110) — `/providers` silently returns doctor output under the wrong kind label; help lies about capability. Natural bundle: **#78 + #96 + #111** — three-way "declared but not implemented as declared" triangle (CLI route never constructed + help resume-safe leaks stubs + slash command dispatches to wrong handler). Also **#96 + #108 + #111** — full `--help`/dispatch surface hygiene quartet covering help-filter-leaks + subcommand typo fallthrough + slash-command mis-dispatch. Session tally: ROADMAP #111.
112. **Concurrent claw invocations that touch the same session file (e.g. two `/clear --confirm` or two `/compact` calls on the same session-id race) fail intermittently with a raw OS errno — `{"type":"error","error":"No such file or directory (os error 2)"}` — instead of a domain-specific concurrent-modification error. There is no file locking, no read-modify-write protection, no rename-race guard. The loser of the race gets ENOENT because the winner rotated, renamed, or deleted the session file between the loser's `fs::read_to_string` and its own `fs::write`. A claw orchestrating multiple lanes that happen to share a session id (because the operator reuses one, or because a CI matrix is re-running with the same state) gets unpredictable partial failures with un-actionable raw-io errors** — dogfooded 2026-04-18 on main HEAD `a049bd2` from `/tmp/cdII`. Five concurrent `/compact` calls on the same session: 4 succeed, 1 fails with `os error 2`. Two concurrent `/clear --confirm` calls: same pattern.
**Concrete repro.**
```
$ cd /tmp/cdII && git init -q .
$ # ... set up a minimal session at .claw/sessions/<bucket>/s.jsonl ...
# Race 5 concurrent /compact calls:
$ for i in 1 2 3 4 5; do
> claw --resume s --output-format json /compact >/tmp/c$i.log 2>&1 &
> done
$ wait
$ for i in 1 2 3 4 5; do echo "$i: $(head -c 80 /tmp/c$i.log)"; done
1: { ... successful compact
2: {"command":"/compact","error":"No such file or directory (os error 2)","type":"error"}
3: { ... successful compact
4: { ... successful compact
5: { ... successful compact
# 4 succeed, 1 races and gets raw ENOENT
# Same with /clear:
$ claw --resume s --output-format json /clear --confirm >/tmp/r1.log 2>&1 &
$ claw --resume s --output-format json /clear --confirm >/tmp/r2.log 2>&1 &
$ wait; cat /tmp/r1.log /tmp/r2.log
{"kind":"clear","backup":"...",...}
{"command":"/clear --confirm","error":"No such file or directory (os error 2)","type":"error"}
```
**Trace path.**
- `rust/crates/runtime/src/session.rs:204-212``save_to_path`:
```rust
pub fn save_to_path(&self, path: impl AsRef<Path>) -> Result<(), SessionError> {
let path = path.as_ref();
let snapshot = self.render_jsonl_snapshot()?;
rotate_session_file_if_needed(path)?; // may rename path → path.rot-{ts}
write_atomic(path, &snapshot)?; // writes tmp, renames tmp → path
cleanup_rotated_logs(path)?; // deletes older rot files
Ok(())
}
```
Three steps: rotate (rename) + write_atomic (tmp + rename) + cleanup (deletes). No lock around the sequence.
- `rust/crates/runtime/src/session.rs:1063-1071``write_atomic` creates `temp_path` = `{path}.tmp-{ts}-{counter}`, writes, renames to `path`. Atomic *per rename* but not *per multi-step sequence*. A concurrent `rotate_session_file_if_needed` between another process's read and write races here.
- `rust/crates/runtime/src/session.rs:1085-1094``rotate_session_file_if_needed`:
```rust
let Ok(metadata) = fs::metadata(path) else {
return Ok(());
};
if metadata.len() < ROTATE_AFTER_BYTES {
return Ok(());
}
let rotated_path = rotated_log_path(path);
fs::rename(path, rotated_path)?; // race window: another process read-holding `path`
Ok(())
```
Classic TOCTOU: `metadata()` then `rename()` with no guard.
- `rust/crates/runtime/src/session.rs:1105-1140``cleanup_rotated_logs` deletes `.rot-{ts}` files older than the 3 most recent. Another process reading a rot file can race against the deletion.
- `rust/crates/runtime/src/session.rs` — no `fcntl`, no `flock`, no advisory lock file. `grep -rn 'flock\|FileLock\|advisory' rust/crates/runtime/src/session.rs` returns zero matches.
- `rust/crates/rusty-claude-cli/src/main.rs` error formatter (`main.rs:2222-2232` / equivalent) catches the SessionError and formats via `to_string()`, which for `SessionError::Io(...)` just emits the underlying io::Error `Display` — which is `"No such file or directory (os error 2)"`. No domain translation to "session file was concurrently modified; retry" or similar.
**Why this is specifically a clawability gap.**
1. *Un-actionable error.* `"No such file or directory (os error 2)"` tells the claw nothing about what to do. A claw's error handler cannot distinguish "session file doesn't exist" (pre-session) from "session file race-disappeared" (concurrent write) from "session file was deleted out-of-band" (housekeeping) — all three surface with the same ENOENT message.
2. *Not inherently a bug if sessions are single-writer* — but the per-workspace-bucket scoping at `session_control.rs:31-32` assumes one claw per workspace. The moment two claws spawn in the same workspace (e.g., ulw-loop with parallel lanes, CI runners, multi-turn orchestration), they race.
3. *Session rotation amplifies the race.* `ROTATE_AFTER_BYTES = 256 * 1024`. A session growing past 256KB triggers rotation on next `save_to_path`. If two processes call `save_to_path` around the rotation boundary, one renames the file, the other's subsequent read fails.
4. *No advisory lock file.* Unix-standard `.claw/sessions/<bucket>/s.jsonl.lock` (exclusive flock) would serialize save_to_path operations with minimal overhead. The machinery exists in the ecosystem; claw-code doesn't use it.
5. *Error-to-diagnostic mapping incomplete.* `SessionError::Io(...)` has a Display impl that just forwards the os::Error message. A domain-aware translation layer would convert common concurrent-access failures into actionable "retry-safe" / "session-modified-externally" categories.
6. *Joins truth-audit cluster on error-quality axis.* The session file WAS modified (it was deleted-then-recreated by process 1), but the error says "No such file or directory" — not "the file you were trying to save was deleted or rotated during your save." The error lies by omission about what actually happened.
**Fix shape — advisory locking + domain-specific error classes + retry guidance.**
1. *Add an advisory lock file.* `.claw/sessions/<bucket>/<session>.jsonl.lock`. Take an exclusive `flock` (via `fs2` crate or libc `fcntl`) for the duration of `save_to_path`. ~30 lines. Covers rotation + write + cleanup as an atomic sequence from other claw-code processes' perspective.
2. *Introduce domain-specific error variants.* `SessionError::ConcurrentModification { path, operation }` when a `fs::rename` source path vanishes between metadata check and rename. `SessionError::SessionFileVanished { path }` when `fs::read_to_string` returns ENOENT after a successful session-existence check. ~25 lines.
3. *Map errors at the JSON envelope.* When the CLI catches `SessionError::ConcurrentModification`, emit `{"type":"error","error_kind":"concurrent_modification","message":"..","retry_safe":true}` instead of a raw io-error string. ~20 lines.
4. *Retry policy for idempotent operations.* `/compact` and `/clear` that fail with `ConcurrentModification` are safe to retry — emit a structured retry hint. `/export` that fails at write time is not safe to retry without clobbering — explicit `retry_safe: false`. ~15 lines.
5. *Regression test.* Spawn 10 concurrent `/compact` processes on a single session file. Assert: all succeed, OR any failures are structured `ConcurrentModification` errors (no raw `os error 2`). Use `tempfile` + `rayon` or tokio join_all. ~50 lines of test harness.
**Acceptance.** `for i in 1..5; do claw --resume s /compact & done; wait` produces either all successes or structured `{"error_kind":"concurrent_modification","retry_safe":true,...}` errors — never a raw `"No such file or directory (os error 2)"`. Advisory lock serializes save_to_path. Domain errors are actionable by claw orchestrators.
**Blocker.** None. Advisory locking is a well-worn pattern; `fs2` crate is already in the Rust ecosystem. Domain error mapping is additive. The architectural decision is whether to serialize at the save boundary (simpler, some perf cost) or implement a full MVCC-style session store (far more work, out of scope).
**Source.** Jobdori dogfood 2026-04-18 against `/tmp/cdII` on main HEAD `a049bd2` in response to Clawhip pinpoint nudge at `1494880177099116586`. Joins **truth-audit / diagnostic-integrity** (#80#87, #89, #100, #102, #103, #105, #107, #109, #110) — the error message lies about what actually happened (file vanished via concurrent rename, not intrinsic absence). Joins **Session handling** as a new micro-cluster (only existing member was #93 — reference-resolution semantics). Natural bundle: **#93 + #112** — session semantic correctness (reference resolution + concurrent-modification error clarity). Adjacent to **#104** (two-paths-diverge export) on the session-file-handling axis: #104 says the two export paths disagree on filename; #112 says concurrent session-file writers race with no advisory lock. Together session-handling has filename-semantic + concurrency gaps that the test suite should cover. Session tally: ROADMAP #112.