Troubleshooting

Upstream stream-init failures (`500 empty_stream`)

Symptom: claw-code exits with `500 empty_stream: upstream stream closed before first payload` or a similar upstream stream-init error.

Root cause: The upstream provider (Anthropic, OpenAI, or another) closed the HTTP connection before sending the first response payload. Common causes:

Transient network issue between claw-code and provider
Provider overload / temporary service degradation
Authentication token expired or invalid
Rate limit exceeded (even if not visible in response headers)

Mitigation:

Check credentials: Verify that `claw whoami` shows the expected provider and account. Re-authenticate if the token has expired.
Wait and retry: Transient provider issues usually resolve within 30-60 seconds. Wait a minute, then retry the same command.
Check provider status: Visit the provider's status page (e.g., status.anthropic.com, status.openai.com).
Reduce request size: If the prompt is large, try a smaller request first to distinguish stream-init failures from context-window failures.
Check network: Ensure your network connection is stable. If you are behind a proxy, verify that it allows streaming responses.
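
The "wait and retry" step can be scripted. The following is a minimal sketch, not part of claw-code: `retry_command` is a hypothetical helper, and it assumes the failing command exits with a non-zero status (it retries on any failure, not only on `500 empty_stream`). The 60-second default wait follows the guidance above.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run_process(argv: Sequence[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def retry_command(
    argv: Sequence[str],
    attempts: int = 3,
    wait_seconds: float = 60.0,
    run: Callable[[Sequence[str]], int] = _run_process,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run argv until it exits 0 or attempts run out; return the last exit code.

    Assumption: a non-zero exit code means the request failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = 1
    for attempt in range(1, attempts + 1):
        code = run(argv)
        if code == 0:
            return code
        if attempt < attempts:
            # Give a transient provider issue time to clear before retrying.
            sleep(wait_seconds)
    return code
```

Pass the exact argument list of the command that failed. The `run` and `sleep` parameters exist only so the helper can be exercised without launching a process or actually waiting.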

When to escalate:

If stream-init failures persist for more than 10 minutes across multiple requests
If `claw whoami` fails to authenticate
If failures continue even though no provider status page reports degradation

Related pinpoint: #290 (typed stream-init failure envelope — future improvement for better diagnostics)

Context-window-blocked errors

Symptom: claw-code exits with `context_window_blocked` or a similar provider error when resuming a long session, or when sending a request with a very large prompt plus accumulated history.

Root cause: The session size exceeded the provider's context window before claw-code's auto-compaction could reduce it. Auto-compaction is currently reactive-after-success: it fires only after a successful provider response. If the request itself is oversized, it fails before any successful response arrives, so compaction never runs.
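
To make the failure mode concrete, here is a toy model of the ordering problem. It is an illustration only, not claw-code's implementation: `Session`, `send_reactive`, `send_preflight`, and the `compact` callback are hypothetical names, and response tokens are ignored for simplicity. `send_preflight` shows the kind of check that #287 asks for.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Session:
    tokens: int  # estimated tokens in the accumulated history


def send_reactive(session: Session, prompt_tokens: int, context_window: int,
                  threshold: int, compact: Callable[[int], int]) -> str:
    """Compaction runs only after a successful response (the current behavior)."""
    request_tokens = session.tokens + prompt_tokens
    if request_tokens > context_window:
        # The request fails here, so the compaction step below is never reached.
        return "context_window_blocked"
    session.tokens = request_tokens
    if session.tokens > threshold:
        session.tokens = compact(session.tokens)
    return "ok"


def send_preflight(session: Session, prompt_tokens: int, context_window: int,
                   threshold: int, compact: Callable[[int], int]) -> str:
    """Hypothetical alternative: compact before sending if the request would not fit."""
    if session.tokens + prompt_tokens > context_window:
        session.tokens = compact(session.tokens)
    return send_reactive(session, prompt_tokens, context_window, threshold, compact)
```

In this model, a session that is already too large stays blocked on every retry under `send_reactive`, which is why the mitigations below all shrink the session or the prompt before the next request.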

Mitigation:

Resume with manual compact: Run `claw resume <session> --compact-before` (if available); otherwise, compact manually with the `/compact` slash command before retrying.
Start a fresh session: Sometimes the cleanest path; the existing session state is preserved in `~/.claw/sessions/<id>/`.
Reduce prompt size: In interactive use, send shorter prompts and truncate file contents before pasting.
Adjust threshold: Lower the `CLAW_AUTO_COMPACT_INPUT_TOKENS_THRESHOLD` environment variable (the default varies by provider) so that auto-compaction triggers sooner.

Related pinpoints: #287 (auto-compaction reactive-not-preflight, CRITICAL), #283 (threshold env-only no settings.json key), #288 (failure envelope omits diagnostics)

Manual `/compact` reports "session below compaction threshold"

Symptom: You run `/compact` to manually compact a session, but it reports `session below compaction threshold` even though the session feels large.

Root cause: The "below threshold" message is currently a catch-all for multiple skip reasons:

Too few compactable messages
Already compacted (only summary remains)
Compactable tokens below threshold
Tool-use/tool-result boundary preserved
Live vs resume threshold divergence
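
As an illustration of what distinguishing these reasons could look like, the sketch below gives each one its own enum member and checks the three that can be judged from simple counts. Everything here is hypothetical: `CompactSkipReason`, `likely_skip_reasons`, and the `min_messages` and `token_threshold` parameters are not claw-code names or values, and the last two reasons cannot be inferred from counts alone.

```python
from enum import Enum
from typing import List


class CompactSkipReason(Enum):
    TOO_FEW_MESSAGES = "too few compactable messages"
    ALREADY_COMPACTED = "already compacted (only summary remains)"
    BELOW_TOKEN_THRESHOLD = "compactable tokens below threshold"
    TOOL_BOUNDARY_PRESERVED = "tool-use/tool-result boundary preserved"
    THRESHOLD_DIVERGENCE = "live vs resume threshold divergence"


def likely_skip_reasons(message_count: int, has_only_summary: bool,
                        compactable_tokens: int, min_messages: int,
                        token_threshold: int) -> List[CompactSkipReason]:
    """Return the count-based reasons that could explain a skipped compaction."""
    reasons = []
    if has_only_summary:
        reasons.append(CompactSkipReason.ALREADY_COMPACTED)
    if message_count < min_messages:
        reasons.append(CompactSkipReason.TOO_FEW_MESSAGES)
    if compactable_tokens < token_threshold:
        reasons.append(CompactSkipReason.BELOW_TOKEN_THRESHOLD)
    return reasons
```

An empty result from a check like this would point toward one of the two remaining reasons, which need more than message and token counts to confirm.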

Mitigation:

Check session state: Run `claw session info <id>` to inspect the message count and total tokens.
Force compaction: No `--force` flag currently exists; track #289 for typed skip-reason discriminants.
Workaround: Continue the session and let auto-compaction fire after the next successful provider response (when the reactive-after-success path is available).

Related pinpoint: #289 (manual /compact skip-reason flattened, lacks typed discriminants)

Parallel agent stuck in "running" state

Symptom: A parallel agent lane shows `status: running` indefinitely, never transitioning to `completed` or `error`. Downstream coordination treats it as still working.

Root cause: `Agent::execute_agent` writes a `running` manifest before spawning a detached thread with `std::thread::spawn`, and the returned `JoinHandle` is dropped. If the process crashes during agent execution, nothing updates the manifest, so it stays `running` forever (a zombie state). No heartbeat or stale-manifest reaper exists.

Mitigation:

Manual cleanup: Inspect `~/.claw/agents/<lane>/` and remove stale `manifest.json` files that were last modified more than N minutes ago.
Restart agent lane: `claw agent restart <lane>`
Kill orphaned processes: Run `pgrep claw` to find lingering processes, then terminate any that are orphaned.
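
The manual cleanup step can be narrowed down with a small script that lists candidate manifests by age. This is a sketch under assumptions: it relies only on the `~/.claw/agents/<lane>/manifest.json` layout mentioned above, does not parse the manifest (its schema is not documented here), and only reports files rather than deleting them. `find_stale_manifests` is a hypothetical helper name.

```python
import time
from pathlib import Path
from typing import List, Optional


def find_stale_manifests(agents_root: Path, max_age_minutes: float,
                         now: Optional[float] = None) -> List[Path]:
    """List <lane>/manifest.json files last modified more than max_age_minutes ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    stale = []
    for manifest in sorted(agents_root.glob("*/manifest.json")):
        # A manifest untouched since before the cutoff is only a candidate;
        # confirm its agent is really dead before removing the file.
        if manifest.stat().st_mtime < cutoff:
            stale.append(manifest)
    return stale
```

For example, `find_stale_manifests(Path.home() / ".claw" / "agents", max_age_minutes=30)` returns the manifests untouched for half an hour. Check each one by hand, since a long-running but healthy agent may also leave its manifest unmodified.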

Related pinpoint: #286 (Parallel Agent detached-thread no-heartbeat no-reaper)

Sustained upstream provider failures (`500 empty_stream` repeating)

Symptom: The same upstream provider error (e.g., `500 empty_stream: upstream stream closed before first payload`) repeats five or more times within 60 minutes. Each retry blindly hits the same dead upstream.

Root cause: claw-code does not detect repeat-failure patterns. There is no circuit breaker and no automatic provider fallback, even when multiple providers are configured. Each retry targets the same provider and endpoint regardless of recent failure history.

Mitigation:

Manual circuit-breaker: Wait 5-10 minutes after repeated failures before retrying
Switch provider: If you have multiple providers configured (`ANTHROPIC_API_KEY` + `OPENAI_API_KEY`), restart with a different model prefix (e.g., `gpt-4` instead of `claude-`).
Check provider status pages: status.anthropic.com, status.openai.com
Verify upstream endpoint: If using a proxy (CCAPI, custom OpenAI-compatible endpoint), check proxy logs
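
For anyone wrapping claw-code in their own scripts, the manual circuit-breaker can be approximated as follows. This is a sketch, not a claw-code feature: `FailureWindow` is a hypothetical class, and its defaults (5 failures within 60 minutes, then a 10-minute cooldown) are taken from the symptom and mitigation described in this section.

```python
import time
from collections import deque
from typing import Deque, Optional


class FailureWindow:
    """Track recent failures and say how long to wait before the next retry."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 3600.0,
                 cooldown_seconds: float = 600.0) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self._failures: Deque[float] = deque()

    def record_failure(self, now: Optional[float] = None) -> None:
        self._failures.append(time.time() if now is None else now)

    def seconds_until_retry(self, now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        # Forget failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) < self.max_failures:
            return 0.0
        # Too many recent failures: wait out the cooldown after the latest one.
        return max(0.0, self._failures[-1] + self.cooldown_seconds - now)
```

A wrapper would call `record_failure()` after each failed request and skip retrying while `seconds_until_retry()` is greater than zero. Timestamps can be passed explicitly, which keeps the class easy to test.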

Related pinpoints: #291 (no repeat-failure detection / circuit-breaker), #285 (declarative providers config for fallback), #290 (stream-init failure envelope)

Other common failures

[placeholder for future sections: tool-use failures, session corruption]
