Unify slot ownership: processKey() owns the slot until task.concurrencyKey is set.
Previously, startTask() released the slot on errors (session.create catch,
createResult.error), but processKey() catch block didn't release, causing slot
leaks when errors occurred between acquire() and task.concurrencyKey assignment.
Changes:
- Remove all pre-transfer release() calls in startTask()
- Add conditional release in processKey() catch: only if task.concurrencyKey not set
- Add validation for createResult.data?.id to catch malformed API responses
This fixes 'Task failed to start within timeout' errors caused by exhausted
concurrency slots that were never released.
- Change session title from 'Task: {desc}' to '{desc} (@{agent} subagent)'
- Move session_id to structured <task_metadata> block for better parsing
- Add category tracking to BackgroundTask type and LaunchInput
- Add tests for new title format and metadata block
- Add stubNotifyParentSession implementation to stub manager's notifyParentSession method
- Add stubNotifyParentSession calls to checkAndInterruptStaleTasks tests
- Add messages mock to client mocks for completeness
- Fix timer-based tests by using real timers (fakeTimers.restore) with wait()
- Increase timeout for tests that need real time delays
Track setTimeout timers in notifyParentSession using a completionTimers Map.
Clear all timers on shutdown() and when tasks are deleted via session.deleted.
This prevents the BackgroundManager instance from being held in memory by
uncancelled timer callbacks.
Fixes#1043
Co-authored-by: sisyphus-dev-ai <sisyphus-dev-ai@users.noreply.github.com>
* fix: prevent zombie processes with proper process lifecycle management
- Await proc.exited for fire-and-forget spawns in tmux-utils.ts
- Remove competing process.exit() calls from LSP client and skill-mcp-manager
signal handlers to let background-agent manager coordinate final exit
- Await process exit after kill() in interactive-bash timeout handler
- Await process exit after kill() in LSP client stop() method
These changes ensure spawned processes are properly reaped and prevent
orphan/zombie processes when running with tmux integration.
* fix: address Copilot review comments on process cleanup
- LSP cleanup: use async/sync split with Promise.allSettled for proper subprocess cleanup
- LSP stop(): make idempotent by nulling proc before await to prevent race conditions
- Interactive-bash timeout: use .then()/.catch() pattern instead of async callback to avoid unhandled rejections
- Skill-mcp-manager: use void+catch pattern for fire-and-forget signal handlers
* fix: address remaining Copilot review comments
- interactive-bash: reject timeout immediately, fire-and-forget zombie cleanup
- skill-mcp-manager: update comments to accurately describe signal handling strategy
* fix: address additional Copilot review comments
- LSP stop(): add 5s timeout to prevent indefinite hang on stuck processes
- tmux-utils: log warnings when pane title setting fails (both spawn/replace)
- BackgroundManager: delay process.exit() to next tick via setImmediate to allow other signal handlers to complete cleanup
* fix: address code review findings
- Increase exit delay from setImmediate to 100ms setTimeout to allow async cleanup
- Use asyncCleanup for SIGBREAK on Windows for consistency with SIGINT/SIGTERM
- Add try/catch around stderr read in spawnTmuxPane for consistency with replaceTmuxPane
* fix: address latest Copilot review comments
- LSP stop(): properly clear timeout when proc.exited wins the race
- BackgroundManager: use process.exitCode before delayed exit for cleaner shutdown
- spawnTmuxPane: remove redundant log import, reuse existing one
* fix: address latest Copilot review comments
- LSP stop(): escalate to SIGKILL on timeout, add logging
- tmux spawnTmuxPane/replaceTmuxPane: drain stderr immediately to avoid backpressure
* fix: address latest Copilot review comments
- Add .catch() to asyncCleanup() signal handlers to prevent unhandled rejections
- Await proc.exited after SIGKILL with 1s timeout to confirm termination
* fix: increase exit delay to 6s to accommodate LSP cleanup
LSP cleanup can take up to 5s (timeout) + 1s (SIGKILL wait), so the exit
delay must be at least 6s to ensure child processes are properly reaped.
* fix: exclude prompt/permission from plan agent config
plan agent should only inherit model settings from prometheus,
not the prompt or permission. This ensures plan agent uses
OpenCode's default behavior while only overriding the model.
* test(todo-continuation-enforcer): use FakeTimers for 15x faster tests
- Add custom FakeTimers implementation (~100 lines)
- Replace all real setTimeout waits with fakeTimers.advanceBy()
- Test time: 104.6s → 7.01s
* test(callback-server): fix race conditions with Promise.all and Bun.fetch
- Use Bun.fetch.bind(Bun) to avoid globalThis.fetch mock interference
- Use Promise.all pattern for concurrent fetch/waitForCallback
- Add Bun.sleep(10) in afterEach for port release
* test(concurrency): replace placeholder assertions with getCount checks
Replace 6 meaningless expect(true).toBe(true) assertions with
actual getCount() verifications for test quality improvement
* refactor(config-handler): simplify planDemoteConfig creation
Remove unnecessary IIFE and destructuring, use direct spread instead
* test(executor): use FakeTimeouts for faster tests
- Add custom FakeTimeouts implementation
- Replace setTimeout waits with fakeTimeouts.advanceBy()
- Test time reduced from ~26s to ~6.8s
* test: fix gemini model mock for artistry unstable mode
* test: fix model list mock payload shape
* test: mock provider models for artistry category
---------
Co-authored-by: justsisyphus <justsisyphus@users.noreply.github.com>
- BackgroundManager.shutdown() now aborts all running child sessions via
client.session.abort() before clearing state, preventing orphaned
opencode processes when parent exits
- Add onShutdown callback to BackgroundManager constructor, used to
trigger TmuxSessionManager.cleanup() on process exit signals
- Interactive bash session hook now aborts tracked subagent opencode
sessions when killing tmux sessions (defense-in-depth)
- Add 4 tests verifying shutdown abort behavior and callback invocation
Closes#1240
- Add permission: [{ permission: 'question', action: 'deny' }] to session.create()
in background-agent and delegate-task for SDK-level blocking
- Add subagent-question-blocker hook as backup layer to intercept question tool
calls in tool.execute.before event
- Ensures subagents cannot ask questions to users and must work autonomously
- Add session.status re-check in stability detection before completing
- Reset stablePolls if session is not idle (agent still working)
- Fix statusText to show CANCELLED instead of COMPLETED for non-completed tasks
fix: pass model parameter when resuming background tasks
Ensure resumed tasks maintain their original model configuration
from category settings, preventing unexpected model switching.
The resume method was not passing the stored model from the task,
causing Sisyphus-Junior to revert to the default model when resumed.
This fix adds the model to the prompt body in resume(), matching
the existing behavior in launch().
Fixes#826
OpenCode API returns different structures for user vs assistant messages:
- User: info.model = { providerID, modelID } (nested)
- Assistant: info.modelID, info.providerID (flat top-level)
Previous code only checked nested format, causing model info loss when
continuation hooks fired after assistant messages.
Files modified:
- todo-continuation-enforcer.ts
- ralph-loop/index.ts
- sisyphus-task/tools.ts
- background-agent/manager.ts
Added test for assistant message model extraction.
feat(concurrency): prevent background task races and leaks
Summary
Fixes race conditions and memory leaks in the background task system that could cause "all tasks complete" notifications to never fire, leaving parent sessions waiting indefinitely.
Why This Change?
When background tasks are tracked for completion notifications, the system maintains a pendingByParent map to know when all tasks for a parent session are done. Several edge cases caused "stale entries" to accumulate in this map:
1. Re-registering completed tasks added them back to pending tracking, but they'd never complete again
2. Changing a task's parent session left orphan entries in the old parent's tracking set
3. Concurrent task operations could cause double-acquisition of concurrency slots
These bugs meant the system would sometimes wait forever for tasks that were already done.
What Changed
- Concurrency management: Added proper acquire/release lifecycle with cleanup on process exit (SIGINT, SIGTERM)
- Parent session tracking: Fixed cleanup order. Now clears old parent's tracking before updating parent ID
- Stale entry prevention: Only tracks tasks that are actually running; actively cleans up completed tasks
- Renamed registerExternalTask → trackTask: Clearer name (the old name implied external API consumers, but it's internal)
Update BackgroundManager to rename the method for tracking external tasks, improving clarity and consistency in task management. Adjust related tests to reflect the new method name.
Previously, continuation hooks (todo-continuation, boulder-continuation, ralph-loop)
and background tasks resolved model info from filesystem cache, which could be stale
or missing. This caused session.prompt to fallback to default model (Sonnet) instead
of using the originally configured model (e.g., Opus).
Now all session.prompt calls first try API (session.messages) to get current model
info, with filesystem as fallback if API fails.
Affected files:
- todo-continuation-enforcer.ts
- sisyphus-orchestrator/index.ts
- ralph-loop/index.ts
- background-agent/manager.ts
- sisyphus-task/tools.ts
- hook-message-injector/index.ts (export ToolPermission type)
Enhance the BackgroundManager to properly clean up pending tasks when the parent session ID changes. This prevents stale entries in the pending notifications and ensures that the cleanup process is only executed when necessary, improving overall task management reliability.
Update BackgroundManager to track batched notifications only for running tasks. Implement cleanup for completed or cancelled tasks to avoid stale entries in pending notifications. Enhance logging to include task status for better debugging.
Add functionality to manage process cleanup by registering and unregistering signal listeners. This ensures that BackgroundManager instances properly shut down and remove their listeners on process exit. Introduce tests to verify listener removal after shutdown.
Ensure queue waiters settle once, centralize completion with status guards, and release slots before async work so shutdown and cancellations don’t leak concurrency. Internal hardening only.
Instead of using stale parentModel/parentAgent from task state, now
dynamically looks up the current message to get fresh model/agent values.
Applied across all prompt injection points:
- background-agent notifyParentSession
- ralph-loop continuation
- sisyphus-orchestrator boulder continuation
- sisyphus-task resume
🤖 Generated with [OhMyOpenCode](https://github.com/code-yeongyu/oh-my-opencode) assistance