19 KiB
Execute a workflow script that orchestrates multiple subagents deterministically. Workflows run in the background — this tool returns immediately with a task ID, and a arrives when the workflow completes. Use /workflows to watch live progress.
A workflow structures work across many agents — to be comprehensive (decompose and cover in parallel), to be confident (independent perspectives and adversarial checks before committing), or to take on scale one context can't hold (migrations, audits, broad sweeps). The script is where you encode that structure: what fans out, what verifies, what synthesizes.
ONLY call this tool when the user has explicitly opted into multi-agent orchestration. Workflows can spawn dozens of agents and consume a large amount of tokens; the user must request that scale, not have it inferred. Explicit opt-in means one of:
- The user included the "workflow" or "workflows" keyword (you'll see a system-reminder confirming it).
- Ultracode is on (a system-reminder confirms it) — see Ultracode below.
- The user directly asked you to run a workflow or use multi-agent orchestration in their own words ("run a workflow", "fan out agents", "orchestrate this with subagents"). The ask must be in the user's words — a task that would merely benefit from a workflow does not count.
- The user invoked a skill or slash command whose instructions tell you to call Workflow.
- The user asked you to run a specific named or saved workflow.
For any other task — even one that would clearly benefit from parallelism — do NOT call this tool. Use the Agent tool for individual subagents, or briefly describe what a multi-agent workflow could do and how much it would roughly cost, and ask the user whether to run it. Mention they can include "workflow" in a future message to skip the ask.
When you do call it, the right move is often hybrid: scout inline first (list the files, find the channels, scope the diff) to discover the work-list, then call Workflow to pipeline over it. You don't need to know the shape before the task — only before the orchestration step.
Common single-phase workflows you can chain across turns:
- Understand — parallel readers over relevant subsystems → structured map
- Design — judge panel of N independent approaches → scored synthesis
- Review — dimensions → find → adversarially verify (example below)
- Research — multi-modal sweep → deep-read → synthesize
- Migrate — discover sites → transform each (worktree isolation) → verify
For larger work, run several in sequence — read each result before deciding the next phase. You stay in the loop; each workflow is one well-scoped fan-out.
Ultracode. When a system-reminder confirms ultracode is on, that opt-in is standing: author and run a workflow for every substantive task by default. The goal is the most exhaustive, correct answer you can produce — token cost is not a constraint. For multi-phase work (understand → design → implement → review), that often means several workflows in sequence — one per phase — so you stay in the loop between them. The quality patterns below (adversarial verify, multi-modal sweep, completeness critic, loop-until-dry) are the tools; pick what fits the task. Lean toward orchestrating with workflows and adversarially verifying your findings — unless the work is trivial or already verified. Solo only on conversational turns or trivial mechanical edits. When a reminder says ultracode is off, revert to the opt-in rule above.
Pass the script inline via script — do not Write it to a file first. Every${WORKFLOW_TOOL_NAME} invocation automatically persists its script to a file under the session directory and returns the path in the tool result. To iterate on a workflow, edit that file with Write/Edit and re-invoke Workflow with {scriptPath: "<path>"} instead of resending the full script.${WORKFLOW_SCRIPT_PATH_NOTE}
Every script must begin with export const meta = {...}:
export const meta = {
name: 'find-flaky-tests',
description: 'Find flaky tests and propose fixes', // one-line, shown in permission dialog
phases: [ // one entry per phase() call
{ title: 'Scan', detail: 'grep test logs for retries' },
{ title: 'Fix', detail: 'one agent per flaky test' },
],
}
// script body starts here — use agent()/parallel()/pipeline()/phase()/log()
phase('Scan')
const flaky = await agent('grep CI logs for retry markers', {schema: FLAKY_SCHEMA})
...
The meta object must be a PURE LITERAL — no variables, function calls, spreads, or template interpolation. Required fields: name, description. Optional: whenToUse (shown in the workflow list), phases. Use the SAME phase titles in meta.phases as in phase() calls — titles are matched exactly; a phase() call with no matching meta entry just gets its own progress group. Add model to a phase entry when that phase uses a specific model override.
Script body hooks:
- agent(prompt: string, opts?: {label?: string, phase?: string, schema?: object, model?: string, isolation?: ${WORKFLOW_AGENT_ISOLATION_OPTION}, agentType?: string}): Promise — spawn a subagent. Without schema, returns its final text as a string. With schema (a JSON Schema), the subagent is forced to call a StructuredOutput tool and agent() returns the validated object — no parsing needed. Returns null if the user skips the agent mid-run (filter with .filter(Boolean)). opts.label overrides the display label. opts.phase explicitly assigns this agent to a progress group (use this inside pipeline()/parallel() stages to avoid races on the global phase() state — same phase string → same group box). opts.model overrides the model for this agent call. Default to omitting it — the agent inherits the main-loop model (the resolved session model), which is almost always correct. Only set it when you're highly confident a different tier fits the task; when unsure, omit. opts.isolation: 'worktree' runs the agent in a fresh git worktree — EXPENSIVE (~200-500ms setup + disk per agent), use ONLY when agents mutate files in parallel and would otherwise conflict; the worktree is auto-removed if unchanged.${WORKFLOW_AGENT_ISOLATION_NOTE} opts.agentType uses a custom subagent type (e.g. 'Explore', 'code-reviewer') instead of the default workflow subagent — resolved from the same registry as the Agent tool; composes with schema (the custom agent's system prompt gets a StructuredOutput instruction appended).
- pipeline(items, stage1, stage2, ...): Promise<any[]> — run each item through all stages independently, NO barrier between stages. Item A can be in stage 3 while item B is still in stage 1. This is the DEFAULT for multi-stage work. Wall-clock = slowest single-item chain, not sum-of-slowest-per-stage. Every stage callback receives (prevResult, originalItem, index) — use originalItem/index in later stages to label work without threading context through stage 1's return value. A stage that throws drops that item to
nulland skips its remaining stages. - parallel(thunks: Array<() => Promise>): Promise<any[]> — run tasks concurrently. This is a BARRIER: awaits all thunks before returning. A thunk that throws (or whose agent errors) resolves to
nullin the result array — the call itself never rejects, so.filter(Boolean)before using the results. Use ONLY when you genuinely need all results together. - log(message: string): void — emit a progress message to the user (shown as a narrator line above the progress tree)
- phase(title: string): void — start a new phase; subsequent agent() calls are grouped under this title in the progress display
- args: any — the value passed as Workflow's
argsinput, verbatim (undefined if not provided). Pass arrays/objects as actual JSON values in the tool call, NOT as a JSON-encoded string —args: ["a.ts", "b.ts"], notargs: "[\"a.ts\", ...]"(a stringified list reaches the script as one string, soargs.filter/args.mapthrow). Use this to parameterize named workflows — e.g. pass a research question, target path, or config object directly instead of via a side-channel file. - budget: {total: number|null, spent(): number, remaining(): number} — the turn's token target from the user's "+500k"-style directive.
budget.totalis null if no target was set.budget.spent()returns output tokens spent this turn across the main loop and all workflows — the pool is shared, not per-workflow.budget.remaining()returnsmax(0, total - spent()), orInfinityif no target. The target is a HARD ceiling, not advisory: oncespent()reachestotal, furtheragent()calls throw. Use for dynamic loops:while (budget.total && budget.remaining() > 50_000) { ... }, or static scaling:const FLEET = budget.total ? Math.floor(budget.total / 100_000) : 5. - workflow(nameOrRef: string | {scriptPath: string}, args?: any): Promise — run another workflow inline as a sub-step and return whatever it returns. Pass a name to invoke a saved workflow (same registry as {name: "..."}), or {scriptPath} to run a script file you Wrote earlier. The child shares this run's concurrency cap, agent counter, abort signal, and token budget — its agents appear under a "${WORKFLOW_GROUP_PREFIX} name" group in /workflows and its tokens count toward budget.spent(). The args param becomes the child's
argsglobal. Nesting is one level only: workflow() inside a child throws. Throws on unknown name / unreadable scriptPath / child syntax error; catch to handle gracefully.
Subagents are told their final text IS the return value (not a human-facing message), so they return raw data. For structured output, use the schema option — validation happens at the tool-call layer so the model retries on mismatch.
Workflow agents can reach all session-connected MCP tools via ToolSearch — schemas load on demand per agent. Caveat: interactively-authenticated MCP servers (e.g. claude.ai) may be absent in headless/cron runs.
Scripts are plain JavaScript, NOT TypeScript — type annotations (: string[]), interfaces, and generics fail to parse. The script body runs in an async context — use await directly. Standard JS built-ins (JSON, Math, Array, etc.) are available — EXCEPT Date.now()/Math.random()/argless new Date(), which throw (they would break resume); pass timestamps in via args, stamp results after the workflow returns, and for randomness vary the agent prompt/label by index. No filesystem or Node.js API access.
DEFAULT TO pipeline(). Only reach for a barrier (parallel between stages) when you genuinely need ALL prior-stage results together.
A barrier is correct ONLY when stage N needs cross-item context from all of stage N-1:
- Dedup/merge across the full result set before expensive downstream work
- Early-exit if the total count is zero ("0 bugs found → skip verification entirely")
- Stage N's prompt references "the other findings" for comparison
A barrier is NOT justified by:
- "I need to flatten/map/filter first" — do it inside a pipeline stage: pipeline(items, stageA, r => transform([r]).flat(), stageB)
- "The stages are conceptually separate" — that's what pipeline() models. Separate stages ≠ synchronized stages.
- "It's cleaner code" — barrier latency is real. If 5 finders run and the slowest takes 3× the fastest, a barrier wastes 2/3 of the fast finders' idle time.
Smell test: if you wrote const a = await parallel(...) const b = transform(a) // flatten, map, filter — no cross-item dependency const c = await parallel(b.map(...)) that middle transform doesn't need the barrier. Rewrite as a pipeline with the transform inside a stage. When in doubt: pipeline.
Concurrent agent() calls are capped at min(16, cpu cores - 2) per workflow — excess calls queue and run as slots free up. You can still pass 100 items to parallel()/pipeline() and they all complete; only ~10 run at any moment. Total agent count across a workflow's lifetime is capped at 1000 — a runaway-loop backstop set far above any real workflow.
The canonical multi-stage pattern — pipeline by default, each dimension verifies as soon as its review completes:
export const meta = {
name: 'review-changes',
description: 'Review changed files across dimensions, verify each finding',
phases: [{ title: 'Review' }, { title: 'Verify' }],
}
const DIMENSIONS = [{key: 'bugs', prompt: '...'}, {key: 'perf', prompt: '...'}]
const results = await pipeline(
DIMENSIONS,
d => agent(d.prompt, {label: review:${d.key}, phase: 'Review', schema: FINDINGS_SCHEMA}),
review => parallel(review.findings.map(f => () =>
agent(Adversarially verify: ${f.title}, {label: verify:${f.file}, phase: 'Verify', schema: VERDICT_SCHEMA})
.then(v => ({...f, verdict: v}))
))
)
const confirmed = results.flat().filter(Boolean).filter(f => f.verdict?.isReal)
return { confirmed }
// Dimension 'bugs' findings verify while dimension 'perf' is still reviewing. No wasted wall-clock.
When a barrier IS correct — dedup across all findings before expensive verification: const all = await parallel(DIMENSIONS.map(d => () => agent(d.prompt, {schema: FINDINGS_SCHEMA}))) const deduped = dedupeByFileAndLine(all.filter(Boolean).flatMap(r => r.findings)) // <-- genuinely needs ALL at once const verified = await parallel(deduped.map(f => () => agent(verifyPrompt(f), {schema: VERDICT_SCHEMA})))
Loop-until-count pattern — accumulate to a target:
const bugs = []
while (bugs.length < 10) {
const result = await agent("Find bugs in this codebase.", {schema: BUGS_SCHEMA})
bugs.push(...result.bugs)
log(${bugs.length}/10 found)
}
Loop-until-budget pattern — scale depth to the user's "+500k" directive. Guard on budget.total: with no target set, remaining() is Infinity and the loop would run straight to the 1000-agent cap.
const bugs = []
while (budget.total && budget.remaining() > 50_000) {
const result = await agent("Find bugs in this codebase.", {schema: BUGS_SCHEMA})
bugs.push(...result.bugs)
log(${bugs.length} found, ${Math.round(budget.remaining()/1000)}k remaining)
}
Composing patterns — exhaustive review (find → dedup vs seen → diverse-lens panel → loop-until-dry):
const seen = new Set(), confirmed = []
let dry = 0
while (dry < 2) { // loop-until-dry
const found = (await parallel(FINDERS.map(f => () => // barrier: collect all finders this round
agent(f.prompt, {phase: 'Find', schema: BUGS})))).filter(Boolean).flatMap(r => r.bugs)
const fresh = found.filter(b => !seen.has(key(b))) // dedup vs ALL seen — plain code, not an agent
if (!fresh.length) { dry++; continue }
dry = 0; fresh.forEach(b => seen.add(key(b)))
const judged = await parallel(fresh.map(b => () => // every fresh bug judged concurrently...
parallel(['correctness','security','repro'].map(lens => () => // ...each by 3 distinct lenses
agent(Judge "${b.desc}" via the ${lens} lens — real?, {phase: 'Verify', schema: VERDICT})))
.then(vs => ({ b, real: vs.filter(Boolean).filter(v => v.real).length >= 2 }))))
confirmed.push(...judged.filter(v => v.real).map(v => v.b))
}
return confirmed
// dedup vs seen, NOT confirmed — else judge-rejected findings reappear every round and it never converges.
Quality patterns — common shapes; pick by task and compose freely:
- Adversarial verify: spawn N independent skeptics per finding, each prompted to REFUTE. Kill if ≥majority refute. Prevents plausible-but-wrong findings from surviving.
const votes = await parallel(Array.from({length: 3}, () => () =>
agent(
Try to refute: ${claim}. Default to refuted=true if uncertain., {schema: VERDICT}))) const survives = votes.filter(Boolean).filter(v => !v.refuted).length >= 2 - Perspective-diverse verify: when a finding can fail in more than one way, give each verifier a distinct lens (correctness, security, perf, does-it-reproduce) instead of N identical refuters — diversity catches failure modes redundancy can't.
- Judge panel: generate N independent attempts from different angles (e.g. MVP-first, risk-first, user-first), score with parallel judges, synthesize from the winner while grafting the best ideas from runners-up. Beats one-attempt-iterated when the solution space is wide.
- Loop-until-dry: for unknown-size discovery (bugs, issues, edge cases), keep spawning finders until K consecutive rounds return nothing new. Simple counters (while count < N) miss the tail.
- Multi-modal sweep: parallel agents each searching a different way (by-container, by-content, by-entity, by-time). Each is blind to what the others surface; useful when one search angle won't find everything.
- Completeness critic: a final agent that asks "what's missing — modality not run, claim unverified, source unread?" What it finds becomes the next round of work.
- No silent caps: if a workflow bounds coverage (top-N, no-retry, sampling),
log()what was dropped — silent truncation reads as "covered everything" when it didn't.
Scale to what the user asked for. "find any bugs" → a few finders, single-vote verify. "thoroughly audit this" or "be comprehensive" → larger finder pool, 3–5 vote adversarial pass, synthesis stage. When unsure, lean toward thoroughness for research/review/audit requests and toward brevity for quick checks.
These patterns aren't exhaustive — compose novel harnesses when the task calls for it (tournament brackets, self-repair loops, staged escalation, whatever fits).
Use this tool for multi-step orchestration where control flow should be deterministic (loops, conditionals, fan-out) rather than model-driven.
Resume
The tool result includes a runId. To resume after a pause, kill, or script edit, relaunch with Workflow({scriptPath, resumeFromRunId}) — the longest unchanged prefix of agent() calls returns cached results instantly; the first edited/new call and everything after it runs live. Same script + same args → 100% cache hit. Date.now()/Math.random()/new Date() are unavailable in scripts (they would break this) — stamp results after the workflow returns, or pass timestamps via args. Fallback when no journal is available: Read agent-.jsonl files in the transcript directory and hand-author a continuation script.