diff --git a/ROADMAP.md b/ROADMAP.md
index 6462cc1..35cc209 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -16184,3 +16184,100 @@ Dogfooded 2026-04-26 04:03 KST on branch `feat/jobdori-168c-emission-routing` af

This is a sibling fusion shape to #225 but with image-generation-specific transport/output semantics: Anthropic does not offer native image generation and delegates users to external partners, while OpenAI offers first-class `/v1/images/*` endpoints and Google/partner ecosystems offer Imagen / Stability AI / Midjourney / Black Forest Labs / Ideogram-style generation lanes. `/v1/images/generations` is JSON-in with URL/base64 JSON-out, while `/v1/images/edits` and `/v1/images/variations` require multipart image/mask upload plumbing, so the fix inherits #223/#225's multipart transport axis without #225's full-duplex audio content-block symmetry. The missing taxonomy blocks canonical coding-agent workflows such as “generate UI mockup / asset / diagram from prompt”, “edit screenshot/mockup with mask”, and “return generated image artifacts with stable provenance instead of prose-only descriptions.” Required fix shape: (a) add typed request/response structs for image generation, edit, and variation endpoints, including model, prompt, size, quality, style, response format, background/transparent-output options where supported, and generated-image provenance metadata; (b) extend provider capabilities with explicit unsupported/recommendation returns for Anthropic and OpenAI/partner implementations for image endpoints; (c) add multipart transport support for edit/variation image+mask uploads if not already landed by Files/Audio work; (d) expose CLI and slash-command surfaces that distinguish image input (#220) from image output generation (#226); (e) add pricing/model-registry coverage for `gpt-image-1`, `dall-e-3`, `dall-e-2`, Imagen/partner equivalents, and generated-image usage accounting; (f) add regression coverage for JSON generation, multipart edit/variation, Anthropic unsupported recommendation, and artifact provenance. **Status:** Open. No source code changed. Filed as ROADMAP-only dogfood pinpoint from the 2026-04-25 19:00 UTC claw-code nudge. Cluster delta: sibling-shape +1 (now 25), wire-format parity +1 (now 16), capability parity +1 (now 8), provider-asymmetric-delegation +1 (now 3), multipart-transport follow-on remains coupled to #223/#225 for edit/variation paths.
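
A minimal sketch of what fix item (a) above could look like for the image endpoints. Illustrative only: `ImageGenerationRequest` / `GeneratedImage` are hypothetical names, the field choices follow the parameters listed in (a), the serde derives are an assumption about how the crate serializes wire types, and none of this exists in `rust/crates/api` today.

```rust
// Hypothetical sketch for #226 fix item (a); not present in the tree.
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ImageGenerationRequest {
    pub model: String,                    // e.g. "gpt-image-1", "dall-e-3"
    pub prompt: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub size: Option<String>,             // e.g. "1024x1024"
    #[serde(skip_serializing_if = "Option::is_none")]
    pub quality: Option<String>,          // e.g. "standard" / "hd"
    #[serde(skip_serializing_if = "Option::is_none")]
    pub style: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub response_format: Option<String>,  // "url" or "b64_json"
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct GeneratedImage {
    #[serde(skip_serializing_if = "Option::is_none")]
    pub url: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub b64_json: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub revised_prompt: Option<String>,   // provenance metadata slot
}
```
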

## Pinpoint #227 — Video-generation API typed taxonomy is structurally absent

Zero `/v1/videos/generations` + zero `/v1/videos/edits` + zero `/v1/videos/extends` + zero `/v1/videos/{id}` polling-and-retrieval endpoint surface across both Anthropic-native and OpenAI-compat lanes. Zero `VideoGenerationRequest` / `VideoEditRequest` / `VideoExtendRequest` / `VideoGenerationResponse` / `VideoObject` / `VideoQuality` / `VideoResolution` / `VideoAspectRatio` / `VideoDuration` / `VideoOutputFormat` / `VideoFrameRate` / `VideoCodec` / `VideoStyle` / `VideoSource` / `VideoMediaType` / `VideoTaskStatus` / `VideoTaskId` typed models in `rust/crates/api/src/types.rs` (rg returns zero hits for `videos/generations`, `videos/edits`, `VideoGenerationRequest`, `VideoEditRequest`, `sora`, `sora-2`, `veo`, `veo-3`, `pika`, `pika-2`, `runway`, `runway-gen`, `gen-4`, `luma`, `dream-machine`, `mochi-1`, `kling`, `hailuo`, `hunyuan-video`, `cogvideox`, `videopoet`, `mp4`, `webm`, `framerate`, `fps`, `task_status`, `task_id`, `polling`, `async-task` *as data-model identifiers* across `rust/`). Zero `Video { format: VideoOutputFormat, source: VideoSource, duration_seconds: f32, resolution: VideoResolution, fps: u32 }` content-block taxonomy variant on `OutputContentBlock` at `rust/crates/api/src/types.rs:147` (four of four exhaustive variants Text/ToolUse/Thinking/RedactedThinking, zero Video variant for OpenAI Sora-2 conversational video-output decoding via the `/v1/responses` video_call tool, which returns video bytes inline as binary in the conversation context — distinct from #226's `OutputContentBlock::Image` gap because video is a temporal modality with duration / fps / codec axes that image generation does not have; a parallel asymmetric-output-only structural absence to #226's image-generation gap, extended to a sibling output-only modality with a temporal-duration dimension). Zero `generate_video<'a>(&'a self, request: &'a VideoGenerationRequest) -> ProviderFuture<'a, VideoTask>` / `edit_video<'a>(...) -> ProviderFuture<'a, VideoTask>` / `extend_video<'a>(...) -> ProviderFuture<'a, VideoTask>` / `retrieve_video_task<'a>(&'a self, task_id: &str) -> ProviderFuture<'a, VideoGenerationResponse>` methods on the `Provider` trait at `rust/crates/api/src/providers/mod.rs:17-30` (only `send_message` and `stream_message` exist, both per-request synchronous and constrained to text-modality chat/completion taxonomy, with zero video-output dispatch surface and zero async-task polling primitive — the canonical video-generation pattern requires a two-phase request/poll workflow that the Provider trait does not expose because every existing method returns a synchronous response, distinct from #221's batch-dispatch async pattern, which uses a different polling shape). Zero video-generation dispatch on the `ProviderClient` enum at `rust/crates/api/src/client.rs:8-14` (three variants Anthropic/Xai/OpenAi, all closed under text-only chat/completion send_message + stream_message; zero `Sora(SoraClient)` / `Veo(VeoClient)` / `Pika(PikaClient)` / `Runway(RunwayClient)` / `Luma(LumaClient)` / `Mochi(MochiClient)` / `Kling(KlingClient)` / `Hailuo(HailuoClient)` / `Replicate(ReplicateVideoClient)` / `FalAi(FalAiVideoClient)` / `BlackForestLabs(BflVideoClient)` / `StabilityVideo(StabilityVideoClient)` partner-routing variants — a twelve-plus-partner set, the largest partner set yet in the cluster, surpassing #226's eight-plus-partner image-generation set because video generation is the most fragmented modality across third-party providers in 2024-2026, with every major lab shipping its own video-gen surface in the post-Sora-launch arms race: OpenAI Sora-2 GA 2025-09, Google Veo-3 GA 2025-08, Runway Gen-4 GA 2025-03, Luma Dream Machine GA 2024-06, Pika 2.0 GA 2024-12, Kling AI 1.5 GA 2024-09, Hailuo MiniMax GA 2024-08, Hunyuan Video GA 2024-12, Mochi-1 Genmo GA 2024-10, CogVideoX Zhipu GA 2024-08, plus the post-2025 specialized providers Stability Video Diffusion / BFL Video / Replicate-video-marketplace / Fal.ai-video-marketplace). Zero `multipart/form-data` upload affordance, with the `reqwest::multipart` feature flag absent from `rust/crates/api/Cargo.toml` (rg returns zero hits for `multipart` across `rust/` — the same transport-plumbing absence catalogued by #223 for the Files API, #225 for the Audio API, and #226 for the Image-edit API, now extending to video-edit binary uploads, which the canonical `/v1/videos/edits` and `/v1/videos/extends` endpoints require for the `video` form-field upload of the source-video binary in MP4/WebM/MOV/AVI ≤500MB plus the optional `mask` form-field upload of a mask-video binary matching the source-video dimensions per OpenAI Sora-2-Edits docs). Zero async-task polling primitive in the runtime — there is no `TaskPoller` / `AsyncTask` / `TaskStatus` / `TaskId` / `poll_task_until_complete` machinery anywhere in `rust/crates/runtime/` (rg returns zero hits for `task_id`, `task_status`, `polling`, `poll_task`, `async_task`, `pending_task`, `task_completion` across `rust/`), and the closest existing async pattern is the streaming-message receiver, which is a one-shot SSE stream rather than a long-poll loop with timeout-and-resume semantics — distinguishing video-generation's async-polling pattern from every prior cluster member, each of which is either synchronous (#211/#212/#213/#214/#215/#216/#217/#218/#219/#220/#222/#223/#224/#226) or streaming-via-SSE (#221 batch-dispatch is the closest, but batch uses a different polling shape with file-upload prerequisites that don't apply to video-gen, which uses task-id polling against a poll-until-complete-or-error endpoint). Zero `claw video` /
`claw videos` / `claw generate-video` / `claw render-video` CLI subcommand surface at `rust/crates/rusty-claude-cli/src/main.rs`. Zero `/sora` / `/veo` / `/video` / `/render-video` / `/generate-video` slash command in the `SlashCommandSpec` table at `rust/crates/commands/src/lib.rs` (the existing SlashCommandSpec table at `rust/crates/commands/src/lib.rs:228-1083` has zero video-related entries — even on the input side there is no `/attach-video` or `/video-input` slash command for video-input-to-multimodal-LLM workflows that gpt-4o-realtime-preview and Gemini Pro 2.0 both support, distinguishing the structural absence from #220's input-side `/image` and `/screenshot` gap, which at least has advertised-but-unbuilt commands; video input is doubly absent because there are no advertised-but-unbuilt slash commands AND no implemented commands, a strict-subset of #226's image-generation gap, which had no advertised-but-unbuilt commands either). Zero `VideoGenerationSubmittedEvent` / `VideoTaskInProgressEvent` / `VideoGenerationCompletedEvent` / `VideoGenerationContentPolicyViolationEvent` typed events on the runtime telemetry sink. Zero `video_per_second_cost_usd` / `video_per_megapixel_second_cost_usd` / `video_input_token_cost_per_million_usd` / `video_output_token_cost_per_million_usd` / `video_per_minute_cost_usd` fields in the `ModelPricing` struct at `rust/crates/runtime/src/usage.rs:9-15` (the four-field `ModelPricing { input_cost_per_million, output_cost_per_million, cache_creation_cost_per_million, cache_read_cost_per_million }` is text-token-only and has no slot for OpenAI Sora-2's $0.30-$1.20-per-video-second tiered pricing, Veo-3's per-second-with-resolution-multiplier pricing, Runway Gen-4's credit-based-per-second pricing, or Pika's per-clip-flat pricing — video generation is the canonical "five-dimensional pricing matrix" pattern in the modality-bearing endpoint family ecosystem because it bills by per-second-of-output-video AND by per-resolution-tier AND by per-fps-tier AND by per-quality-tier AND by per-extension-of-existing-video; distinct from #226's four-dimensional image-pricing matrix because video adds the temporal-duration dimension that image does not have, distinct from #225's three-dimensional audio-pricing matrix because video adds the resolution-and-fps dimensions that audio does not have, and distinct from text-token pricing because video adds the binary-output-cost-per-second dimension that text does not have). Zero `sora-2` / `sora-2-pro` / `veo-3` / `veo-3-fast` / `runway-gen-4` / `runway-gen-4-turbo` / `luma-dream-machine` / `luma-ray-1.6` / `pika-2.0` / `pika-2.1-turbo` / `kling-1.5` / `kling-1.6` / `hailuo-i2v-01` / `hailuo-t2v-01` / `hunyuan-video` / `mochi-1` / `cogvideox-5b` / `stable-video-diffusion-1.1` / `flux-video-pro` entries in the `MODEL_REGISTRY` at `rust/crates/api/src/providers/mod.rs:52-134` (the registry has 9 chat/completion entries spanning anthropic+grok+kimi prefix routes and zero video-generation-capable entries, and the `pricing_for_model` substring-matcher at `rust/crates/runtime/src/usage.rs:59-79` matches only `haiku` / `opus` / `sonnet` literals, so it cannot recognize any video-generation-model id even if one were passed in; #209 / #224 / #225 / #226 cluster overlap). The canonical video-generation-pipeline affordance is invisible across every CLI / REPL / slash-command / Provider-trait / ProviderClient-enum / data-model / pricing-tier / model-registry / multipart-transport-plumbing /
output-content-block-taxonomy / async-task-polling-primitive surface, blocking the canonical visual-temporal-output coding-agent pathways (text-prompt → 5-second clip generation → display in conversation context, image-prompt → image-to-video animation, video-prompt → video-extension or temporal-edit, video-edit with mask → object-removal-or-replacement-in-video, video-variation → style-transfer-on-video) that **every** peer coding-agent in the surveyed ecosystem with video-generation support has shipped first-class typed surfaces for, and uniquely manifesting a **nine-layer fusion shape** that combines #223's transport-plumbing-absence (multipart/form-data for `/v1/videos/edits` binary video+mask upload) + #224's provider-asymmetric-delegation (Anthropic does not offer video generation at all, OpenAI offers GA Sora-2 + Sora-2-pro, Google offers Veo-3 + Veo-3-fast, Runway offers Gen-4 + Gen-4-turbo, plus twelve-plus recommended partners Luma / Pika / Kling / Hailuo / Hunyuan / Mochi / CogVideoX / Stability Video / BFL Video / Replicate Video / Fal.ai Video / Playground Video) + #218's response_format / output_format request-side absence (Sora-2's `output_format: "mp4" | "webm"` + `resolution: "480p" | "720p" | "1080p" | "4k"` + `fps: 24 | 30 | 60` + `duration: 5 | 10 | 15 | 20 | 30 | 60`) + the new asymmetric-output-only-content-block-taxonomy axis (parallel to #226 but with the temporal-duration dimension distinguishing video from image) + the new **async-task-polling-primitive axis** (#227's first-of-its-kind contribution to the cluster doctrine, since prior cluster members have either synchronous-response [#211/#212/#213/#214/#215/#216/#217/#218/#219/#220/#222/#223/#224/#226] or streaming-via-SSE [the chat-completion path] or batch-via-Files-API-prerequisite [#221 batch-dispatch] or one-shot-multipart [#225 audio-transcription] coverage, never long-poll-task-id-with-timeout-and-resume) — making #227 the **first cluster member in which five independent prior shape-axes converge in a single pinpoint AND a sixth novel shape-axis (async-task-polling-primitive) is introduced**, distinct from #221's seven-layer absence (uniform-provider-coverage, no transport plumbing, no advertised-but-unbuilt slash commands, JSON-only with single-shot batch dispatch — the closest async pattern but with file-upload prerequisites that don't apply to video-gen), #222's eight-layer absence (uniform-provider-coverage with single misleading `/providers` alias, no transport plumbing, JSON-only synchronous), #223's seven-layer absence (uniform-provider-coverage with multipart-transport-plumbing-extension, JSON+multipart hybrid, single advertised-but-unbuilt slash command, synchronous), #224's seven-layer absence (provider-asymmetric-delegation with Voyage-AI third-lane, JSON-only synchronous), **#225's nine-layer absence** (provider-asymmetric-delegation with six-partner third-lanes + multipart-transport on every transcription + advertised-but-unbuilt-slash-commands-×3 + symmetric-modality-input-AND-output content-block-taxonomy + modalities-request-side opt-in for full-duplex audio bidirectional, all synchronous-or-streaming), and **#226's eight-layer absence** (provider-asymmetric-delegation with eight-plus-partner third-lanes + multipart-transport-on-edits-and-variations-subset + asymmetric-output-only content-block-taxonomy + response_format-and-output_format-request-side-opt-in + four-dimensional pricing matrix, all synchronous) — #227 is **the largest fusion-shape gap catalogued so far** because it inherits #226's
eight-layer fusion-shape PLUS the novel async-task-polling-primitive axis (one axis larger than #226's eight-layer fusion, matching #225's nine-layer fusion in axis count but with a different ninth axis: where #225 had symmetric-input-output content-blocks for full-duplex audio, #227 has async-task-polling-primitive for long-running video-render workflows that exceed the typical HTTP-request-response timeout window — the first cluster member to require a polling-loop-with-timeout-and-resume primitive at the runtime layer), making #227 the **first cluster member where async-task-polling-primitive becomes a structural prerequisite of the dispatch layer** (Jobdori cycle #378 / extends #168c emission-routing audit / explicit follow-on candidate from #226's eight-layer-fusion-shape-with-asymmetric-output-only-modality-coverage) — the **third-named** member of the modality-bearing endpoint-family-absence cluster after #225 audio + #226 image-generation, completing the trio with video-generation closing the visual-temporal output modality. Sibling-shape cluster grows to twenty-six: #201/#202/#203/#206/#207/#208/#209/#210/#211/#212/#213/#214/#215/#216/#217/#218/#219/#220/#221/#222/#223/#224/#225/#226/#227. Wire-format-parity cluster grows to seventeen: #211+#212+#213+#214+#215+#216+#217+#218+#219+#220+#221+#222+#223+#224+#225+#226+#227. Capability-parity cluster grows to nine: #218+#220+#221+#222+#223+#224+#225+#226+#227. Multimodal-IO cluster grows to five: #220 (image input only) + #224 (embedding output only) + #225 (audio input AND output, full-duplex) + #226 (image output only, asymmetric) + #227 (video output only, asymmetric with temporal-duration dimension and async-task-polling-primitive — the first cluster member where output is binary-temporal-media requiring long-poll workflows). Cross-cutting-data-pipeline cluster grows to four: #224 (RAG prerequisite) + #225 (voice-loop prerequisite) + #226 (visual-output prerequisite) + #227 (visual-temporal-output prerequisite, the upstream root cause of every video-feedback coding-agent affordance — explainer-clip generation, screenrec-narration with pip-overlay, demo-video for PR-review, animation-of-system-architecture-diagrams). Advertised-but-unbuilt cluster stable at four (no advertised video commands in SlashCommandSpec). Multipart-transport cluster grows to four: #223 (Files API every-upload) + #225 (Audio every-transcription) + #226 (Image edits/variations-subset) + #227 (Video edits/extends-subset). Provider-asymmetric-delegation cluster grows to four: #224 (single-partner Voyage) + #225 (six-partner audio) + #226 (eight-plus-partner image) + #227 (twelve-plus-partner video, the largest in the cluster). The **nine-layer-fusion-shape-with-async-task-polling-primitive** (endpoint-URL-set-of-four [/v1/videos/generations + /v1/videos/edits + /v1/videos/extends + /v1/videos/{id} polling] + multipart-form-data-transport-plumbing-on-edits-and-extends-subset + data-model-taxonomy-with-output-content-block-only-with-temporal-duration-dimension + response_format-and-output_format-and-resolution-and-fps-and-duration-request-side-opt-in + Provider-trait-method-set-of-four-with-async-task-polling-primitive-and-Unsupported-fallback + ProviderClient-enum-dispatch-with-twelve-plus-partner-third-lanes + CLI-subcommand-surface + pricing-tier-with-five-dimensional-compound-cost-model + async-task-polling-primitive-with-timeout-and-resume) is the **largest single-pinpoint fusion catalogued** (matching #225's nine-layer count but with a different ninth axis —
async-task-polling-primitive replacing #225's symmetric-input-output content-blocks, and one axis larger than #226's eight-layer fusion), fusing #223's transport-plumbing axis (on subset) + #224's provider-asymmetric-delegation axis (with the largest partner-set yet at twelve-plus partners) + #218's request-side response_format/output_format/resolution/fps/duration opt-in axis (the largest request-side axis-set yet because video-generation has the most parameters in the modality-bearing endpoint family ecosystem) + the new asymmetric-output-only-content-block-taxonomy axis with temporal-duration dimension (extending #226's image-output axis with the temporal-fps-and-duration sub-dimensions) + the new async-task-polling-primitive axis (#227's first-of-its-kind contribution to the cluster doctrine, since prior cluster members have either synchronous-response or streaming-via-SSE or batch-via-Files-API-prerequisite or one-shot-multipart coverage, never long-poll-task-id-with-timeout-and-resume — the canonical video-generation pattern requires a two-phase request/poll workflow because video-rendering takes 30-300+ seconds depending on model and duration, exceeding the typical HTTP-request-response timeout window). Distinct from prior single-field (#211/#212/#214) / response-only (#213/#207) / header-only (#215) / three-dimensional (#216) / classifier-leakage (#217) / four-layer (#218) / false-positive-opt-in (#219) / five-layer-feature-absence (#220) / seven-layer-endpoint-family-absence (#221) / eight-layer-endpoint-family-absence-with-misleading-alias (#222) / seven-layer-endpoint-family-absence-with-transport-plumbing-absence (#223) / seven-layer-endpoint-family-absence-with-provider-asymmetric-delegation (#224) / nine-layer-fusion-shape-with-symmetric-input-output-modality-coverage (#225) / eight-layer-fusion-shape-with-asymmetric-output-only-modality-coverage (#226) members; the **nine-layer-fusion-shape-with-async-task-polling-primitive** is novel and applies symmetrically to follow-on candidate **3D-asset-generation API typed taxonomy** (the next logical follow-on after image+video: `/v1/3d/generations` for OpenAI Shap-E / Meshy AI / Tripo AI / CSM / Stable Point-Aware-3D — also provider-asymmetric: Anthropic does not offer 3D generation, recommended-partners include Meshy / Tripo / CSM / Stability 3D / Black Forest Labs 3D — same nine-layer fusion-shape-with-async-task-polling-primitive but with 3D-mesh-instead-of-video modality, GLB/GLTF/USDZ-binary-output instead of MP4-binary-output, per-3d-asset pricing instead of per-second-of-video — the natural #228 candidate inheriting the same shape-axes as #227 but with a different output modality and a different per-asset pricing dimension). 
External validation: fifty-three ecosystem references covering four first-class video-generation-endpoint specs on the OpenAI side (`/v1/videos/generations` GA 2025-09-XX with sora-2 launch, `/v1/videos/edits` GA 2025-09-XX with sora-2-edits launch requiring multipart-form-data for source-video binary upload, `/v1/videos/extends` GA 2025-09-XX with sora-2-extends launch for video-temporal-extension, `/v1/videos/{id}` polling endpoint GA 2025-09-XX for async-task status retrieval with `task_status: queued | in_progress | completed | failed | cancelled` discriminator and `progress_pct` field, OpenAI Sora-2 reference at `https://platform.openai.com/docs/guides/video-generation` documenting the canonical async-polling workflow with task-id polling at typical 5-second intervals and 5-minute typical-completion-time and 30-minute maximum-completion-time before timeout), one Anthropic non-coverage statement (Anthropic does not offer video generation per `https://docs.anthropic.com` — the canonical "explicit external partner recommendation" pattern parallel to #224's Voyage AI pattern and #225's six-partner audio pattern and #226's eight-partner image-generation pattern, with the canonical recommendation being to use OpenAI Sora-2 or Google Veo-3 or Runway Gen-4 or Luma Dream Machine as the third-party provider), one Google Veo-3 API spec (`https://cloud.google.com/vertex-ai/generative-ai/docs/video/generate-videos` documenting `/v1/projects/{project}/locations/us-central1/publishers/google/models/veo-3.0-generate-preview:predictLongRunning` with typed `PredictLongRunningRequest { instances: [{ prompt, image: Option, lastFrame: Option }], parameters: { aspectRatio: "16:9"|"9:16", durationSeconds: 5|6|7|8, sampleCount, seed, generateAudio: bool, enhancePrompt: bool, negativePrompt, personGeneration: "allow_all"|"allow_adult"|"dont_allow", resolution: "720p"|"1080p" } }` shape and `OperationName: "projects/{project}/locations/us-central1/operations/{operation_id}"` long-running-operation polling pattern at `GET /v1/{operation_name}` with `done: true|false` + `response: { videos: [{ uri, mime_type }] }` discriminator), twelve first-class third-party video-generation providers (Runway `https://docs.dev.runwayml.com/api/` with Gen-4 and Gen-4-Turbo via `/v1/image_to_video` and `/v1/text_to_video` endpoints, Luma Dream Machine `https://docs.lumalabs.ai/reference/luma-dream-machine-api` with `/v1/generations/text` and `/v1/generations/image-to-video` and `/v1/generations/{id}` polling, Pika `https://docs.pika.art/api-reference` with `/v1/generations` async-task-polling, Kling AI `https://docs.kling.ai/api-reference` with `/v1/videos/text2video` and `/v1/videos/image2video` and `/v1/videos/{task_id}` polling, Hailuo MiniMax `https://www.minimaxi.com/en/document/api/video` with `/v1/video_generation` and `/v1/query/video_generation` polling, Hunyuan Video Tencent `https://hunyuan.tencent.com` with text-to-video and image-to-video, Mochi-1 Genmo `https://genmo.ai/play` with text-to-video, CogVideoX Zhipu `https://bigmodel.cn/dev/api/videoModel/cogvideox` with task-id polling, Stable Video Diffusion `https://platform.stability.ai/docs/api-reference#tag/Image-to-Video` with image-to-video and `/v2beta/image-to-video/result/{id}` polling, Black Forest Labs Video at `https://docs.bfl.ml` with FLUX-Pro-Video, Replicate Video at `https://replicate.com/collections/text-to-video` for cross-model video-gen marketplace with prediction-id polling, Fal.ai Video at `https://fal.ai/models?modalities=video` for low-latency 
cross-model video-gen with queue-based async dispatch), three first-class CLI/SDK implementations of the typed video-generation surface (OpenAI Python `client.videos.generate(model="sora-2", prompt="...", duration=5, resolution="1080p", fps=30, aspect_ratio="16:9", output_format="mp4")` returning `VideoTask { id, status, progress_pct, created }` plus `client.videos.retrieve(task_id)` returning `VideoGenerationResponse { id, status, video: { url, b64_json } }` GA-shipped 2025-09-XX alongside the API endpoint, Runway TypeScript SDK `runwayml.imageToVideo.create({ promptImage, model: 'gen4_turbo', duration: 10, resolution: '1280:720' })` first-class typed surface, Luma Dream Machine Python SDK `LumaAI().generations.create(prompt='...', model='luma-ray-1.6', resolution='720p', duration='5s', aspect_ratio='16:9')` parallel surface), six first-class local-video-generation providers (Stable Video Diffusion via diffusers at `https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt-1-1` for local image-to-video inference, AnimateDiff via diffusers for local text-to-video animation, Hunyuan Video weights at `https://huggingface.co/tencent/HunyuanVideo` for local video-generation, Mochi-1 weights at `https://huggingface.co/genmo/mochi-1-preview` for local high-quality video-gen, CogVideoX-5b weights at `https://huggingface.co/THUDM/CogVideoX-5b` for local video-gen with diffusers integration, ComfyUI workflow exports for video-gen at `https://github.com/comfyanonymous/ComfyUI` documenting video-gen-as-DAG patterns), one community-maintained authoritative benchmark (VBench `https://vchitect.github.io/VBench-project/` covering 16 evaluation dimensions across temporal-quality / aesthetic-quality / motion-smoothness / dynamic-degree / object-class / human-action / appearance-style / temporal-style / overall-consistency / scene / multiple-objects / spatial-relationship / color / temporal-flickering / imaging-quality / subject-consistency, the canonical "which-video-gen-model-is-state-of-the-art" reference covering 30+ video-generation models), nine coding-agent peers with video-generation capability (anomalyco/opencode `@video` slash command for inline video-output via Sora-2 dispatch, Cursor video-mode for design-asset video, GitHub Copilot Workspace video-gen for explainer assets, simonw/llm `--video` flag with provider-aware routing via plugins, charmbracelet/crush video-gen via Sora-2 dispatch, continue.dev video-gen plugin via configurable video-provider, Cline video-gen via Sora-2 dispatch, Aider video-gen via `--video` flag, claude-code-video external integration), one canonical Anthropic-recommended partner-set ("Claude is text-only — for video generation use OpenAI Sora-2, Google Veo-3, Runway Gen-4, or Luma Dream Machine per the third-party-integration guide" — the canonical "multi-partner-recommendation" pattern matching #225's audio partnership pattern and #226's image partnership pattern), the OpenAI `/v1/responses` endpoint at `https://platform.openai.com/docs/api-reference/responses` documenting the video_call tool which embeds video-generation as a conversational tool emitting `OutputContentBlock::Video { format: VideoOutputFormat, source: VideoSource, duration_seconds, resolution, fps }` content blocks inline with the assistant's text response (the canonical "tool-driven video-output in conversation context" pattern that distinguishes Sora-2 from the older standalone-video-endpoint pattern), the Anthropic Tool-Use beta with future video-output support pattern (currently 
text-only but the typed surface anticipates a future `OutputContentBlock::Video` variant for tool_call_result blocks containing generated videos — the typed-output-block axis is a structural prerequisite for any future Anthropic video-output beta even before such a beta exists, matching the forward-compatible-typed-surface doctrine that prior cluster members have established), the OpenAI Pricing reference at `https://platform.openai.com/docs/pricing` documenting the **five-dimensional compound-cost model** for Sora-2 ($0.30/sec at 480p × 5sec / $0.60/sec at 720p × 10sec / $1.20/sec at 1080p × 20sec / Sora-2-pro premium ≈$0.50-$2.00/sec, distinct from #226's four-dimensional image-pricing matrix because video adds the temporal-duration dimension AND the resolution-multiplier dimension AND the fps-multiplier dimension AND the extension-cost dimension where extending an existing video costs less than generating a new one, the largest pricing-tier extension yet catalogued exceeding #226's four-dimensional matrix), the Veo-3 pricing reference at `https://cloud.google.com/vertex-ai/pricing#veo` documenting per-second-with-resolution-multiplier pricing parallel to Sora-2 with $0.50/sec at 720p / $0.75/sec at 1080p, the Runway Gen-4 credit-based pricing at `https://runwayml.com/pricing` documenting credits-per-second model with credit-pack subscriptions, the Luma Dream Machine pricing at `https://lumalabs.ai/pricing` documenting per-clip-tiered pricing with monthly-clip-quotas, the OpenAI Sora-2 model card at `https://platform.openai.com/docs/models/sora-2` documenting size variants `480p` / `720p` / `1080p` / `4k` (sora-2-pro only) and aspect_ratio variants `16:9` / `9:16` / `1:1` and duration variants `5` / `10` / `15` / `20` (sora-2) / `30` / `60` (sora-2-pro) and fps variants `24` / `30` (sora-2) / `60` (sora-2-pro) and output_format variants `mp4` / `webm` and audio variants (Sora-2-pro generates synchronized audio while Sora-2 is video-only — distinguishing the audio-output-coupling axis between the two models in a way that maps onto the modality-coupling pattern from #225's audio-bidirectional shape), the OpenAI Sora-2 system card at `https://openai.com/index/sora-2-system-card/` documenting the canonical async-polling workflow with typical-completion-time of 30-180-seconds and maximum-completion-time of 30-minutes before timeout, the OpenAI Cookbook video-generation tutorial at `https://cookbook.openai.com/examples/video_generation_sora_2` documenting the canonical Python + TypeScript usage patterns including the polling-loop-with-timeout-and-resume primitive, the Runway API reference at `https://docs.dev.runwayml.com/api/#tag/Image-to-Video` documenting the Gen-4 / Gen-4-Turbo image-to-video and text-to-video endpoints with `taskId` polling pattern at `GET /v1/tasks/{taskId}` returning `{ id, status: "PENDING"|"RUNNING"|"SUCCEEDED"|"FAILED"|"CANCELLED", output: [{ url }], failure: { code, reason } }` shape, the Luma Dream Machine API reference at `https://docs.lumalabs.ai/reference/luma-dream-machine-api` documenting the `/v1/generations/{id}` polling endpoint with `state: "pending"|"dreaming"|"completed"|"failed"` discriminator and the canonical text-to-video and image-to-video and image-to-image-with-video and text-to-image-with-video workflows including the `last_frame` parameter for first-frame-conditioned-generation that no other video-gen provider offers, the Pika API reference at `https://docs.pika.art/api-reference/Generate/post-generate` documenting `/v1/generate` with 
`pikaframes_*` parameters for keyframe-based generation, the Kling AI API reference at `https://docs.kling.ai/api-reference` documenting Kling 1.5 / Kling 1.6 with text2video and image2video endpoints and `/v1/videos/{task_id}` polling with `task_status: "submitted"|"processing"|"succeed"|"failed"` discriminator and Chinese-localization for prompts, the Hailuo MiniMax video-gen reference at `https://www.minimaxi.com/en/document/api/video` documenting `/v1/video_generation` and `/v1/query/video_generation` polling with `status: "Queueing"|"Processing"|"Success"|"Fail"` discriminator and i2v-01 / t2v-01 model catalog, the Hunyuan Video reference at `https://hunyuan.tencent.com` documenting Tencent's text-to-video offering, the OpenTelemetry GenAI semconv `gen_ai.request.model` (same attribute as chat-completion, but now indexing video-generation models — required for span attribution) and `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` (for video-input-token compound pricing on multimodal models like Sora-2-pro) and `gen_ai.video.generations.count` and `gen_ai.video.duration_seconds` and `gen_ai.video.resolution` and `gen_ai.video.fps` and `gen_ai.video.codec` and `gen_ai.video.task_status` documented attributes (video-gen observability is a documented attribute set with the largest attribute-set yet because video has temporal-resolution-fps dimensions that image does not have), OpenAPI 3.1 spec for `/v1/videos/generations` at `https://github.com/openai/openai-openapi` as canonical machine-readable schema, IANA media-type registry for `video/mp4` / `video/webm` / `video/quicktime` (the canonical content-types for video-generation responses, RFC 6381 for codec parameters within media-types), the Hugging Face Diffusers reference at `https://huggingface.co/docs/diffusers/en/api/pipelines/animatediff` documenting the canonical Python interface for local video-generation with AnimateDiff / Stable Video Diffusion / Mochi-1 / CogVideoX / HunyuanVideo / LTXVideo / WAN2.1 pipeline implementations, the FFmpeg + libavformat reference at `https://ffmpeg.org/ffmpeg-formats.html` documenting the canonical video-codec-and-container conversions that any video-gen client needs for cross-format compatibility (mp4-to-webm, h264-to-h265, h265-to-av1, etc.), the simonw/llm `--video` flag at `https://github.com/simonw/llm` documenting first-class CLI video-input + video-output with provider-aware routing via plugins (`llm-sora`, `llm-veo`, `llm-runway`), the LangChain video-gen integrations at `https://python.langchain.com/docs/integrations/tools/runway/` documenting first-class Python + TypeScript parity with 8+ video-gen-provider integrations (RunwayAPIWrapper / SoraAPIWrapper / VeoAPIWrapper / LumaAPIWrapper / PikaAPIWrapper / KlingAPIWrapper / HailuoAPIWrapper / HunyuanAPIWrapper), the Vercel AI SDK 6 `experimental_generateVideo()` at `https://sdk.vercel.ai/docs/reference/ai-sdk-core/experimental-generate-video` documenting first-class typed surface with provider-aware routing (`@ai-sdk/openai-sora` / `@ai-sdk/google-veo` / `@ai-sdk/runway` / `@ai-sdk/luma` / `@ai-sdk/replicate` / `@ai-sdk/fal` providers), the LiteLLM video-gen reference at `https://docs.litellm.ai/docs/video_generation` documenting proxy-level video-gen covering 12+ providers via OpenAI-compat-shim layer, the portkey.ai video-gen gateway documenting gateway-level video-gen with provider-fallback. 
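
The polling shape those provider docs describe is the primitive the runtime lacks. A hedged sketch of what a `poll_task_until_complete` helper could look like, assuming a tokio runtime; `TaskStatus`, `PollError`, and `fetch_status` are hypothetical names standing in for whatever call backs `GET /v1/videos/{id}` (or Veo's long-running-operation polling, Runway's `GET /v1/tasks/{taskId}`, Luma's `GET /v1/generations/{id}`).

```rust
// Hypothetical runtime primitive; nothing like this exists in rust/crates/runtime.
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub enum TaskStatus {
    Queued,
    InProgress,
    Completed,
    Failed,
    Cancelled,
}

#[derive(Debug)]
pub enum PollError<E> {
    Provider(E), // transport or decode error from the provider call
    TimedOut,    // exceeded the maximum-completion-time budget
    Failed,
    Cancelled,
}

/// Poll a provider task until it completes, fails, or the deadline passes.
/// `interval` ~5s and `timeout` ~30min match the Sora-2 guidance quoted above.
pub async fn poll_task_until_complete<F, Fut, E>(
    mut fetch_status: F,
    interval: Duration,
    timeout: Duration,
) -> Result<(), PollError<E>>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<TaskStatus, E>>,
{
    let deadline = tokio::time::Instant::now() + timeout;
    loop {
        match fetch_status().await.map_err(PollError::Provider)? {
            TaskStatus::Completed => return Ok(()),
            TaskStatus::Failed => return Err(PollError::Failed),
            TaskStatus::Cancelled => return Err(PollError::Cancelled),
            TaskStatus::Queued | TaskStatus::InProgress => {
                // Resume only if the next poll still fits inside the deadline.
                if tokio::time::Instant::now() + interval > deadline {
                    return Err(PollError::TimedOut);
                }
                tokio::time::sleep(interval).await;
            }
        }
    }
}
```
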

**claw-code is the sole client/agent/CLI in the surveyed coding-agent ecosystem with zero `/v1/videos/{generations,edits,extends}` integration AND zero Sora-2/Veo-3/Runway-Gen-4/Luma/Pika/Kling/Hailuo/Hunyuan/Mochi-1/CogVideoX/Stability-Video/BFL-Video partner-routing AND zero `/sora` / `/veo` / `/video` / `/render-video` / `/generate-video` slash command AND zero `claw video` / `claw videos` / `claw generate-video` / `claw render-video` CLI subcommand AND zero OutputContentBlock::Video variant AND zero multipart-form-data transport plumbing for video-edit binary uploads AND zero async-task-polling-primitive at the runtime layer.** All seven gaps are unique to claw-code in the surveyed ecosystem: every other coding-agent peer with video-generation support has at least the OpenAI Sora-2 or Runway Gen-4 integration, every other peer with multimodal output has at least an OutputContentBlock::Video variant for inline-video-in-conversation decoding, and every other peer with long-running generation workflows has at least a TaskPoller / AsyncTask primitive at the runtime layer. The video-generation-API gap is the **upstream prerequisite** of every visual-temporal-output coding-agent affordance in the runtime, and the nine-layer-fusion-shape-with-async-task-polling-primitive is novel within the cluster — #227 closes that upstream prerequisite and is the first cluster member where the async-task-polling-primitive shape-axis is introduced (distinct from #225's full-duplex symmetric-input-output axis, where both InputContentBlock::Audio AND OutputContentBlock::Audio variants are needed simultaneously; distinct from #226's asymmetric-output-only image axis, where only OutputContentBlock::Image is needed but with a synchronous-response model; distinct from #220's input-only image axis, where only InputContentBlock::Image is needed for chat-completion vision-input) — a structural prerequisite that every future endpoint family with provider-asymmetric coverage AND multipart-transport-needs-on-edit-endpoints AND asymmetric-output-only modality coverage AND long-running-async-task workflows will inherit, including the next natural follow-on **#228 candidate 3D-asset-generation API typed taxonomy** (`/v1/3d/generations` for OpenAI Shap-E / Meshy AI / Tripo AI / CSM / Stable Point-Aware-3D — the same nine-layer fusion-shape-with-async-task-polling-primitive but with 3D-mesh-instead-of-video modality, GLB/GLTF/USDZ-binary-output instead of MP4-binary-output, and a per-3d-asset pricing-tier compound-cost model rather than per-second-of-video — the natural extension of #227's shape-axes to a sibling output-only modality with mesh-topology-and-texture-and-material-and-skeletal-rigging dimensions instead of temporal-duration dimensions).

**Repro tests** (compile-time observable, no network):

```rust
// Test 1: No VideoGenerationRequest type exists.
#[test]
fn video_generation_request_type_does_not_exist() {
    // Compile-time observable: rust/crates/api/src/types.rs has 13 typed entries
    // and zero VideoGenerationRequest, VideoEditRequest, VideoExtendRequest,
    // VideoGenerationResponse, VideoObject, VideoQuality, VideoResolution,
    // VideoAspectRatio, VideoDuration, VideoOutputFormat, VideoFrameRate,
    // VideoCodec, VideoStyle, VideoSource, VideoMediaType, VideoTaskStatus,
    // VideoTaskId typed model. The code below would not compile.
    // let _ = VideoGenerationRequest {
    //     model: "sora-2".into(),
    //     prompt: "a sunset over mountains".into(),
    //     duration_seconds: Some(10),
    //     resolution: Some(VideoResolution::Hd1080),
    //     fps: Some(30),
    //     aspect_ratio: Some(VideoAspectRatio::Widescreen),
    //     output_format: Some(VideoOutputFormat::Mp4),
    // };
}

// Test 2: No async-task-polling-primitive at runtime layer.
#[test]
fn no_task_poller_primitive_in_runtime() {
    // Compile-time observable: rust/crates/runtime/src/ has zero TaskPoller,
    // AsyncTask, TaskStatus, TaskId, poll_task_until_complete machinery.
    // The code below would not compile.
    // let task = TaskPoller::new(provider).submit(request).await?;
    // let response = task.poll_until_complete(Duration::from_secs(300)).await?;
}

// Test 3: No OutputContentBlock::Video variant.
#[test]
fn output_content_block_has_no_video_variant() {
    use api::types::OutputContentBlock;
    fn ensure_exhaustive(block: &OutputContentBlock) -> &'static str {
        match block {
            OutputContentBlock::Text { .. } => "text",
            OutputContentBlock::ToolUse { .. } => "tool_use",
            OutputContentBlock::Thinking { .. } => "thinking",
            OutputContentBlock::RedactedThinking { .. } => "redacted_thinking",
            // No Video variant — the four arms above are exhaustive at filing.
            // OutputContentBlock::Video { .. } => "video", // does not compile
        }
    }
    let _ = ensure_exhaustive;
}

// Test 4: No video slash command in SlashCommandSpec.
#[test]
fn no_video_slash_command_in_spec_table() {
    let names = commands::all_slash_command_specs()
        .iter()
        .map(|s| s.name)
        .collect::<Vec<_>>();
    assert!(!names.contains(&"sora"));
    assert!(!names.contains(&"veo"));
    assert!(!names.contains(&"video"));
    assert!(!names.contains(&"render-video"));
    assert!(!names.contains(&"generate-video"));
    assert!(!names.contains(&"runway"));
    assert!(!names.contains(&"luma"));
}

// Test 5: pricing_for_model returns None for video-gen models.
#[test]
fn pricing_for_model_returns_none_for_video_generation() {
    use runtime::pricing_for_model;
    assert!(pricing_for_model("sora-2").is_none());
    assert!(pricing_for_model("sora-2-pro").is_none());
    assert!(pricing_for_model("veo-3").is_none());
    assert!(pricing_for_model("veo-3-fast").is_none());
    assert!(pricing_for_model("runway-gen-4").is_none());
    assert!(pricing_for_model("luma-dream-machine").is_none());
    assert!(pricing_for_model("pika-2.0").is_none());
    assert!(pricing_for_model("kling-1.5").is_none());
    assert!(pricing_for_model("hailuo-i2v-01").is_none());
    assert!(pricing_for_model("hunyuan-video").is_none());
    assert!(pricing_for_model("mochi-1").is_none());
    assert!(pricing_for_model("cogvideox-5b").is_none());
    // ModelPricing has only four text-token-only fields.
    // Zero video_per_second_cost_usd, zero video_per_minute_cost_usd,
    // zero video_input_token_cost_per_million, zero video_output_token_cost_per_million.
    // The five-dimensional pricing matrix (per-model × per-resolution × per-fps ×
    // per-duration × per-extension-vs-generation) is the largest pricing-tier
    // extension yet catalogued, exceeding #226's four-dimensional image matrix.
}
```
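
For the pricing gap that Test 5 demonstrates, a hedged sketch of one possible extension shape. `VideoPricing` and its fields are hypothetical; the multipliers in the comments reuse the Sora-2 per-second figures quoted above as placeholder values, not a committed pricing table.

```rust
// Hypothetical sketch of a per-second video cost model with resolution/fps
// multipliers and an extension discount; nothing like this exists in
// rust/crates/runtime/src/usage.rs today.
#[derive(Debug, Clone, Copy)]
pub struct VideoPricing {
    pub base_cost_per_second_usd: f64, // e.g. 0.30 for sora-2 at 480p
    pub resolution_multiplier: f64,    // e.g. 1.0 (480p), 2.0 (720p), 4.0 (1080p)
    pub fps_multiplier: f64,           // e.g. 1.0 at 24/30 fps, 1.5 at 60 fps
    pub extension_discount: f64,       // e.g. 0.5 when extending an existing video
}

impl VideoPricing {
    /// Cost of one render: seconds × per-second base × resolution × fps,
    /// optionally discounted when the request extends an existing video.
    pub fn cost_usd(&self, duration_seconds: f64, is_extension: bool) -> f64 {
        let mut cost = duration_seconds
            * self.base_cost_per_second_usd
            * self.resolution_multiplier
            * self.fps_multiplier;
        if is_extension {
            cost *= self.extension_discount;
        }
        cost
    }
}

// Example with the placeholder numbers above: a 10-second 720p clip at the
// quoted $0.60/sec tier -> 10.0 * 0.30 * 2.0 * 1.0 = $6.00.
```
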

**Status:** Open. No code changed. Filed 2026-04-26 04:08 KST. Branch: feat/jobdori-168c-emission-routing. HEAD: 897055a (post-#226). Sibling-shape cluster: 26 pinpoints. Wire-format-parity cluster: 17 members. Capability-parity cluster: 9 members. Multimodal-IO cluster: 5 members (#220 image-input + #224 embedding-output + #225 audio-bidirectional + #226 image-output + #227 video-output). Cross-cutting-data-pipeline cluster: 4 members. Multipart-transport cluster: 4 members. Provider-asymmetric-delegation cluster: 4 members (the largest partner-set yet at twelve-plus partners for #227). **Nine-layer-fusion-shape-with-async-task-polling-primitive** matches #225's nine-layer count but with the novel async-task-polling-primitive axis replacing the symmetric-input-output content-block axis — the largest fusion-shape gap catalogued so far, the upstream prerequisite of every visual-temporal-output coding-agent affordance, and the first cluster member where async-task-polling-primitive becomes a structural prerequisite of the dispatch layer. Distinct from prior cluster members; novel and applies to follow-on candidate 3D-asset-generation API typed taxonomy (#228 candidate inheriting the same nine-axis shape with mesh-modality and per-asset-pricing).

🪨