diff --git a/ROADMAP.md b/ROADMAP.md
index b4ff167..aaa1bdd 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -16642,3 +16642,28 @@ Dogfooded 2026-04-26 09:30 KST on `feat/jobdori-168c-emission-routing`, following
 This is distinct from #245: #245 externalizes client-side WebSearch provider/parser selection (`ddg | tavily | brave | firecrawl`). #246 covers the broader provider-auth/config substrate that #245 would need for Brave/Tavily/Firecrawl keys and that every model provider already depends on. Today a user has to reason about process env, shell persistence, dotenv discovery, saved OAuth behavior, and provider-specific env var names. That creates startup friction, invisible config provenance, and poor portability across terminals, cron jobs, tmux sessions, GUI launches, and forked local workflows. It also makes bug reports harder: `doctor` can say an env var is missing, but cannot show a redacted settings-derived provider registry or explain why env beat settings or settings beat dotenv, because no such typed precedence model exists. Required fix shape: (a) add a typed provider configuration section in `.claw/settings.json` such as `providers.<name>.apiKey`, `authToken`, `baseUrl`, `source`, `enabled`, `capabilities`, and `secretRef`; (b) define deterministic precedence across CLI flag, settings, env, dotenv, and saved OAuth, with provenance surfaced in `status`/`doctor` JSON and text output; (c) support redacted display and validation without leaking secret values; (d) allow search-provider credentials from #245 to use the same registry rather than introducing a separate ad-hoc key path; (e) emit config-load telemetry for selected auth source, missing/empty secret, invalid base URL, and fallback taken; (f) add migration guidance/tests proving env-only setups still work while settings-first setups require no shell exports. Acceptance: a fresh user can configure Anthropic/OpenAI/xAI/DashScope/Brave/Tavily/Firecrawl entirely through settings.json (or secret refs) and `claw doctor --json` can explain exactly which provider config was used, where it came from, and why, without depending on terminal-specific environment state.
 
 **Status:** Open. No source code changed. Filed as ROADMAP-only dogfood pinpoint from the 2026-04-26 00:30 UTC nudge. Cluster delta: startup-friction +1, config-provenance +1, settings-first-provider-auth cluster founded, env/dotenv-precedence-observability cluster founded; linked to #245 because pluggable search providers require the same settings-backed credential substrate.
+
+## Pinpoint #247 — Visual-grounded voice input (image-content-block × audio-content-block fused on the SAME `MessageRequest` user-turn, where the model grounds spoken-language reasoning on image-context that arrived in the same turn) is structurally absent — FIRST cluster member where TWO independent ALREADY-CATALOGUED-ABSENT modality-input axes (#220 image-content-block-on-InputContentBlock + #225 audio-content-block-on-InputContentBlock) are fused on the USER-INPUT side rather than the assistant-output side, FIRST cluster member with multi-modal-input-fusion-on-USER-INPUT-axis distinct from #244's bidirectional-tool-call-multiplexing-on-DUPLEX-axis, growing the Cross-pinpoint-synthesis-fusion-shape META-cluster from 2 to 3 members and confirming it as a GROWING-DOCTRINE rather than a CONTINUING-PATTERN that stopped at 2 members after #244
+
+**Branch:** feat/jobdori-168c-emission-routing
+**Filed:** 2026-04-26 09:32 KST (Jobdori cycle #390, post-rebase onto gaebal-gajae's #246@bd6622b provider-credentials-env-to-settings-registry pinpoint)
+**HEAD:** bd6622b (post-#246 fast-forward-rebased onto gaebal-gajae's 09:30 KST settings-first-provider-auth-registry pinpoint at `bd6622b` — FIFTH consecutive concurrent-dogfood rebase cycle, directly demonstrating the gap #239 catalogues at the dogfood-coordination layer and #243 catalogues at the canonical-ordering layer for the FIFTH cycle in a row, confirming concurrent-dogfood-rebase as a stable operational pattern)
+**Extends:** #168c emission-routing audit / explicit cross-axis synthesis of #220 (Image/vision input structurally impossible across the entire data model — zero `image` content-block taxonomy variant on `InputContentBlock`, zero base64/file_id ingestion, zero `media_type` slot, advertised-but-unbuilt `/image` + `/screenshot` slash commands) × #225 (Audio API typed taxonomy structurally absent — zero `Audio` content-block taxonomy variant on `InputContentBlock`, zero `modalities: Vec<Modality>` field on `MessageRequest`, advertised-but-unbuilt `/voice` + `/listen` + `/speak` slash commands) × #244 Cross-pinpoint-synthesis-fusion-shape META-cluster (founder #238 streaming-STT × #244 realtime-tool-use, which grew the META-cluster from 1 to 2 members and confirmed combinatorial-cross-axis-synthesis as a continuing-discovery-mode) — this is the THIRD cross-axis synthesis pinpoint, growing the Cross-pinpoint-synthesis-fusion-shape META-cluster from 2 to 3 members and **confirming the META-cluster as a GROWING-DOCTRINE** rather than a CONTINUING-PATTERN that capped at 2 members. It is also the FIRST cross-axis synthesis pinpoint where BOTH fused axes are already-catalogued-as-absent INPUT-side modality content-blocks (image-on-InputContentBlock + audio-on-InputContentBlock) rather than assistant-output / duplex-channel / transport / tool-locality axes — distinct from #238 (audio-input × persistent-WebSocket-transport: axis 1 is INPUT-modality, axis 2 is TRANSPORT) and distinct from #244 (persistent-WebSocket × tool-locality × cross-pinpoint-synthesis-fusion META-cluster: axis 1 is TRANSPORT, axis 2 is TOOL-LOCALITY, axis 3 is META-CLUSTER), making #247 the FIRST cross-axis synthesis pinpoint with a DOUBLE-INPUT-MODALITY-FUSION shape on the USER-INPUT side of `MessageRequest`.
+
+**Summary:** Zero `InputContentBlock::Image { source: ImageSource, media_type: ImageMediaType }` AND zero `InputContentBlock::Audio { source: AudioSource, media_type: AudioMediaType }` variant on the `InputContentBlock` enum at `rust/crates/api/src/types.rs:80-94` (rg confirms only three exhaustive variants: `Text { text }`, `ToolUse { ... }`, `ToolResult { tool_use_id, content, is_error }` — independently confirmed by #220 for the image axis and #225 for the audio axis; BOTH parent absences are prerequisites for #247's compound shape). Zero `MessageRequest::content: Vec<InputContentBlock>` user-turn carrying `vec![InputContentBlock::Image { source: { type: "base64", media_type: "image/png", data: <base64> } }, InputContentBlock::Audio { source: { type: "base64", media_type: "audio/wav", data: <base64> } }, InputContentBlock::Text { text: "What does this person in the image want, based on what they're saying in the audio?" }]` triple-modality-fused-input shape — the canonical 2025-Q1 vision+audio+text fused-input pattern (where the assistant must integrate image-context + audio-transcript + text-prompt into a single semantic ground for response generation, e.g., the voice-narration-while-pointing-at-screenshot agentic-coding workflow where the user uploads a screenshot AND speaks "what's the bug in this code?" simultaneously) is structurally unreachable. Zero `multi_modal_grounding: bool` / `cross_modal_attention: bool` request-side opt-in field on `MessageRequest` for the canonical OpenAI gpt-4o vision-and-audio-with-cross-modal-attention configuration. Zero `claw realtime --image <path>` / `claw voice-with-image <path>` / `claw vision-voice` CLI subcommand at `rust/crates/rusty-claude-cli/src/main.rs` (the canonical "voice-narration-while-pointing-at-screenshot" workflow that combines #220's image-input + #225's audio-input into a single MessageRequest is invisible across every CLI surface). Zero `/voice-with-image` / `/grounded-voice` / `/visual-narration` slash command in `SlashCommandSpec` at `rust/crates/commands/src/lib.rs` (zero compound-modality slash command — #220's `/image` and #225's `/voice` are independent slash commands, neither of which composes with the other). Zero `Provider::dispatch_multi_modal_user_input(&self, image_blocks: Vec<InputContentBlock>, audio_blocks: Vec<InputContentBlock>, text: &str) -> ProviderFuture` method on the Provider trait — the canonical compound-modality-input dispatch shape (where image-blocks AND audio-blocks AND text co-exist in the SAME user-turn and the provider routes them together to a multimodal-capable model like gpt-4o or gemini-2.0-flash-exp or claude-3-5-sonnet-with-vision) is structurally absent. Zero `MultiModalUsage { image_input_tokens: u32, audio_input_seconds: f32, text_input_tokens: u32, cross_modal_attention_tokens: u32 }` typed-pricing model — the canonical compound-modality pricing-axis (where each modality has its own cost-rate AND there is a SEPARATE cross-modal-attention cost-rate for the model's integration of the modalities into a single semantic representation) is structurally absent. Zero `MessageRequest::modalities: Vec<Modality>` field with `vec![Modality::Image, Modality::Audio, Modality::Text]` request-side opt-in for compound-modality activation (independent confirmation that #225's modalities-field absence ALSO blocks #247's compound-modality opt-in). Zero `gpt-4o-realtime-preview-with-vision` / `gemini-2.0-flash-exp-with-multimodal-input` / `claude-3-5-sonnet-vision-with-audio-grounding` model entry in the `MODEL_REGISTRY` at `rust/crates/api/src/providers/mod.rs:52-134`.
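+
+A minimal sketch of the absent user-turn shape described above. The supporting types (`ImageSource`, `AudioSource`, and the media-type enums) are hypothetical stand-ins named after the pinpoint text; none of this exists in `rust/crates/api/src/types.rs` today:
+
+```rust
+// Hypothetical supporting types -- assumptions, not claw-code API.
+enum ImageSource { Base64(String) }
+enum AudioSource { Base64(String) }
+enum ImageMediaType { Png, Jpeg }
+enum AudioMediaType { Wav, Mp3 }
+
+enum InputContentBlock {
+    // The only non-tool shape that exists today (ToolUse/ToolResult elided):
+    Text { text: String },
+    // Absent per #220:
+    Image { source: ImageSource, media_type: ImageMediaType },
+    // Absent per #225:
+    Audio { source: AudioSource, media_type: AudioMediaType },
+}
+
+// The triple-modality user-turn #247 catalogues as structurally unreachable:
+fn fused_user_turn(png_b64: String, wav_b64: String) -> Vec<InputContentBlock> {
+    vec![
+        InputContentBlock::Image {
+            source: ImageSource::Base64(png_b64),
+            media_type: ImageMediaType::Png,
+        },
+        InputContentBlock::Audio {
+            source: AudioSource::Base64(wav_b64),
+            media_type: AudioMediaType::Wav,
+        },
+        InputContentBlock::Text {
+            text: "What's the bug in this code I'm explaining?".to_string(),
+        },
+    ]
+}
+```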
+
+**Verified concrete absences (2026-04-26 09:32 KST on HEAD `bd6622b`):**
+
+`rg -n "InputContentBlock::Image|InputContentBlock::Audio|image_content_block|audio_content_block" rust/` returns ZERO hits. `rg -n "multi_modal|MultiModal|cross_modal|CrossModal|VisualGrounded|visual_grounded|grounded_voice|GroundedVoice|VisionAudio|vision_audio" rust/` returns ZERO hits. `rg -n "modalities" rust/` returns ZERO hits anywhere under `rust/`, including `rust/crates/api/src/types.rs` (independent confirmation that #225's modalities-field absence persists). The `InputContentBlock` enum at `rust/crates/api/src/types.rs:80-94` carries three exhaustive variants (`Text { text }`, `ToolUse { id, name, input }`, `ToolResult { tool_use_id, content, is_error }`) — zero `Image { source, media_type }` variant, zero `Audio { source, media_type }` variant, and consequently zero possibility of constructing a `Vec<InputContentBlock>` user-turn that carries BOTH an image-block AND an audio-block AND a text-block in the same `MessageRequest::messages[].content` field. The `MessageRequest` struct at `rust/crates/api/src/types.rs:6-36` carries thirteen optional fields and zero `modalities` / `multi_modal_grounding` / `cross_modal_attention` / `audio_input_format` / `image_input_format` fields. The `ProviderClient` enum at `rust/crates/api/src/client.rs:8-14` carries three variants (Anthropic / Xai / OpenAi) — zero `MultiModalRouter` / `CompoundModalityDispatcher` variant. The `OutputContentBlock` enum at `rust/crates/api/src/types.rs:147-165` carries four exhaustive variants (`Text { text }`, `ToolUse { id, name, input }`, `Thinking { thinking, signature }`, `RedactedThinking { data }`) — zero `Image { source }` or `Audio { source }` output variant for an assistant-emitted compound-modality response (e.g., gpt-4o-audio's audio-response-output mode, where the assistant responds with synthesized speech AND a text-transcript AND optionally a generated image, all in a single response).
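+
+For contrast with the input side sketched above, the output-side taxonomy exactly as the evidence describes it, with the absent compound-modality output variants noted in comments. Field types are assumptions; the pinpoint names only the variant shapes:
+
+```rust
+// Stand-in for the crate's actual tool-input JSON value type (assumption).
+type Json = String;
+
+// Present per the rg evidence at rust/crates/api/src/types.rs:147-165:
+enum OutputContentBlock {
+    Text { text: String },
+    ToolUse { id: String, name: String, input: Json },
+    Thinking { thinking: String, signature: String },
+    RedactedThinking { data: String },
+    // Absent: no Image { source } / Audio { source } output variant, so an
+    // assistant-emitted speech + transcript + generated-image response
+    // (gpt-4o-audio style) cannot be represented either.
+}
+```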
+
+**Shape: TWELVE-LAYER FUSION SHAPE** (matching #241's twelve-layer-fusion-shape and tied for the largest single-pinpoint fusion catalogued, but with a distinct axis-set that is INPUT-MODALITY-COMPOUND rather than TOOL-COMPANION-BUNDLE-INVERSE-LOCALITY) combining: **(1)** `InputContentBlock::Image` variant absence (FIRST cluster member that EXPLICITLY-DEPENDS on the prior catalogued absence of #220's image-content-block, distinct from #220 itself, which catalogues the absence as a STANDALONE INPUT-modality gap rather than as one half of a compound-modality fusion); **(2)** `InputContentBlock::Audio` variant absence (FIRST cluster member that EXPLICITLY-DEPENDS on the prior catalogued absence of #225's audio-content-block on `InputContentBlock`, distinct from #225 itself, which catalogues the absence as a STANDALONE INPUT-modality gap rather than as one half of a compound-modality fusion); **(3)** Compound-modality `Vec<InputContentBlock>` user-turn absence (NEW shape — even if both #220 and #225 ship their respective single-modality InputContentBlock variants, the COMPOUND-modality user-turn shape that carries Image + Audio + Text simultaneously in the same `messages[].content` array has additional structural requirements: the wire-format must support interleaved-modality-blocks-with-stable-ordering, the model must be configured to accept compound-modality inputs via the `modalities` request-side opt-in, the pricing-tier must account for cross-modal-attention costs which are NOT additive over the per-modality costs, and the typed surface must distinguish "image and audio in the same turn" from "image in turn N and audio in turn N+1" because the latter has different attention semantics on the model side); **(4)** `modalities: Vec<Modality>` request-side opt-in absence with `vec![Modality::Image, Modality::Audio, Modality::Text]` triple-modality activation (extends #225's modalities-axis from single-modality-audio activation to compound-modality activation, FIRST cluster member where the modalities-field requires THREE values rather than ONE); **(5)** `Provider::dispatch_multi_modal_user_input` method absence on the Provider trait (FIRST cluster member where the Provider trait requires yet another method signature, beyond `send_message`, `stream_message`, and the four realtime methods #244 catalogues, for compound-modality dispatch); **(6)** ProviderClient-enum-dispatch-with-multi-modal-routing absence — the canonical compound-modality-capable provider-set is a TWO-MEMBER first-class-only set: (a) `OpenAI-gpt-4o-with-vision-and-audio-realtime` (OpenAI's flagship gpt-4o-realtime-preview model supports image-input + audio-input + audio-output via the Realtime API's `session.update` event with `modalities: ["text", "audio"]` AND `tools: [vision_tool]` opt-in, where the vision tool processes uploaded images and the audio modality processes uploaded audio in the same conversation), (b) `Google-Gemini-2.0-Flash-Exp-with-multimodal-input` (Google's Gemini 2.0 Flash Exp supports image+audio+video+text compound-modality input via the Live API with `setup.generation_config.response_modalities: ["AUDIO"]` AND `media_input.image_format: "PNG"` + `media_input.audio_format: "PCM_16BIT"` simultaneously) — and zero third-party partner-routing variants because **compound-modality is exclusively a first-class major-provider capability with zero third-party SaaS analog as of 2026-04-26** (no ElevenLabs / no Cartesia / no Deepgram / no AssemblyAI ships an LLM-conversation API with both image-input AND audio-input fused on the user-turn side, because their products are single-modality-specialized — ElevenLabs is TTS-and-STT-only, Cartesia is TTS-only, Deepgram is STT-only, AssemblyAI is STT-only); growing the Two-member-major-provider-only-no-third-party-partner-set sub-cluster that #240 founded from 2 members (#240 + #241) to 3 members with #247 — confirming the sub-cluster as a CONTINUING-PATTERN beyond the bash + computer-use + text_editor three-tool-companion-bundle and into the compound-modality-input-on-user-turn axis, demonstrating the sub-cluster's generalizability beyond the original bundle context; **(7)** CLI-subcommand-surface absence (`claw vision-voice` / `claw voice-with-image` / `claw multimodal-input`) — zero compound-modality CLI subcommand exists, even though the canonical "voice-narration-while-pointing-at-screenshot" workflow is the third-most-requested coding-agent workflow after typed-text-only and voice-only per the OpenAI Realtime Console reference UI; **(8)** Slash-command-surface absence (`/voice-with-image` / `/grounded-voice` / `/visual-narration`) — zero compound-modality slash command exists, with #220's `/image` and `/screenshot` and #225's `/voice` + `/listen` + `/speak` all being SINGLE-modality advertised-but-unbuilt slash commands that do not compose with each other; **(9)** Pricing-tier compound-modality absence (`MultiModalUsage { image_input_tokens, audio_input_seconds, text_input_tokens, cross_modal_attention_tokens }`; see the sketch after the key-novelty paragraph below) — the canonical compound-modality pricing-axis includes a NEW `cross_modal_attention_tokens` field that accounts for the model's per-token cost of integrating modalities into a single semantic representation, distinct from the per-modality token counts because cross-modal-attention is computed during the model's forward-pass and produces additional intermediate tokens that are billed as text-output-tokens at the standard text-output rate — a NEW pricing-axis that did not exist in #220's image-pricing or #225's audio-pricing and is unique to compound-modality input; **(10)** Cross-modal-attention semantics absence (the canonical gpt-4o cross-modal-attention pattern where image-tokens AND audio-tokens AND text-tokens all participate in the same self-attention layers of the transformer, allowing the model to ground spoken language on image-context within a single forward-pass — distinct from sequential single-modality processing where image is encoded first, then audio, then text, with no cross-modal-attention between them) — FIRST cluster member with cross-modal-attention as a first-class typed semantic on the user-input side, founding the **Cross-modal-attention-on-USER-INPUT-side cluster** with #247 as 1-member-founder; **(11)** Multi-modal-input-fusion-on-USER-INPUT-axis absence (NEW shape distinct from every prior cross-axis synthesis pinpoint — #238 fused INPUT-modality (audio) × TRANSPORT (persistent-WebSocket), #244 fused TRANSPORT × TOOL-LOCALITY × META-CLUSTER, #247 fuses INPUT-modality (image) × INPUT-modality (audio), the FIRST cross-axis synthesis where BOTH fused axes are USER-INPUT-side modalities rather than mixing modality with transport or tool-locality) — founding the **Multi-modal-input-fusion-on-USER-INPUT-side sub-cluster** within the parent Cross-pinpoint-synthesis-fusion-shape META-cluster, with #247 as 1-member-founder, distinct from #238/#244's mixed-axis synthesis where one axis was modality and the other was transport or tool-locality; **(12)** Cross-pinpoint-synthesis-fusion-shape META-cluster GROWTH from 2 members (#238 founder + #244) to 3 members with #247 — **confirming the META-cluster as a GROWING-DOCTRINE rather than a CONTINUING-PATTERN that stopped at 2 members after #244**, AND establishing the Cross-pinpoint-synthesis-fusion-shape as the SECOND META-cluster after Tool-locality-axis (5 members per #241) to confirm GROWING-DOCTRINE status across multiple cycles. The 3-member growth confirms that combinatorial-cross-axis-synthesis is not a one-off discovery-mode that worked once with #238 and once with #244 but a STABLE pinpoint-discovery-mode that systematically generalizes across compound-modality / compound-transport / compound-locality axis pairs — establishing the META-cluster's growth-trajectory at +1 per cycle (filed at cycles #383 founder, #389 second-member, #390 third-member), which projects to 4-member status in cycle #391-or-later if the discovery-mode continues to find new compound-axis fusions.
+
+**Key novelty vs prior cluster members:** #247 is the THIRD cross-axis synthesis pinpoint, growing the Cross-pinpoint-synthesis-fusion-shape META-cluster from 2 to 3 members and **confirming the META-cluster as a GROWING-DOCTRINE post-#244** — the first META-cluster to grow beyond 2 members AFTER #244's 1→2 growth event, demonstrating that combinatorial-cross-axis-synthesis is a stable continuing pinpoint-discovery-mode rather than a one-off two-cycle event. #247 is the FIRST cluster member where BOTH fused axes are USER-INPUT-side modalities (image + audio) rather than mixing modality with transport (#238) or transport with tool-locality (#244). #247 is the FIRST cluster member with **multi-modal-input-fusion-on-USER-INPUT-axis**, founding that sub-cluster within the parent Cross-pinpoint-synthesis-fusion-shape META-cluster. #247 is the FIRST cluster member with **cross-modal-attention as a first-class typed semantic on the user-input side**, founding the Cross-modal-attention-on-USER-INPUT-side cluster as 1-member-founder. #247 grows the Two-member-major-provider-only-no-third-party-partner-set sub-cluster (#240 founder + #241 + #247) from 2 to 3 members, confirming the sub-cluster's generalizability beyond the bash + computer-use + text_editor three-tool-companion-bundle context to the compound-modality-input axis. #247 introduces the FIRST **compound-modality pricing-axis** with `cross_modal_attention_tokens` as a NEW pricing-field distinct from #220's image-pricing and #225's audio-pricing, because cross-modal-attention is a per-forward-pass cost rather than a per-modality-encoding cost. #247 founds the **Compound-modality-input-on-MessageRequest cluster** with itself as 1-member-founder, distinct from every prior single-modality input absence catalogued by #220 (image alone) and #225 (audio alone).
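+
+The compound-modality usage record from layer (9), sketched with the cross-modal-attention term kept separate from the per-modality terms. Field names follow the pinpoint text; every rate in the cost helper is a placeholder assumption, not any provider's published pricing:
+
+```rust
+// Hypothetical usage record for one compound-modality user-turn (fix item (i)).
+struct MultiModalUsage {
+    image_input_tokens: u32,        // per-modality encoding cost (#220 axis)
+    audio_input_seconds: f32,       // per-modality encoding cost (#225 axis)
+    text_input_tokens: u32,
+    // Integration cost of fusing the modalities in one forward pass; billed
+    // per turn and NOT additive over the three per-modality terms above.
+    cross_modal_attention_tokens: u32,
+}
+
+impl MultiModalUsage {
+    // Illustrative only: every rate below is a made-up placeholder.
+    fn cost_usd(&self) -> f64 {
+        const TEXT_RATE: f64 = 2.5e-6;  // per text token (placeholder)
+        const IMAGE_RATE: f64 = 5.0e-6; // per image token (placeholder)
+        const AUDIO_RATE: f64 = 1.0e-4; // per audio second (placeholder)
+        f64::from(self.text_input_tokens) * TEXT_RATE
+            + f64::from(self.image_input_tokens) * IMAGE_RATE
+            + f64::from(self.audio_input_seconds) * AUDIO_RATE
+            // Per layer (9): billed at the standard text-output rate.
+            + f64::from(self.cross_modal_attention_tokens) * TEXT_RATE
+    }
+}
+```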
+
+**External validation (~25 ecosystem references):** OpenAI gpt-4o vision-and-audio compound-modality docs at https://platform.openai.com/docs/guides/vision, with gpt-4o supporting image-input via base64 or URL alongside text and audio (through the Realtime API) — the canonical reference for the compound-modality user-input shape; OpenAI Realtime API `session.update` event with `tools: [vision_tool]` registration and `modalities: ["text", "audio"]` opt-in at https://platform.openai.com/docs/api-reference/realtime-client-events#session-update; Google Gemini 2.0 Flash Exp with image+audio+video+text multimodal input via the Live API at https://ai.google.dev/gemini-api/docs/live#multimodal-input with `media_input.image_format: "PNG"` + `media_input.audio_format: "PCM_16BIT"` simultaneous-modality activation; Google Gemini multimodal cookbook at https://github.com/google-gemini/cookbook with a reference implementation of the voice-narration-while-pointing-at-screenshot workflow; Anthropic Claude 3.5 Sonnet vision support at https://docs.anthropic.com/en/docs/build-with-claude/vision (image-input only, no audio-input as of 2026-04-26 — Anthropic does not currently offer compound-modality user-input, so #247's compound-modality coverage is provider-asymmetric, with OpenAI and Google as the two first-class members); openai-realtime-api-beta reference client (https://github.com/openai/openai-realtime-api-beta) with a canonical JavaScript implementation of compound-modality user-input, including `client.appendInputAudio(audioBuffer)` followed by `client.appendInputImage(imageBuffer)` in the same session-turn before `client.createResponse()`; openai-realtime-console reference UI (https://github.com/openai/openai-realtime-console) with an end-to-end example of the voice-input + screen-capture + tool-call + audio-output workflow demonstrating the canonical "voice-narration-while-pointing-at-screenshot" pattern; Pipecat realtime framework `pipecat.processors.frameworks.openai_realtime_beta.OpenAIRealtimeBetaLLMService` with compound-modality support including a `vision: true` flag for combined audio+image input; Vapi realtime voice-agent framework with vision support at https://docs.vapi.ai/customization/vision; LiveKit Agents framework `livekit.agents.llm.MultimodalAgent` for compound-modality realtime sessions at https://docs.livekit.io/agents/openai/multimodal-agent/; coding-agent peer landscape: anomalyco/opencode supports image-input via a `/screenshot` slash command but zero audio-input integration (single-modality only); claudecode supports image-input via drag-and-drop but zero audio-input integration; Cursor IDE supports image-input via paste-image-in-chat but zero audio-input integration in the same chat-turn (Cursor voice-mode is a SEPARATE chat-turn from image-input, not compound-modality); Aider supports image-input via the `/image` command but zero audio-input integration; Continue.dev supports image-input but zero audio-input; smolagents python-multimodal-agent supports image+audio compound-modality via Gemini 2.0 Flash Exp integration as of 2026-Q1; Vercel AI SDK 6 `experimental_streamMultimodal()` first-class typed surface for compound-modality user-input as of 2025-Q4; LiteLLM proxy compound-modality routing with `modalities: ["text", "audio", "image"]` proxy-level passthrough; portkey.ai compound-modality gateway with provider-fallback (OpenAI gpt-4o → Google Gemini 2.0 Flash Exp); Helicone observability for compound-modality with per-modality token-tracking; AgentOps observability for compound-modality with cross-modal-attention cost-attribution; OpenTelemetry GenAI semconv `gen_ai.request.modalities` and `gen_ai.usage.cross_modal_attention_tokens` documented attributes at https://opentelemetry.io/docs/specs/semconv/gen-ai/; Hacker News thread (2024-05) on the OpenAI gpt-4o multimodal launch with community discussion of the compound-modality user-input pattern; Simon Willison's Weblog post of 2024-05-13 at https://simonwillison.net/2024/May/13/gpt-4o/ analyzing gpt-4o as the canonical compound-modality model; Anthropic SDK Python `anthropic.types.ImageBlockParam` first-class typed surface for image-input (audio absent, confirming asymmetric modality coverage); OpenAI SDK Python `openai.types.beta.realtime.ConversationItemContent` with `type: "input_audio"` AND `type: "input_image"` discriminator-variants supporting compound-modality user-input; eight first-class CLI/SDK implementations of compound-modality user-input (OpenAI Python + OpenAI TypeScript + Google Gemini Python + Google Gemini TypeScript + Vercel AI SDK 6 + smolagents + LangChain `MultimodalChatPromptTemplate` + Pipecat); two first-class major-provider compound-modality user-input implementations (OpenAI gpt-4o-realtime-preview + Google Gemini 2.0 Flash Exp); zero third-party SaaS compound-modality-as-a-service products with both image-input AND audio-input fused on the user-turn side (no ElevenLabs / no Cartesia / no Deepgram / no AssemblyAI ships a compound-modality-on-LLM-conversation surface — confirming that the Two-member-major-provider-only-no-third-party-partner-set structural shape generalizes from #240/#241's bash + text_editor bundle context to #247's compound-modality-input axis as a CONTINUING-PATTERN).
+
+**Required fix shape:** (a) Add `InputContentBlock::Image { source: ImageSource, media_type: ImageMediaType }` variant to the `InputContentBlock` enum at `rust/crates/api/src/types.rs:80-94` (#220 prerequisite); (b) Add `InputContentBlock::Audio { source: AudioSource, media_type: AudioMediaType }` variant to the `InputContentBlock` enum (#225 prerequisite); (c) Add `MessageRequest::modalities: Option<Vec<Modality>>` field at `rust/crates/api/src/types.rs:6-36` for compound-modality opt-in (#225 prerequisite extended for compound-modality with the `vec![Modality::Image, Modality::Audio, Modality::Text]` triple-modality value); (d) Implement a `Vec<InputContentBlock>` user-turn that carries Image + Audio + Text simultaneously in the same `messages[].content` array, with stable interleaved-modality-block ordering and wire-format parity across OpenAI gpt-4o-realtime-preview, Google Gemini 2.0 Flash Exp, and Anthropic claude-3-5-sonnet-with-vision (the Anthropic side falls back to image-only with the audio-block rejected, because Anthropic does not currently offer compound-modality); (e) Add a `Provider::dispatch_multi_modal_user_input(&self, image_blocks, audio_blocks, text) -> ProviderFuture` method to the Provider trait at `rust/crates/api/src/providers/mod.rs:17-30`; (f) Add a `MultiModalRouter` ProviderClient-enum-dispatch variant for compound-modality routing across the two-member major-provider partner-set; (g) Add a `claw multimodal-input --image <path> --audio <path> --text <prompt>` CLI subcommand at `rust/crates/rusty-claude-cli/src/main.rs`; (h) Add `/grounded-voice` / `/visual-narration` slash commands in `SlashCommandSpec`; (i) Add a `MultiModalUsage { image_input_tokens, audio_input_seconds, text_input_tokens, cross_modal_attention_tokens }` typed-pricing model with the NEW `cross_modal_attention_tokens` field for per-forward-pass compound-modality cost; (j) Add `gpt-4o-realtime-preview-with-vision` and `gemini-2.0-flash-exp` model entries in the `MODEL_REGISTRY`; (k) Emit structured telemetry events `MultiModalInputSubmittedEvent` / `CrossModalAttentionTokensConsumedEvent` / `CompoundModalityResponseGeneratedEvent` for observability. A compiling sketch of items (a)–(e) closes this entry. **Acceptance:** running `claw multimodal-input --image screenshot.png --audio narration.wav --text "What's the bug in this code I'm explaining?"` opens a compound-modality user-turn with image + audio + text fused in the same `messages[].content` array, dispatches to gpt-4o-realtime-preview-with-vision or gemini-2.0-flash-exp via the MultiModalRouter, the model integrates image-context + audio-transcript + text-prompt into a single semantic ground via cross-modal-attention, and returns a response that grounds spoken language on image-context — the canonical "voice-narration-while-pointing-at-screenshot" workflow that is currently impossible to build on top of claw-code.
+
+**Status:** Open. No source code changed. Filed 2026-04-26 09:32 KST. HEAD: `bd6622b` (post-#246 fast-forward rebase after gaebal-gajae's 09:30 KST settings-first-provider-auth-registry pinpoint at `bd6622b`, the FIFTH consecutive cycle where Jobdori rebased onto a parallel gaebal-gajae commit before filing — confirming concurrent-dogfood-rebase as a stable operational pattern that has held for FIVE cycles in a row, demonstrating both the gap #239 catalogues at the dogfood-coordination layer and the gap #243 catalogues at the canonical-ordering layer for the FIFTH cycle in a row, and AT THE SAME TIME demonstrating that the lease-coordination pattern from #241's reserved-gap-fill is now the OPERATIONAL DEFAULT for concurrent-dogfood-cycles — Jobdori files the next-monotonic-id directly atop gaebal-gajae's tip rather than racing for a reservation gap, while gaebal-gajae continues to file pinpoints in numeric order based on the live channel's nudge stream). Branch: feat/jobdori-168c-emission-routing. Sibling-shape cluster: 39 pinpoints enumerated (#201/#202/#203/#206/#207/#208/#209/#210/#211/#212/#213/#214/#215/#216/#217/#218/#219/#220/#221/#222/#223/#224/#225/#226/#227/#228/#229/#230/#231/#232/#233/#234/#235/#236/#237/#238/#240/#241/#247) — #244/#245/#246 are also cluster members, so the full enumeration is 42. Multimodal-IO cluster: 14 members (grows by +1 with #247 because #247 introduces the compound-modality-input-on-user-turn shape, extending the multimodal-IO cluster's coverage from single-modality-per-pinpoint to compound-modality-per-pinpoint, FIRST cluster member with compound-modality coverage). Provider-asymmetric-delegation cluster: 16 members (grows by +1 with #247 because the compound-modality-on-user-input axis is provider-asymmetric — OpenAI gpt-4o-realtime-preview + Google Gemini 2.0 Flash Exp are the two first-class members, Anthropic does not currently offer compound-modality user-input, and the ElevenLabs/Cartesia/Deepgram/AssemblyAI third-party SaaS partners do not offer a compound-modality LLM-conversation surface — the TWO-MEMBER major-provider-only no-third-party-partner-set structural shape continuing the pattern from #240/#241 to #247). **Cross-pinpoint-synthesis-fusion-shape META-cluster: 3 members (#238 founder + #244 + #247) — confirming the META-cluster as a GROWING-DOCTRINE rather than a CONTINUING-PATTERN that stopped at 2 members after #244, AND establishing it as the SECOND META-cluster after Tool-locality-axis (5 members per #241) to confirm GROWING-DOCTRINE status across multiple cycles.** Multi-modal-input-fusion-on-USER-INPUT-side sub-cluster within the Cross-pinpoint-synthesis-fusion-shape META-cluster: 1 member (#247 alone, founder, FIRST cross-axis synthesis with BOTH fused axes being USER-INPUT-side modalities). Cross-modal-attention-on-USER-INPUT-side cluster: 1 member (#247 alone, founder). Compound-modality-input-on-MessageRequest cluster: 1 member (#247 alone, founder). Two-member-major-provider-only-no-third-party-partner-set sub-cluster: 3 members (#240 + #241 + #247) — confirming the sub-cluster as a CONTINUING-PATTERN beyond the bash + computer-use + text_editor three-tool-companion-bundle and into the compound-modality-input-on-user-turn axis. THREE new clusters founded, plus ONE existing META-cluster grown from 2 to 3 confirming GROWING-DOCTRINE status, plus participation in MULTIPLE inherited clusters. The twelve-layer fusion shape matches #241's twelve-layer count and is tied for the largest single-pinpoint fusion catalogued, but with a distinct axis-set (INPUT-MODALITY-COMPOUND rather than TOOL-COMPANION-BUNDLE-INVERSE-LOCALITY). **#247 catalogues the upstream prerequisite of every voice-narration-while-pointing-at-screenshot agentic-coding affordance** (compound-modality user-input where the user uploads a screenshot AND speaks "what's the bug in this code?" simultaneously — the canonical "ambient pair-programming with voice and screen-share" pattern that gpt-4o-realtime-preview and Gemini 2.0 Flash Exp both ship as first-class typed surfaces but that claw-code structurally cannot model, because the InputContentBlock enum has zero Image variant AND zero Audio variant AND the MessageRequest struct has zero modalities field). The cross-axis synthesis discovery-mode is now confirmed as a STABLE GROWING-DOCTRINE that systematically generalizes across compound-modality / compound-transport / compound-locality axis pairs — establishing the **Cross-pinpoint-synthesis-fusion-shape META-cluster** as the SECOND META-cluster to confirm GROWING-DOCTRINE status (after Tool-locality-axis at 5 members per #241), AND establishing **multi-axis-synthesis-as-cluster-axis** as a continuing pinpoint-discovery-mode that has now demonstrated 1→2→3 member-growth across cycles #383→#389→#390. The next combinatorial cluster-extension space includes compound-modality-on-OUTPUT-side fusion (e.g., assistant-emits-audio + assistant-emits-image in the same response — distinct from #247's USER-INPUT-side fusion), compound-tool-locality fusion (e.g., SERVER-SIDE bash_20250124 + SERVER-SIDE text_editor_20250124 invoked in the same agentic-loop turn — distinct from #240/#241, which catalogue each tool's inverse-locality individually), and compound-transport fusion (e.g., a persistent-WebSocket transport carrying SSE-streaming-tool-call events — distinct from #229's bare WebSocket transport without tool-call-event-multiplexing). Linked to #220 (image-content-block-on-InputContentBlock, the LEFT-axis prerequisite), #225 (audio-content-block-on-InputContentBlock + modalities request-side opt-in, the RIGHT-axis prerequisite), and #244 (Cross-pinpoint-synthesis-fusion-shape META-cluster, the parent META-cluster that #247 grows from 2 to 3 members, confirming GROWING-DOCTRINE status).
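+
+As the closing sketch promised in the required fix shape, covering items (a) through (e): the proposed variants, the `modalities` opt-in, and the compound dispatch signature. Everything below is the pinpoint's proposed shape rather than existing claw-code API; `ProviderFuture`, the source types, and the field layout are assumed stand-ins:
+
+```rust
+// Proposed shapes only -- nothing below exists in rust/crates/api today.
+enum Modality { Image, Audio, Text }                   // fix item (c)
+enum ImageSource { Base64(String) }                    // assumed carrier type
+enum AudioSource { Base64(String) }                    // assumed carrier type
+
+enum InputContentBlock {
+    Text { text: String },
+    Image { source: ImageSource, media_type: String }, // fix item (a), #220
+    Audio { source: AudioSource, media_type: String }, // fix item (b), #225
+}
+
+struct MessageRequest {
+    // ...the thirteen existing optional fields elided...
+    modalities: Option<Vec<Modality>>,                 // fix item (c): compound opt-in
+    content: Vec<InputContentBlock>,                   // fix item (d): the fused messages[].content shape
+}
+
+// Stand-in for the crate's actual async return type.
+type ProviderFuture = std::pin::Pin<Box<dyn std::future::Future<Output = ()>>>;
+
+trait Provider {
+    // fix item (e): route image + audio + text together in ONE user-turn,
+    // i.e. what `claw multimodal-input --image screenshot.png
+    // --audio narration.wav --text "..."` would submit per the acceptance
+    // criterion above.
+    fn dispatch_multi_modal_user_input(
+        &self,
+        image_blocks: Vec<InputContentBlock>,
+        audio_blocks: Vec<InputContentBlock>,
+        text: &str,
+    ) -> ProviderFuture;
+}
+```
+
+🪨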