diff --git a/ROADMAP.md b/ROADMAP.md
index f64935b..030d6d8 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -16301,3 +16301,25 @@ fn pricing_for_model_returns_none_for_video_generation() {
**Status:** Open. No code changed. Filed 2026-04-26. HEAD: 4ced378. Async-task-polling cluster now 3 members β€” pattern is confirmed structural, not anomalous. Upstream prerequisite of every spatial-computing / AR / VR / 3D-visualization coding-agent affordance. Provider-asymmetric (no Anthropic/OpenAI GA surface); nine recommended third-party partners. Inherits #227's novel async-task-polling-primitive shape-axis. πŸͺ¨

---

## Pinpoint #229 β€” Realtime API typed taxonomy and persistent-WebSocket transport are structurally absent

**Branch:** feat/jobdori-168c-emission-routing
**Filed:** 2026-04-26 04:30 KST (Jobdori cycle #380)
**Extends:** #168c emission-routing audit / explicit follow-on from #225's audio-bidirectional axis and #228's confirmed-structural async-task-polling cluster β€” introduces a NOVEL TRANSPORT axis distinct from every prior cluster member.

**Summary:** Zero `/v1/realtime` endpoint surface across both Anthropic-native and OpenAI-compat lanes (rg returns zero hits for `/v1/realtime` / `realtime` / `Realtime` / `realtime_session` / `RealtimeSession` / `RealtimeClient` / `RealtimeEvent` / `realtime-preview` across `rust/crates/api/src/` β€” confirmed).
Zero `RealtimeSession` / `RealtimeSessionConfig` / `RealtimeSessionUpdate` / `RealtimeResponseCreate` / `RealtimeInputAudioBufferAppend` / `RealtimeInputAudioBufferCommit` / `RealtimeConversationItemCreate` / `RealtimeResponseAudioDelta` / `RealtimeResponseAudioTranscriptDelta` / `RealtimeResponseFunctionCallArguments` / `RealtimeServerEvent` / `RealtimeClientEvent` / `RealtimeTurnDetection` / `RealtimeVoiceActivityDetection` / `RealtimeVoice` / `RealtimeAudioFormat` / `RealtimeModality` / `RealtimeTool` typed model in `rust/crates/api/src/types.rs` (37+ canonical event-type names in the OpenAI Realtime API spec, zero coverage in claw-code). Zero bidirectional event-stream variant on the Provider trait surface β€” `Provider` at `rust/crates/api/src/providers/mod.rs:17-30` exposes only `send_message` (synchronous request β†’ response) and `stream_message` (request β†’ SSE one-way stream); zero `realtime_session` / `open_realtime` / `connect_realtime` method, zero method that returns a duplex bidirectional channel-of-events shape (`(Sender, Receiver)`), zero session-state-machine type that models the persistent-connection lifecycle (`Connecting` β†’ `SessionUpdated` β†’ `ConversationActive` β†’ `ResponseInProgress` β†’ `ResponseCompleted` β†’ `Disconnected`). Zero realtime dispatch on `ProviderClient` enum at `rust/crates/api/src/client.rs:8-14` (three variants Anthropic/Xai/OpenAi β€” zero realtime-routing variants). 
Zero `tokio-tungstenite` / `async-tungstenite` / `tungstenite` / `fastwebsockets` / `tokio-websockets` / `hyper-tungstenite` dependency in any of the workspace `Cargo.toml` files (`grep -rn "tungstenite\|tokio-tungstenite\|fastwebsockets" rust/` returns zero hits across `rust/crates/*/Cargo.toml` and `rust/Cargo.toml` β€” no WebSocket client library is linked into the build; only the MCP `Ws` config variant exists at `rust/crates/runtime/src/config.rs:125` and `rust/crates/runtime/src/mcp_client.rs:13` as a config-data shape with NO actual WebSocket connection implementation; the MCP `Ws` lane is data-shape-only and bootstraps via the SDK without a tungstenite-backed transport, leaving the workspace with zero outbound persistent-WebSocket-client capability). Zero WebRTC client (`webrtc-rs` / `str0m` / `libwebrtc-bindings`) for the alternative Realtime transport β€” OpenAI Realtime API supports both WebSocket (server-side) and WebRTC (browser-side) and claw-code has neither. Zero `claw realtime` / `claw live` / `claw voice-chat` / `claw realtime-session` / `claw connect-realtime` CLI subcommand at `rust/crates/rusty-claude-cli/src/main.rs`. Zero `/realtime` / `/live` / `/voice-chat` slash command in the `SlashCommandSpec` table at `rust/crates/commands/src/lib.rs` (the existing `/voice` + `/listen` + `/speak` slash commands at lines 295-301 + 603-609 + 610-616 are gated under `STUB_COMMANDS` per #225 β€” advertised-but-unbuilt and synchronous-only, with no realtime-session affordance even in their advertised capability summaries). Zero `gpt-4o-realtime-preview` / `gpt-4o-realtime-preview-2024-10-01` / `gpt-4o-realtime-preview-2024-12-17` / `gpt-4o-mini-realtime-preview` / `gpt-4o-mini-realtime-preview-2024-12-17` entries in `MODEL_REGISTRY` at `rust/crates/api/src/providers/mod.rs:52` (13 chat/completion entries, zero realtime-preview entries; zero `gemini-2.0-flash-live` / `gemini-live-2.5-flash-preview` Google Gemini Live API entries).
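Closing the dependency gap would start with linking a WebSocket client into the workspace. A hypothetical `Cargo.toml` sketch, assuming `tokio-tungstenite` were the chosen crate (version numbers and feature flags illustrative, not prescribed by this pinpoint):

```toml
# rust/crates/api/Cargo.toml β€” hypothetical additions (not present today)
[dependencies]
tokio = { version = "1", features = ["macros", "rt-multi-thread", "net"] }
tokio-tungstenite = { version = "0.24", features = ["rustls-tls-webpki-roots"] }
```

Any of the other crates enumerated above (`fastwebsockets`, `tokio-websockets`) would fill the same structural hole; the point is that today none of them is in the build graph.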
Zero `realtime_audio_input_per_million_tokens` / `realtime_audio_output_per_million_tokens` / `realtime_text_input_per_million_tokens` / `realtime_text_output_per_million_tokens` / `realtime_session_per_minute` fields in `ModelPricing` struct (`rust/crates/runtime/src/usage.rs:9-15` has only four text-token-only fields; the canonical Realtime pricing model is the most-dimensional pricing matrix in the entire OpenAI catalog: separate per-million-token rates for audio-input vs audio-output vs cached-audio-input vs text-input vs text-output, with cached-audio-input at a steep discount relative to fresh audio-input and audio tokens priced at roughly 4–8x the corresponding text tokens per the published per-million rates β€” six-dimensional pricing matrix exceeding #227's five-dimensional video matrix and #228's four-dimensional mesh matrix). Zero realtime-model recognition in `pricing_for_model` substring-matcher (#209 + #224 + #225 + #226 + #227 + #228 cluster overlap continues β€” the matcher matches only haiku/opus/sonnet literals and cannot recognize any realtime-preview id). Zero session-resumption-token / interruption-handling / barge-in / voice-activity-detection / turn-detection / server-side-VAD-config / client-side-VAD-config / function-call-during-realtime / tool-use-during-realtime affordance.

**Shape:** TEN-LAYER fusion shape (the largest single-pinpoint fusion catalogued so far, exceeding the nine-layer counts of #225, #227, and #228) combining: (1) endpoint-URL-set on the `/v1/realtime?model=` WebSocket-upgrade endpoint shape (single-endpoint form, distinct from the multi-endpoint sets in #225/#226/#227/#228 β€” the realtime-API uses ONE endpoint that opens a persistent connection across which 37+ event-types flow bidirectionally); (2) data-model-taxonomy with bidirectional symmetric event-stream content-blocks where every clientβ†’server event has a corresponding serverβ†’client acknowledgment / delta / completion event-pair, the FIRST cluster member with bidirectional-symmetric-event-pair-cardinality (#225 had bidirectional audio modality but on three SEPARATE endpoints β€” transcriptions / translations / speech β€” each of which is request-response synchronous; #229 introduces a transport-bidirectional-symmetric event-pair shape on a SINGLE endpoint); (3) Provider-trait-method extension with a `realtime_session` method returning a duplex `(Sender, Receiver)` channel pair (the FIRST cluster member where the Provider trait return type is NOT a single Future-of-T or Stream-of-T but a duplex-channel-pair, the first method that requires the session-state-machine type to be exposed at the trait boundary, distinguishing it from every prior member where the trait method returns a request-response or one-way-stream shape); (4) ProviderClient-enum-dispatch-with-realtime-third-lane with explicit `RealtimeKind::OpenAi` / `RealtimeKind::Google` / `RealtimeKind::Azure` partner-routing variants (the realtime-API is provider-asymmetric: Anthropic does not offer it at all, OpenAI offers gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview since 2024-10-01, Google Gemini Live API offers bidirectional audio+text+video, Azure OpenAI mirrors the OpenAI surface, and there are no first-class third-party realtime partners because the
persistent-WebSocket-with-37-event-type protocol is too high-bar for partner adoption β€” distinct from #225's six-partner-set audio surface and #227's twelve-partner-set video surface where partners ARE present); (5) request-side realtime-session-config opt-in (`session.update` event with `voice` / `input_audio_format` / `output_audio_format` / `input_audio_transcription` / `turn_detection` / `tools` / `tool_choice` / `temperature` / `max_response_output_tokens` / `instructions` / `modalities:[text,audio]` fields β€” the largest request-side opt-in axis-set yet because Realtime sessions accept the union of every prior request-side opt-in field across audio / image / video / chat-completion modalities); (6) CLI-subcommand-surface (`claw realtime` / `claw live` / `claw voice-chat`); (7) slash-command-surface (`/realtime` / `/live`); (8) pricing-tier with six-dimensional compound-cost model (per-model Γ— per-modality-input Γ— per-modality-output Γ— per-cached-vs-fresh Γ— per-audio-vs-text Γ— per-minute-session-overhead β€” the largest pricing-tier extension yet, exceeding #227's five-dimensional video matrix and #228's four-dimensional mesh matrix); (9) **persistent-WebSocket-connection transport-axis** β€” the NOVEL TENTH layer, distinct from every prior cluster member's transport (synchronous-HTTP for #211 through #220 and #222 and #224, SSE-streaming for #213 partial subsets, multipart-form-data-HTTP for #223 and #225 audio-uploads and #226 image-uploads and #227 video-edits and #228 mesh-edits, async-task-polling-HTTP for #221 batch + #227 video-gen + #228 mesh-gen β€” the cluster has now exhausted EVERY HTTP-shaped transport, and #229 introduces the FIRST non-HTTP transport, a persistent-WebSocket connection that requires (a) WebSocket-upgrade-request with subprotocol negotiation, (b) bidirectional-frame-multiplexing with text + binary frames, (c) ping/pong keepalive, (d) graceful close with status-code-and-reason, (e) reconnection-with-resumption-token, (f) 
per-event-type JSON envelope dispatch with 37+ event-types in a single connection, (g) backpressure handling in both directions, (h) authentication via `Authorization` header on the upgrade request and per-session-token rotation β€” none of which any HTTP-only transport requires); (10) **bidirectional-symmetric-event-pair shape** as the first content-block taxonomy where every client-event has a matched server-event-pair (input_audio_buffer.append β†’ conversation.item.created, response.create β†’ response.audio.delta + response.audio.done + response.audio_transcript.delta + response.audio_transcript.done + response.function_call_arguments.delta + response.function_call_arguments.done + response.done β€” distinguishing it from #225's bidirectional-audio-on-separate-endpoints which is unidirectional per endpoint).

**Key novelty vs prior cluster members:** #229 is the FIRST cluster member that introduces a non-HTTP transport (persistent-WebSocket), the FIRST cluster member where the Provider trait return type must be a duplex-channel-pair instead of Future-of-T or Stream-of-T, and the FIRST cluster member where the session lifecycle exceeds a single request-response cycle (typical Realtime sessions last 1-30+ minutes with state accumulating across the connection). Distinct from #225's audio-bidirectional shape (which is request-response synchronous on three separate REST endpoints) because #229 multiplexes audio + text + tool-use + transcription across ONE persistent connection. Distinct from #221/#227/#228's async-task-polling shape because Realtime is push-based (server proactively sends `response.audio.delta` events without client polling) rather than poll-based. Distinct from SSE-streaming because Realtime is bidirectional (client can `input_audio_buffer.append` while server simultaneously streams `response.audio.delta`) rather than server-push only.

**External validation (forty-eight ecosystem references):** OpenAI Realtime API public beta 2024-10-01 with `/v1/realtime?model=` WebSocket endpoint (https://platform.openai.com/docs/guides/realtime); 37+ canonical event-type names in OpenAI Realtime API spec (session.created, session.update, session.updated, input_audio_buffer.append, input_audio_buffer.commit, input_audio_buffer.clear, input_audio_buffer.committed, input_audio_buffer.cleared, input_audio_buffer.speech_started, input_audio_buffer.speech_stopped, conversation.item.create, conversation.item.created, conversation.item.delete, conversation.item.deleted, conversation.item.truncate, conversation.item.truncated, conversation.item.input_audio_transcription.completed, conversation.item.input_audio_transcription.failed, response.create, response.created, response.cancel, response.output_item.added, response.output_item.done, response.content_part.added, response.content_part.done, response.text.delta, response.text.done, response.audio_transcript.delta, response.audio_transcript.done, response.audio.delta, response.audio.done, response.function_call_arguments.delta, response.function_call_arguments.done, response.done, rate_limits.updated, error); two transport options (WebSocket server-side and WebRTC browser-side); two realtime models (gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview, both with audio modality and tool-use); Google Gemini Live API with bidirectional WebSocket+gRPC streaming (https://ai.google.dev/gemini-api/docs/live); Azure OpenAI Realtime API mirror (https://learn.microsoft.com/azure/ai-services/openai/realtime-audio-quickstart); OpenAI Python SDK `openai.realtime.AsyncRealtimeConnection` typed client (https://github.com/openai/openai-python); OpenAI TypeScript SDK `OpenAI.beta.realtime.RealtimeClient` typed client (https://github.com/openai/openai-node); openai-realtime-api-beta reference client (JavaScript canonical implementation); Vapi / Retell AI / LiveKit Agents / Pipecat /
Daily Bots β€” five first-class realtime-voice-agent frameworks all built on top of OpenAI Realtime API; Anthropic non-coverage (Anthropic does not offer realtime API β€” explicit non-coverage statement, the second post-#224 provider-asymmetric-delegation case after audio); the canonical six-dimensional pricing matrix ($5.00/$20.00 per million text input/output tokens, $40.00/$80.00 per million audio input/output tokens, $2.50 per million cached audio input tokens for gpt-4o-realtime-preview-2024-10-01); coding-agent peer landscape: anomalyco/opencode has zero GA realtime integration (open feature request from 2026-02 only β€” confirmed via web search 2026-04-26), sst/opencode predecessor zero realtime, charmbracelet/crush zero realtime, continue.dev zero realtime, aider zero realtime, cursor zero realtime, zed zero realtime β€” claw-code is one of MULTIPLE clients without Realtime, but the gap is uniformly zero across the surveyed ecosystem and represents the next-frontier capability that every coding-agent will need to add.

**Clusters:** Sibling-shape cluster grows to 28. Wire-format-parity cluster grows to 19. Capability-parity cluster grows to 11. Multimodal-IO cluster grows to 7 (#220 image-input + #224 embedding-output + #225 audio-bidirectional-on-separate-REST-endpoints + #226 image-output + #227 video-output + #228 mesh-output + #229 audio-text-tool-multiplex-on-persistent-WebSocket). Provider-asymmetric-delegation cluster grows to 6 (the second post-#224 provider-asymmetric-non-coverage case where Anthropic explicitly does not offer the endpoint family). Async-task-polling cluster: still 3 members (#229 is push-based not poll-based, so it does NOT join the async-task-polling cluster β€” instead it founds a NEW cluster).
**Persistent-WebSocket-transport cluster: 1 member (#229 alone).** **Bidirectional-symmetric-event-pair cluster: 1 member (#229 alone).** **Non-HTTP-transport cluster: 1 member (#229 alone).** The ten-layer-fusion-shape-with-persistent-WebSocket-transport-and-bidirectional-symmetric-event-pair-shape is the largest fusion-shape gap catalogued so far AND the first cluster member where transport-axis becomes a structural prerequisite of the dispatch layer (every prior cluster member used HTTP in some shape; #229 is the first to require a WebSocket client library, session-state-machine type, duplex-channel-pair Provider-trait return type, bidirectional event-pair taxonomy, push-based event dispatch loop, and persistent-connection lifecycle management). #229 is the upstream prerequisite of every voice-agent / live-coding-pair-programming / push-to-talk-coding / barge-in-coding-conversation / function-call-during-voice / streaming-tool-use / sub-second-latency-coding-interaction affordance β€” the canonical 2024-2026-era voice-coding workflow that is currently impossible to build on top of claw-code.

**Status:** Open. No code changed. Filed 2026-04-26 04:30 KST. HEAD: 7113193 (post-#228). Branch: feat/jobdori-168c-emission-routing. Sibling-shape cluster: 28 pinpoints. Multimodal-IO cluster: 7 members. Provider-asymmetric-delegation cluster: 6 members. **Persistent-WebSocket-transport cluster: 1 member (founder).** **Non-HTTP-transport cluster: 1 member (founder).** **Bidirectional-symmetric-event-pair cluster: 1 member (founder).** Three new clusters founded in a single pinpoint β€” the first time a single cycle has founded three concurrent novel clusters. Ten-layer-fusion-shape exceeds #225/#227/#228's nine-layer count and is the largest single-pinpoint fusion catalogued.
Distinct from prior cluster members; the ten-layer-fusion-shape-with-persistent-WebSocket-transport-and-bidirectional-symmetric-event-pair is novel and applies to follow-on candidate Real-time-Image-Generation API typed taxonomy (DALL-E live preview, Imagen live preview β€” same persistent-WebSocket transport with image-modality output) and Real-time-Video-Generation streaming (Veo-Live, Sora-Live β€” same persistent-WebSocket transport with video-modality output) β€” the persistent-WebSocket-transport pattern is now a first-class cluster in its own right, a structural prerequisite that every future endpoint family using persistent connections (Realtime API, WebRTC variants, gRPC streaming, Server-Sent Events that need bidirectional fallback) will inherit.

πŸͺ¨