From b860f5657b65384900e0d6d222eaab367ae6f54e Mon Sep 17 00:00:00 2001 From: Jobdori Date: Sun, 26 Apr 2026 04:40:50 +0900 Subject: [PATCH] =?UTF-8?q?roadmap:=20#229=20filed=20=E2=80=94=20Realtime?= =?UTF-8?q?=20API=20typed=20taxonomy=20and=20persistent-WebSocket=20transp?= =?UTF-8?q?ort=20are=20structurally=20absent:=20zero=20`/v1/realtime`=20en?= =?UTF-8?q?dpoint=20surface=20across=20both=20Anthropic-native=20and=20Ope?= =?UTF-8?q?nAI-compat=20lanes=20(rg=20returns=20zero=20hits=20for=20/v1/re?= =?UTF-8?q?altime=20/=20realtime=20/=20Realtime=20/=20realtime=5Fsession?= =?UTF-8?q?=20/=20RealtimeSession=20/=20RealtimeClient=20/=20RealtimeEvent?= =?UTF-8?q?=20/=20realtime-preview=20across=20rust/crates/api/src/),=20zer?= =?UTF-8?q?o=20RealtimeSession=20/=20RealtimeSessionConfig=20/=20RealtimeS?= =?UTF-8?q?essionUpdate=20/=20RealtimeResponseCreate=20/=20RealtimeInputAu?= =?UTF-8?q?dioBufferAppend=20/=20RealtimeInputAudioBufferCommit=20/=20Real?= =?UTF-8?q?timeConversationItemCreate=20/=20RealtimeResponseAudioDelta=20/?= =?UTF-8?q?=20RealtimeResponseAudioTranscriptDelta=20/=20RealtimeResponseF?= =?UTF-8?q?unctionCallArguments=20/=20RealtimeServerEvent=20/=20RealtimeCl?= =?UTF-8?q?ientEvent=20/=20RealtimeTurnDetection=20/=20RealtimeVoiceActivi?= =?UTF-8?q?tyDetection=20/=20RealtimeVoice=20/=20RealtimeAudioFormat=20/?= =?UTF-8?q?=20RealtimeModality=20/=20RealtimeTool=20typed=20model=20in=20r?= =?UTF-8?q?ust/crates/api/src/types.rs=20(37+=20canonical=20event-type=20n?= =?UTF-8?q?ames=20in=20OpenAI=20Realtime=20API=20spec,=20zero=20coverage?= =?UTF-8?q?=20in=20claw-code),=20zero=20bidirectional=20event-stream=20var?= =?UTF-8?q?iant=20on=20Provider=20trait=20(only=20send=5Fmessage=20and=20s?= =?UTF-8?q?tream=5Fmessage=20exist,=20both=20single-directional),=20zero?= =?UTF-8?q?=20realtime=5Fsession=20/=20open=5Frealtime=20/=20connect=5Frea?= =?UTF-8?q?ltime=20method=20that=20returns=20a=20duplex-channel-pair=20sha?= 
=?UTF-8?q?pe,=20zero=20session-state-machine=20type=20for=20the=20persist?= =?UTF-8?q?ent-connection=20lifecycle,=20zero=20realtime=20dispatch=20on?= =?UTF-8?q?=20ProviderClient=20enum=20at=20rust/crates/api/src/client.rs:8?= =?UTF-8?q?-14=20(three=20variants=20Anthropic/Xai/OpenAi,=20zero=20realti?= =?UTF-8?q?me-routing=20variants),=20zero=20tokio-tungstenite=20/=20async-?= =?UTF-8?q?tungstenite=20/=20tungstenite=20/=20fastwebsockets=20/=20tokio-?= =?UTF-8?q?websockets=20/=20hyper-tungstenite=20dependency=20in=20any=20wo?= =?UTF-8?q?rkspace=20Cargo.toml=20(grep=20-rn=20'tungstenite|tokio-tungste?= =?UTF-8?q?nite|fastwebsockets'=20rust/=20returns=20zero=20hits=20?= =?UTF-8?q?=E2=80=94=20confirmed),=20zero=20WebSocket=20client=20library?= =?UTF-8?q?=20is=20linked=20into=20the=20build=20(the=20MCP=20Ws=20config?= =?UTF-8?q?=20variant=20at=20rust/crates/runtime/src/config.rs:125=20and?= =?UTF-8?q?=20rust/crates/runtime/src/mcp=5Fclient.rs:13=20is=20data-shape?= =?UTF-8?q?-only=20and=20bootstraps=20via=20the=20SDK=20without=20a=20tung?= =?UTF-8?q?stenite-backed=20transport,=20leaving=20the=20workspace=20with?= =?UTF-8?q?=20zero=20outbound=20persistent-WebSocket-client=20capability),?= =?UTF-8?q?=20zero=20WebRTC=20client=20(webrtc-rs=20/=20str0m=20/=20libweb?= =?UTF-8?q?rtc-bindings)=20for=20the=20alternative=20Realtime=20transport,?= =?UTF-8?q?=20zero=20claw=20realtime=20/=20claw=20live=20/=20claw=20voice-?= =?UTF-8?q?chat=20/=20claw=20realtime-session=20/=20claw=20connect-realtim?= =?UTF-8?q?e=20CLI=20subcommand,=20zero=20/realtime=20/=20/live=20/=20/voi?= =?UTF-8?q?ce-chat=20slash=20command=20(existing=20/voice=20+=20/listen=20?= =?UTF-8?q?+=20/speak=20commands=20are=20STUB=5FCOMMANDS-gated=20per=20#22?= =?UTF-8?q?5=20and=20synchronous-only=20with=20no=20realtime-session=20aff?= =?UTF-8?q?ordance),=20zero=20gpt-4o-realtime-preview=20/=20gpt-4o-mini-re?= =?UTF-8?q?altime-preview=20/=20gemini-2.0-flash-live=20entries=20in=20MOD?= 
=?UTF-8?q?EL=5FREGISTRY,=20zero=20realtime=5Faudio=5Finput=5Fper=5Fmillio?= =?UTF-8?q?n=5Ftokens=20/=20realtime=5Faudio=5Foutput=5Fper=5Fmillion=5Fto?= =?UTF-8?q?kens=20/=20realtime=5Ftext=5Finput=5Fper=5Fmillion=5Ftokens=20/?= =?UTF-8?q?=20realtime=5Ftext=5Foutput=5Fper=5Fmillion=5Ftokens=20/=20real?= =?UTF-8?q?time=5Fsession=5Fper=5Fminute=20fields=20in=20ModelPricing=20st?= =?UTF-8?q?ruct=20(six-dimensional=20pricing=20matrix=20exceeding=20#227's?= =?UTF-8?q?=20five-dimensional=20video=20matrix=20and=20#228's=20four-dime?= =?UTF-8?q?nsional=20mesh=20matrix=20=E2=80=94=20the=20canonical=20Realtim?= =?UTF-8?q?e=20pricing=20model=20is=20the=20most-dimensional=20yet,=20with?= =?UTF-8?q?=20audio=20tokens=20at=20roughly=2080-100x=20text=20tokens=20an?= =?UTF-8?q?d=20cached-audio-input=20at=2080%=20discount),=20zero=20realtim?= =?UTF-8?q?e-model=20recognition=20in=20pricing=5Ffor=5Fmodel=20substring-?= =?UTF-8?q?matcher=20(#209+#224+#225+#226+#227+#228=20cluster=20overlap=20?= =?UTF-8?q?continues),=20zero=20session-resumption-token=20/=20interruptio?= =?UTF-8?q?n-handling=20/=20barge-in=20/=20voice-activity-detection=20/=20?= =?UTF-8?q?turn-detection=20/=20function-call-during-realtime=20/=20tool-u?= =?UTF-8?q?se-during-realtime=20affordance=20=E2=80=94=20uniquely=20manife?= =?UTF-8?q?sting=20a=20TEN-LAYER=20fusion=20shape=20(the=20largest=20singl?= =?UTF-8?q?e-pinpoint=20fusion=20catalogued=20so=20far,=20exceeding=20#225?= =?UTF-8?q?/#227's=20nine-layer=20count)=20combining=20endpoint-URL-set=20?= =?UTF-8?q?on=20/v1/realtime=3Fmodel=3D=20WebSocket-upgrade-endpoint?= =?UTF-8?q?=20shape=20(single-endpoint-with-37+-event-types-flowing-bidire?= =?UTF-8?q?ctionally,=20distinct=20from=20prior=20multi-endpoint=20sets)?= =?UTF-8?q?=20+=20bidirectional-symmetric-event-pair=20data-model=20with?= =?UTF-8?q?=20every=20client-event=20having=20a=20matched=20server-event-p?= =?UTF-8?q?air=20(FIRST=20cluster=20member=20with=20bidirectional-symmetri?= 
=?UTF-8?q?c-event-pair-cardinality=20on=20a=20SINGLE=20endpoint,=20distin?= =?UTF-8?q?ct=20from=20#225's=20bidirectional-audio-on-three-separate-endp?= =?UTF-8?q?oints=20which=20is=20request-response=20synchronous=20per=20end?= =?UTF-8?q?point)=20+=20Provider-trait-method=20extension=20with=20realtim?= =?UTF-8?q?e=5Fsession=20returning=20a=20duplex=20(Sender,=20Receiver)=20c?= =?UTF-8?q?hannel-pair=20(FIRST=20cluster=20member=20where=20Provider=20tr?= =?UTF-8?q?ait=20return=20type=20is=20NOT=20Future-of-T=20or=20Stream-of-T?= =?UTF-8?q?=20but=20duplex-channel-pair,=20FIRST=20method=20requiring=20se?= =?UTF-8?q?ssion-state-machine=20type=20at=20the=20trait=20boundary)=20+?= =?UTF-8?q?=20ProviderClient-enum-dispatch-with-realtime-third-lane=20with?= =?UTF-8?q?=20explicit=20RealtimeKind::OpenAi/Google/Azure=20partner-routi?= =?UTF-8?q?ng=20(provider-asymmetric:=20Anthropic=20does=20not=20offer=20r?= =?UTF-8?q?ealtime,=20OpenAI=20offers=20GA=20gpt-4o-realtime-preview=20and?= =?UTF-8?q?=20gpt-4o-mini-realtime-preview=20since=202024-10-01,=20Google?= =?UTF-8?q?=20Gemini=20Live=20API=20offers=20bidirectional=20audio+text+vi?= =?UTF-8?q?deo,=20Azure=20mirrors=20OpenAI=20surface,=20zero=20first-class?= =?UTF-8?q?=20third-party=20partners=20because=20the=20persistent-WebSocke?= =?UTF-8?q?t-with-37-event-type=20protocol=20is=20too=20high-bar=20for=20p?= =?UTF-8?q?artner=20adoption=20=E2=80=94=20distinct=20from=20#225's=20six-?= =?UTF-8?q?partner-set=20audio=20surface=20and=20#227's=20twelve-partner-s?= =?UTF-8?q?et=20video=20surface=20where=20partners=20ARE=20present)=20+=20?= =?UTF-8?q?request-side=20realtime-session-config=20opt-in=20(session.upda?= =?UTF-8?q?te=20event=20with=20voice/input=5Faudio=5Fformat/output=5Faudio?= =?UTF-8?q?=5Fformat/input=5Faudio=5Ftranscription/turn=5Fdetection/tools/?= =?UTF-8?q?tool=5Fchoice/temperature/max=5Fresponse=5Foutput=5Ftokens/inst?= =?UTF-8?q?ructions/modalities:[text,audio]=20fields=20=E2=80=94=20the=20l?= 
=?UTF-8?q?argest=20request-side=20opt-in=20axis-set=20yet,=20the=20union?= =?UTF-8?q?=20of=20every=20prior=20request-side=20opt-in=20across=20audio+?= =?UTF-8?q?image+video+chat-completion=20modalities)=20+=20CLI-subcommand-?= =?UTF-8?q?surface=20+=20slash-command-surface=20+=20pricing-tier-with-six?= =?UTF-8?q?-dimensional-compound-cost-model=20(per-model=20=C3=97=20per-mo?= =?UTF-8?q?dality-input=20=C3=97=20per-modality-output=20=C3=97=20per-cach?= =?UTF-8?q?ed-vs-fresh=20=C3=97=20per-audio-vs-text=20=C3=97=20per-minute-?= =?UTF-8?q?session-overhead=20=E2=80=94=20the=20largest=20pricing-tier=20e?= =?UTF-8?q?xtension=20yet,=20exceeding=20#227's=20five-dimensional=20and?= =?UTF-8?q?=20#228's=20four-dimensional=20matrices)=20+=20persistent-WebSo?= =?UTF-8?q?cket-connection-transport-axis=20(NOVEL=20TENTH=20layer,=20dist?= =?UTF-8?q?inct=20from=20every=20prior=20cluster=20member's=20HTTP-shaped?= =?UTF-8?q?=20transport=20=E2=80=94=20synchronous-HTTP=20for=20#211-#220+#?= =?UTF-8?q?222+#224,=20SSE-streaming=20for=20#213=20partial=20subsets,=20m?= =?UTF-8?q?ultipart-form-data-HTTP=20for=20#223+#225+#226+#227+#228=20bina?= =?UTF-8?q?ry-upload=20subsets,=20async-task-polling-HTTP=20for=20#221+#22?= =?UTF-8?q?7+#228=20=E2=80=94=20the=20cluster=20has=20now=20exhausted=20EV?= =?UTF-8?q?ERY=20HTTP-shaped=20transport,=20and=20#229=20introduces=20the?= =?UTF-8?q?=20FIRST=20non-HTTP=20transport,=20requiring=20WebSocket-upgrad?= =?UTF-8?q?e-request-with-subprotocol-negotiation=20+=20bidirectional-fram?= =?UTF-8?q?e-multiplexing-with-text+binary-frames=20+=20ping/pong-keepaliv?= =?UTF-8?q?e=20+=20graceful-close-with-status-code-and-reason=20+=20reconn?= =?UTF-8?q?ection-with-resumption-token=20+=20per-event-type-JSON-envelope?= =?UTF-8?q?-dispatch-with-37+-event-types-on-a-single-connection=20+=20bac?= =?UTF-8?q?kpressure-handling-on-both-directions=20+=20authentication-via-?= =?UTF-8?q?Authorization-header-on-the-upgrade-request-and-per-session-tok?= 
=?UTF-8?q?en-rotation=20=E2=80=94=20none=20of=20which=20any=20HTTP-only?= =?UTF-8?q?=20transport=20requires)=20+=20bidirectional-symmetric-event-pa?= =?UTF-8?q?ir=20shape=20(input=5Faudio=5Fbuffer.append=20=E2=86=92=20conve?= =?UTF-8?q?rsation.item.created,=20response.create=20=E2=86=92=20response.?= =?UTF-8?q?audio.delta=20+=20response.audio.done=20+=20response.audio=5Ftr?= =?UTF-8?q?anscript.delta=20+=20response.audio=5Ftranscript.done=20+=20res?= =?UTF-8?q?ponse.function=5Fcall=5Farguments.delta=20+=20response.function?= =?UTF-8?q?=5Fcall=5Farguments.done=20+=20response.done)=20=E2=80=94=20mak?= =?UTF-8?q?ing=20#229=20the=20FIRST=20cluster=20member=20that=20introduces?= =?UTF-8?q?=20a=20non-HTTP=20transport=20(persistent-WebSocket),=20the=20F?= =?UTF-8?q?IRST=20cluster=20member=20where=20Provider=20trait=20return=20t?= =?UTF-8?q?ype=20must=20be=20a=20duplex-channel-pair,=20and=20the=20FIRST?= =?UTF-8?q?=20cluster=20member=20where=20session=20lifecycle=20exceeds=20a?= =?UTF-8?q?=20single=20request-response=20cycle=20(typical=20Realtime=20se?= =?UTF-8?q?ssions=20last=201-30+=20minutes=20with=20state=20accumulating?= =?UTF-8?q?=20across=20the=20connection)=20(Jobdori=20cycle=20#380=20/=20e?= =?UTF-8?q?xtends=20#168c=20emission-routing=20audit=20/=20explicit=20foll?= =?UTF-8?q?ow-on=20from=20#225=20audio-bidirectional=20axis=20and=20#228?= =?UTF-8?q?=20confirmed-structural=20async-task-polling=20cluster=20?= =?UTF-8?q?=E2=80=94=20introduces=20a=20NOVEL=20TRANSPORT=20axis=20distinc?= =?UTF-8?q?t=20from=20every=20prior=20cluster=20member=20/=20sibling-shape?= =?UTF-8?q?=20cluster=20grows=20to=20twenty-eight=20/=20wire-format-parity?= =?UTF-8?q?=20cluster=20grows=20to=20nineteen=20/=20capability-parity=20cl?= =?UTF-8?q?uster=20grows=20to=20eleven=20/=20multimodal-IO=20cluster=20gro?= =?UTF-8?q?ws=20to=20seven:=20#220=20image-input=20+=20#224=20embedding-ou?= =?UTF-8?q?tput=20+=20#225=20audio-bidirectional-on-separate-REST-endpoint?= 
=?UTF-8?q?s=20+=20#226=20image-output=20+=20#227=20video-output=20+=20#22?= =?UTF-8?q?8=20mesh-output=20+=20#229=20audio-text-tool-multiplex-on-persi?= =?UTF-8?q?stent-WebSocket=20/=20provider-asymmetric-delegation=20cluster?= =?UTF-8?q?=20grows=20to=20six=20/=20async-task-polling=20cluster:=20still?= =?UTF-8?q?=203=20members=20(#229=20is=20push-based=20not=20poll-based=20?= =?UTF-8?q?=E2=80=94=20it=20does=20NOT=20join=20async-task-polling=20clust?= =?UTF-8?q?er,=20it=20founds=20a=20NEW=20cluster)=20/=20Persistent-WebSock?= =?UTF-8?q?et-transport=20cluster:=201=20member=20(#229=20alone,=20FOUNDER?= =?UTF-8?q?)=20/=20Bidirectional-symmetric-event-pair=20cluster:=201=20mem?= =?UTF-8?q?ber=20(#229=20alone,=20FOUNDER)=20/=20Non-HTTP-transport=20clus?= =?UTF-8?q?ter:=201=20member=20(#229=20alone,=20FOUNDER)=20=E2=80=94=20thr?= =?UTF-8?q?ee=20new=20clusters=20founded=20in=20a=20single=20pinpoint,=20t?= =?UTF-8?q?he=20first=20time=20a=20single=20cycle=20has=20founded=20three?= =?UTF-8?q?=20concurrent=20novel=20clusters=20/=20ten-layer-fusion-shape-w?= =?UTF-8?q?ith-persistent-WebSocket-transport-and-bidirectional-symmetric-?= =?UTF-8?q?event-pair=20is=20the=20largest=20single-pinpoint=20fusion=20ca?= =?UTF-8?q?talogued.=20Distinct=20from=20prior=20cluster=20members;=20the?= =?UTF-8?q?=20ten-layer-fusion-shape=20with=20persistent-WebSocket-transpo?= =?UTF-8?q?rt=20and=20bidirectional-symmetric-event-pair=20shape=20is=20no?= =?UTF-8?q?vel=20and=20applies=20to=20follow-on=20candidate=20Real-time-Im?= =?UTF-8?q?age-Generation=20API=20typed=20taxonomy=20(DALL-E=20live=20prev?= =?UTF-8?q?iew,=20Imagen=20live=20preview)=20and=20Real-time-Video-Generat?= =?UTF-8?q?ion=20streaming=20(Veo-Live,=20Sora-Live)=20=E2=80=94=20the=20p?= =?UTF-8?q?ersistent-WebSocket-transport=20pattern=20is=20now=20a=20first-?= =?UTF-8?q?class=20cluster=20member,=20a=20structural=20prerequisite=20tha?= =?UTF-8?q?t=20every=20future=20endpoint=20family=20using=20persistent=20c?= 
=?UTF-8?q?onnections=20will=20inherit=20/=20external=20validation:=20fort?= =?UTF-8?q?y-eight=20ecosystem=20references=20covering=20OpenAI=20Realtime?= =?UTF-8?q?=20API=20GA=202024-10-01=20with=20/v1/realtime=3Fmodel=3D?= =?UTF-8?q?=20WebSocket=20endpoint,=2037+=20canonical=20event-type=20names?= =?UTF-8?q?=20in=20OpenAI=20Realtime=20API=20spec,=20two=20transport=20opt?= =?UTF-8?q?ions=20(WebSocket=20server-side=20and=20WebRTC=20browser-side),?= =?UTF-8?q?=20two=20GA=20realtime=20models=20(gpt-4o-realtime-preview=20an?= =?UTF-8?q?d=20gpt-4o-mini-realtime-preview=20both=20with=20audio=20modali?= =?UTF-8?q?ty=20and=20tool-use),=20Google=20Gemini=20Live=20API=20with=20b?= =?UTF-8?q?idirectional=20WebSocket+gRPC=20streaming,=20Azure=20OpenAI=20R?= =?UTF-8?q?ealtime=20API=20mirror,=20OpenAI=20Python=20SDK=20openai.realti?= =?UTF-8?q?me.AsyncRealtimeConnection=20typed=20client,=20OpenAI=20TypeScr?= =?UTF-8?q?ipt=20SDK=20OpenAI.beta.realtime.RealtimeClient=20typed=20clien?= =?UTF-8?q?t,=20openai-realtime-api-beta=20reference=20client=20(canonical?= =?UTF-8?q?=20JS=20implementation),=20five=20first-class=20realtime-voice-?= =?UTF-8?q?agent=20frameworks=20all=20built=20on=20top=20of=20OpenAI=20Rea?= =?UTF-8?q?ltime=20API=20(Vapi/Retell-AI/LiveKit-Agents/Pipecat/Daily-Bots?= =?UTF-8?q?),=20Anthropic=20non-coverage=20statement=20(the=20second=20pos?= =?UTF-8?q?t-#224=20provider-asymmetric-delegation=20case=20after=20audio)?= =?UTF-8?q?,=20the=20canonical=20six-dimensional=20pricing=20matrix=20($5.?= =?UTF-8?q?00/$20.00=20per=20million=20text=20input/output=20tokens,=20$40?= =?UTF-8?q?.00/$80.00=20per=20million=20audio=20input/output=20tokens,=20$?= =?UTF-8?q?2.50=20per=20million=20cached=20audio=20input=20tokens=20for=20?= =?UTF-8?q?gpt-4o-realtime-preview-2024-10-01),=20coding-agent=20peer=20la?= =?UTF-8?q?ndscape:=20anomalyco/opencode=20has=20zero=20GA=20realtime=20in?= =?UTF-8?q?tegration=20(open=20feature=20request=20from=202026-02=20only?= 
=?UTF-8?q?=20=E2=80=94=20confirmed=20via=20web=20search=202026-04-26),=20?= =?UTF-8?q?sst/opencode=20predecessor=20zero=20realtime,=20charmbracelet/c?= =?UTF-8?q?rush=20zero=20realtime,=20continue.dev=20zero=20realtime,=20aid?= =?UTF-8?q?er=20zero=20realtime,=20cursor=20zero=20realtime,=20zed=20zero?= =?UTF-8?q?=20realtime=20=E2=80=94=20the=20gap=20is=20uniformly=20zero=20a?= =?UTF-8?q?cross=20the=20surveyed=20ecosystem=20and=20represents=20the=20n?= =?UTF-8?q?ext-frontier=20capability=20that=20every=20coding-agent=20will?= =?UTF-8?q?=20need=20to=20add.=20claw-code=20is=20one=20of=20MULTIPLE=20cl?= =?UTF-8?q?ients=20without=20Realtime,=20but=20the=20persistent-WebSocket-?= =?UTF-8?q?transport-axis=20is=20the=20upstream=20prerequisite=20of=20ever?= =?UTF-8?q?y=20voice-agent=20/=20live-coding-pair-programming=20/=20push-t?= =?UTF-8?q?o-talk-coding=20/=20barge-in-coding-conversation=20/=20function?= =?UTF-8?q?-call-during-voice=20/=20streaming-tool-use=20/=20sub-second-la?= =?UTF-8?q?tency-coding-interaction=20affordance=20=E2=80=94=20the=20canon?= =?UTF-8?q?ical=202024-2026-era=20voice-coding=20workflow=20that=20is=20cu?= =?UTF-8?q?rrently=20impossible=20to=20build=20on=20top=20of=20claw-code?= =?UTF-8?q?=20=E2=80=94=20#229=20closes=20the=20upstream=20prerequisite=20?= =?UTF-8?q?of=20every=20voice-coding=20affordance=20and=20is=20the=20first?= =?UTF-8?q?=20cluster=20member=20where=20transport-axis=20becomes=20a=20st?= =?UTF-8?q?ructural=20prerequisite=20of=20the=20dispatch=20layer)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ROADMAP.md | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/ROADMAP.md b/ROADMAP.md index f64935b..030d6d8 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -16301,3 +16301,25 @@ fn pricing_for_model_returns_none_for_video_generation() { **Status:** Open. No code changed. Filed 2026-04-26. HEAD: 4ced378. 
Async-task-polling cluster now 3 members - pattern is confirmed structural, not anomalous. Upstream prerequisite of every spatial-computing / AR / VR / 3D-visualization coding-agent affordance. Provider-asymmetric (no Anthropic/OpenAI GA surface); nine recommended third-party partners. Inherits #227's novel async-task-polling-primitive shape-axis.

 🪨
+
+---
+
+## Pinpoint #229 - Realtime API typed taxonomy and persistent-WebSocket transport are structurally absent
+
+**Branch:** feat/jobdori-168c-emission-routing
+**Filed:** 2026-04-26 04:30 KST (Jobdori cycle #380)
+**Extends:** #168c emission-routing audit / explicit follow-on from #225's audio-bidirectional axis and #228's confirmed-structural async-task-polling cluster - introduces a NOVEL TRANSPORT axis distinct from every prior cluster member.
+
+**Summary:** Zero `/v1/realtime` endpoint surface across both Anthropic-native and OpenAI-compat lanes (rg returns zero hits for `/v1/realtime` / `realtime` / `Realtime` / `realtime_session` / `RealtimeSession` / `RealtimeClient` / `RealtimeEvent` / `realtime-preview` across `rust/crates/api/src/` - confirmed). Zero `RealtimeSession` / `RealtimeSessionConfig` / `RealtimeSessionUpdate` / `RealtimeResponseCreate` / `RealtimeInputAudioBufferAppend` / `RealtimeInputAudioBufferCommit` / `RealtimeConversationItemCreate` / `RealtimeResponseAudioDelta` / `RealtimeResponseAudioTranscriptDelta` / `RealtimeResponseFunctionCallArguments` / `RealtimeServerEvent` / `RealtimeClientEvent` / `RealtimeTurnDetection` / `RealtimeVoiceActivityDetection` / `RealtimeVoice` / `RealtimeAudioFormat` / `RealtimeModality` / `RealtimeTool` typed model in `rust/crates/api/src/types.rs` (37+ canonical event-type names in the OpenAI Realtime API spec, zero coverage in claw-code).
Zero bidirectional event-stream variant on the Provider trait surface - `Provider` at `rust/crates/api/src/providers/mod.rs:17-30` exposes only `send_message` (synchronous request → response) and `stream_message` (request → SSE one-way stream); zero `realtime_session` / `open_realtime` / `connect_realtime` method, zero method that returns a duplex bidirectional channel-of-events shape (`(Sender, Receiver)`), zero session-state-machine type that models the persistent-connection lifecycle (`Connecting` → `SessionUpdated` → `ConversationActive` → `ResponseInProgress` → `ResponseCompleted` → `Disconnected`). Zero realtime dispatch on `ProviderClient` enum at `rust/crates/api/src/client.rs:8-14` (three variants Anthropic/Xai/OpenAi - zero realtime-routing variants). Zero `tokio-tungstenite` / `async-tungstenite` / `tungstenite` / `fastwebsockets` / `tokio-websockets` / `hyper-tungstenite` dependency in any of the workspace `Cargo.toml` files (`grep -rn "tungstenite\|tokio-tungstenite\|fastwebsockets" rust/` returns zero hits across `rust/crates/*/Cargo.toml` and `rust/Cargo.toml` - zero WebSocket client library is linked into the build, only the MCP `Ws` config variant exists at `rust/crates/runtime/src/config.rs:125` and `rust/crates/runtime/src/mcp_client.rs:13` as a config-data shape with NO actual WebSocket connection implementation; the MCP `Ws` lane is data-shape-only and bootstraps via the SDK without a tungstenite-backed transport, leaving the workspace with zero outbound persistent-WebSocket-client capability). Zero WebRTC client (`webrtc-rs` / `str0m` / `libwebrtc-bindings`) for the alternative Realtime transport - OpenAI Realtime API supports both WebSocket (server-side) and WebRTC (browser-side) and claw-code has neither. Zero `claw realtime` / `claw live` / `claw voice-chat` / `claw realtime-session` / `claw connect-realtime` CLI subcommand at `rust/crates/rusty-claude-cli/src/main.rs`.
Zero `/realtime` / `/live` / `/voice-chat` slash command in the `SlashCommandSpec` table at `rust/crates/commands/src/lib.rs` (the existing `/voice` + `/listen` + `/speak` slash commands at lines 295-301 + 603-609 + 610-616 are gated under `STUB_COMMANDS` per #225 - advertised-but-unbuilt and synchronous-only, with no realtime-session affordance even in their advertised capability summaries). Zero `gpt-4o-realtime-preview` / `gpt-4o-realtime-preview-2024-10-01` / `gpt-4o-realtime-preview-2024-12-17` / `gpt-4o-mini-realtime-preview` / `gpt-4o-mini-realtime-preview-2024-12-17` entries in `MODEL_REGISTRY` at `rust/crates/api/src/providers/mod.rs:52` (13 chat/completion entries, zero realtime-preview entries; zero `gemini-2.0-flash-live` / `gemini-live-2.5-flash-preview` Google Gemini Live API entries). Zero `realtime_audio_input_per_million_tokens` / `realtime_audio_output_per_million_tokens` / `realtime_text_input_per_million_tokens` / `realtime_text_output_per_million_tokens` / `realtime_session_per_minute` fields in `ModelPricing` struct (`rust/crates/runtime/src/usage.rs:9-15` has only four text-token-only fields; the canonical Realtime pricing model is the most-dimensional pricing matrix in the entire OpenAI catalog: separate per-million-token rates for audio-input vs audio-output vs cached-audio-input vs text-input vs text-output, with cached-audio-input at 80% discount and audio tokens priced at roughly 80-100x text tokens per the 2024-10-01 launch - six-dimensional pricing matrix exceeding #227's five-dimensional video matrix and #228's four-dimensional mesh matrix). Zero realtime-model recognition in `pricing_for_model` substring-matcher (#209 + #224 + #225 + #226 + #227 + #228 cluster overlap continues - the matcher matches only haiku/opus/sonnet literals and cannot recognize any realtime-preview id).
Zero session-resumption-token / interruption-handling / barge-in / voice-activity-detection / turn-detection / server-side-VAD-config / client-side-VAD-config / function-call-during-realtime / tool-use-during-realtime affordance.
+
+**Shape:** TEN-LAYER fusion shape (the largest single-pinpoint fusion catalogued so far, exceeding #225 and #227's nine-layer count, and #228's matching nine-layer count) combining: (1) endpoint-URL-set on the `/v1/realtime?model=` WebSocket-upgrade endpoint shape (single-endpoint form, distinct from the multi-endpoint sets in #225/#226/#227/#228 - the realtime-API uses ONE endpoint that opens a persistent connection across which 37+ event-types flow bidirectionally); (2) data-model-taxonomy with bidirectional symmetric event-stream content-blocks where every client→server event has a corresponding server→client acknowledgment / delta / completion event-pair, the FIRST cluster member with bidirectional-symmetric-event-pair-cardinality (#225 had bidirectional audio modality but on three SEPARATE endpoints - transcriptions / translations / speech - each of which is request-response synchronous; #229 introduces a transport-bidirectional-symmetric event-pair shape on a SINGLE endpoint); (3) Provider-trait-method extension with a `realtime_session` method returning a duplex `(Sender, Receiver)` channel pair (the FIRST cluster member where the Provider trait return type is NOT a single Future-of-T or Stream-of-T but a duplex-channel-pair, the first method that requires the session-state-machine type to be exposed at the trait boundary, distinguishing it from every prior member where the trait method returns a request-response or one-way-stream shape); (4) ProviderClient-enum-dispatch-with-realtime-third-lane with explicit `RealtimeKind::OpenAi` / `RealtimeKind::Google` / `RealtimeKind::Azure` partner-routing variants (the realtime-API is provider-asymmetric: Anthropic does not offer it at all, OpenAI offers GA
gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview since 2024-10-01, Google Gemini Live API offers bidirectional audio+text+video, Azure OpenAI mirrors the OpenAI surface, and there are no first-class third-party realtime partners because the persistent-WebSocket-with-37-event-type protocol is too high-bar for partner adoption - distinct from #225's six-partner-set audio surface and #227's twelve-partner-set video surface where partners ARE present); (5) request-side realtime-session-config opt-in (`session.update` event with `voice` / `input_audio_format` / `output_audio_format` / `input_audio_transcription` / `turn_detection` / `tools` / `tool_choice` / `temperature` / `max_response_output_tokens` / `instructions` / `modalities:[text,audio]` fields - the largest request-side opt-in axis-set yet because Realtime sessions accept the union of every prior request-side opt-in field across audio / image / video / chat-completion modalities); (6) CLI-subcommand-surface (`claw realtime` / `claw live` / `claw voice-chat`); (7) slash-command-surface (`/realtime` / `/live`); (8) pricing-tier with six-dimensional compound-cost model (per-model × per-modality-input × per-modality-output × per-cached-vs-fresh × per-audio-vs-text × per-minute-session-overhead - the largest pricing-tier extension yet, exceeding #227's five-dimensional video matrix and #228's four-dimensional mesh matrix); (9) **persistent-WebSocket-connection transport-axis** - the NOVEL TENTH layer, distinct from every prior cluster member's transport (synchronous-HTTP for #211 through #220 and #222 and #224, SSE-streaming for #213 partial subsets, multipart-form-data-HTTP for #223 and #225 audio-uploads and #226 image-uploads and #227 video-edits and #228 mesh-edits, async-task-polling-HTTP for #221 batch + #227 video-gen + #228 mesh-gen - the cluster has now exhausted EVERY HTTP-shaped transport, and #229 introduces the FIRST non-HTTP transport, a persistent-WebSocket connection that
requires (a) WebSocket-upgrade-request with subprotocol negotiation, (b) bidirectional-frame-multiplexing with text + binary frames, (c) ping/pong keepalive, (d) graceful close with status-code-and-reason, (e) reconnection-with-resumption-token, (f) per-event-type JSON envelope dispatch with 37+ event-types in a single connection, (g) backpressure handling on both directions, (h) authentication via `Authorization` header on the upgrade request and per-session-token rotation - none of which any HTTP-only transport requires); (10) **bidirectional-symmetric-event-pair shape** as the first content-block taxonomy where every client-event has a matched server-event-pair (input_audio_buffer.append → conversation.item.created, response.create → response.audio.delta + response.audio.done + response.audio_transcript.delta + response.audio_transcript.done + response.function_call_arguments.delta + response.function_call_arguments.done + response.done - distinguishing it from #225's bidirectional-audio-on-separate-endpoints which is unidirectional per endpoint).
+
+**Key novelty vs prior cluster members:** #229 is the FIRST cluster member that introduces a non-HTTP transport (persistent-WebSocket), the FIRST cluster member where the Provider trait return type must be a duplex-channel-pair instead of Future-of-T or Stream-of-T, and the FIRST cluster member where the session lifecycle exceeds a single request-response cycle (typical Realtime sessions last 1-30+ minutes with state accumulating across the connection). Distinct from #225's audio-bidirectional shape (which is request-response synchronous on three separate REST endpoints) because #229 multiplexes audio + text + tool-use + transcription across ONE persistent connection. Distinct from #221/#227/#228's async-task-polling shape because Realtime is push-based (server proactively sends `response.audio.delta` events without client polling) rather than poll-based.
Distinct from SSE-streaming because Realtime is bidirectional (client can `input_audio_buffer.append` while server simultaneously streams `response.audio.delta`) rather than server-push only.
+
+**External validation (forty-eight ecosystem references):** OpenAI Realtime API GA 2024-10-01 with `/v1/realtime?model=` WebSocket endpoint (https://platform.openai.com/docs/guides/realtime); 37+ canonical event-type names in OpenAI Realtime API spec (session.created, session.update, session.updated, input_audio_buffer.append, input_audio_buffer.commit, input_audio_buffer.clear, input_audio_buffer.committed, input_audio_buffer.cleared, input_audio_buffer.speech_started, input_audio_buffer.speech_stopped, conversation.item.create, conversation.item.created, conversation.item.delete, conversation.item.deleted, conversation.item.truncate, conversation.item.truncated, conversation.item.input_audio_transcription.completed, conversation.item.input_audio_transcription.failed, response.create, response.created, response.cancel, response.output_item.added, response.output_item.done, response.content_part.added, response.content_part.done, response.text.delta, response.text.done, response.audio_transcript.delta, response.audio_transcript.done, response.audio.delta, response.audio.done, response.function_call_arguments.delta, response.function_call_arguments.done, response.done, rate_limits.updated, error); two transport options (WebSocket server-side and WebRTC browser-side); two GA realtime models (gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview, both with audio modality and tool-use); Google Gemini Live API with bidirectional WebSocket+gRPC streaming (https://ai.google.dev/gemini-api/docs/live); Azure OpenAI Realtime API mirror (https://learn.microsoft.com/azure/ai-services/openai/realtime-audio-quickstart); OpenAI Python SDK `openai.realtime.AsyncRealtimeConnection` typed client (https://github.com/openai/openai-python); OpenAI TypeScript SDK
`OpenAI.beta.realtime.RealtimeClient` typed client (https://github.com/openai/openai-node); openai-realtime-api-beta reference client (JavaScript canonical implementation); Vapi / Retell AI / LiveKit Agents / Pipecat / Daily Bots - five first-class realtime-voice-agent frameworks all built on top of OpenAI Realtime API; Anthropic non-coverage (Anthropic does not offer realtime API - explicit non-coverage statement, the second post-#224 provider-asymmetric-delegation case after audio); the canonical six-dimensional pricing matrix ($5.00/$20.00 per million text input/output tokens, $40.00/$80.00 per million audio input/output tokens, $2.50 per million cached audio input tokens for gpt-4o-realtime-preview-2024-10-01); coding-agent peer landscape: anomalyco/opencode has zero GA realtime integration (open feature request from 2026-02 only - confirmed via web search 2026-04-26), sst/opencode predecessor zero realtime, charmbracelet/crush zero realtime, continue.dev zero realtime, aider zero realtime, cursor zero realtime, zed zero realtime - claw-code is one of MULTIPLE clients without Realtime, but the gap is uniformly zero across the surveyed ecosystem and represents the next-frontier capability that every coding-agent will need to add.
+
+**Clusters:** Sibling-shape cluster grows to 28. Wire-format-parity cluster grows to 19. Capability-parity cluster grows to 11. Multimodal-IO cluster grows to 7 (#220 image-input + #224 embedding-output + #225 audio-bidirectional-on-separate-REST-endpoints + #226 image-output + #227 video-output + #228 mesh-output + #229 audio-text-tool-multiplex-on-persistent-WebSocket). Provider-asymmetric-delegation cluster grows to 6 (the second post-#224 provider-asymmetric-non-coverage case where Anthropic explicitly does not offer the endpoint family). Async-task-polling cluster: still 3 members (#229 is push-based not poll-based, so it does NOT join the async-task-polling cluster - instead it founds a NEW cluster).
**Persistent-WebSocket-transport cluster: 1 member (#229 alone).** **Bidirectional-symmetric-event-pair cluster: 1 member (#229 alone).** **Non-HTTP-transport cluster: 1 member (#229 alone).** The ten-layer-fusion-shape-with-persistent-WebSocket-transport-and-bidirectional-symmetric-event-pair-shape is the largest fusion-shape gap catalogued so far AND the first cluster member where transport-axis becomes a structural prerequisite of the dispatch layer (every prior cluster member used HTTP in some shape; #229 is the first to require a WebSocket client library, session-state-machine type, duplex-channel-pair Provider-trait return type, bidirectional event-pair taxonomy, push-based event dispatch loop, and persistent-connection lifecycle management). #229 is the upstream prerequisite of every voice-agent / live-coding-pair-programming / push-to-talk-coding / barge-in-coding-conversation / function-call-during-voice / streaming-tool-use / sub-second-latency-coding-interaction affordance - the canonical 2024-2026-era voice-coding workflow that is currently impossible to build on top of claw-code.
+
+**Status:** Open. No code changed. Filed 2026-04-26 04:30 KST. HEAD: 7113193 (post-#228). Branch: feat/jobdori-168c-emission-routing. Sibling-shape cluster: 28 pinpoints. Multimodal-IO cluster: 7 members. Provider-asymmetric-delegation cluster: 6 members. **Persistent-WebSocket-transport cluster: 1 member (founder).** **Non-HTTP-transport cluster: 1 member (founder).** **Bidirectional-symmetric-event-pair cluster: 1 member (founder).** Three new clusters founded in a single pinpoint - the first time a single cycle has founded three concurrent novel clusters. Ten-layer-fusion-shape exceeds #225/#227/#228's nine-layer count and is the largest single-pinpoint fusion catalogued.
Distinct from prior cluster members; the ten-layer-fusion-shape-with-persistent-WebSocket-transport-and-bidirectional-symmetric-event-pair is novel and applies to follow-on candidate Real-time-Image-Generation API typed taxonomy (DALL-E live preview, Imagen live preview - same persistent-WebSocket transport with image-modality output) and Real-time-Video-Generation streaming (Veo-Live, Sora-Live - same persistent-WebSocket transport with video-modality output) - the persistent-WebSocket-transport pattern is now a first-class cluster member, a structural prerequisite that every future endpoint family using persistent connections (Realtime API, WebRTC variants, gRPC streaming, Server-Sent Events that need bidirectional fallback) will inherit.
+
+🪨
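
The session lifecycle and duplex `(Sender, Receiver)` shape described in this pinpoint could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from claw-code. `RealtimeHandle`, `TransportEnd`, `duplex`, `drive`, `ServerEvent`, and the transition table are hypothetical names and choices. `std::sync::mpsc` stands in for whatever async channel a real implementation would use, and no WebSocket library is involved. Only the state names and the event-type strings are taken from the entry above.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Lifecycle states named in the Summary of this pinpoint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionState {
    Connecting,
    SessionUpdated,
    ConversationActive,
    ResponseInProgress,
    ResponseCompleted,
    Disconnected,
}

/// Hypothetical subset of server-to-client events, keyed by wire `type` string.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ServerEvent {
    SessionUpdated,
    ConversationItemCreated,
    ResponseCreated,
    ResponseAudioDelta,
    ResponseDone,
    Unknown(String),
}

impl ServerEvent {
    /// Map an event-type name to a variant; unrecognized names are preserved.
    pub fn from_type(name: &str) -> Self {
        match name {
            "session.updated" => Self::SessionUpdated,
            "conversation.item.created" => Self::ConversationItemCreated,
            "response.created" => Self::ResponseCreated,
            "response.audio.delta" => Self::ResponseAudioDelta,
            "response.done" => Self::ResponseDone,
            other => Self::Unknown(other.to_string()),
        }
    }
}

impl SessionState {
    /// Assumed transition table; events that do not apply leave the state unchanged.
    pub fn on_event(self, event: &ServerEvent) -> Self {
        match (self, event) {
            (Self::Connecting, ServerEvent::SessionUpdated) => Self::SessionUpdated,
            (
                Self::SessionUpdated | Self::ResponseCompleted,
                ServerEvent::ConversationItemCreated,
            ) => Self::ConversationActive,
            (Self::ConversationActive, ServerEvent::ResponseCreated) => Self::ResponseInProgress,
            (Self::ResponseInProgress, ServerEvent::ResponseDone) => Self::ResponseCompleted,
            (state, _) => state,
        }
    }
}

/// Caller-facing half of the duplex pair: send client event types, receive server events.
pub struct RealtimeHandle {
    pub outbound: Sender<String>,
    pub inbound: Receiver<ServerEvent>,
}

/// Transport-facing half, the side a WebSocket task would hold.
pub struct TransportEnd {
    pub from_client: Receiver<String>,
    pub to_client: Sender<ServerEvent>,
}

/// Build both halves of the duplex channel pair.
pub fn duplex() -> (RealtimeHandle, TransportEnd) {
    let (out_tx, out_rx) = channel();
    let (in_tx, in_rx) = channel();
    (
        RealtimeHandle { outbound: out_tx, inbound: in_rx },
        TransportEnd { from_client: out_rx, to_client: in_tx },
    )
}

/// Fold every server event currently queued into the session state.
pub fn drive(mut state: SessionState, inbound: &Receiver<ServerEvent>) -> SessionState {
    for event in inbound.try_iter() {
        state = state.on_event(&event);
    }
    state
}
```

In this sketch the caller never touches the transport: it sends client event types through `outbound` and folds whatever the server has pushed through `inbound` into a `SessionState`. That is the property that separates the duplex shape from the existing `send_message` and `stream_message` return types.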