
Voice overview

The realtime voice stack is a peer runtime to the Orchestrator — not an agent. It owns a persistent WebSocket, drives its own turn loop, and routes audio directly to playback. You instantiate a RealtimeRuntime, hand it a VoiceAssistant and audio I/O, call start(), and observe events.

| Type | Role |
| --- | --- |
| RealtimeRuntime | Wires a VoiceAssistant to AudioInput/AudioOutput; the object the app holds |
| VoiceAssistant | Protocol: start, stop, send audio/text, interrupt, events stream |
| OpenAIVoiceAssistant | Single-LLM implementation: calls tools, then speaks the answer |
| OpenAIGroundedVoiceAssistant | Gatherer→presenter: tools first, then an isolated presenter speaks |
| RealtimeTransport | Frame channel protocol; URLSessionWebSocketTransport is the live implementation |
| AudioInput / AudioOutput | PCM16 capture and playback seams; AVFoundation implementations in AgentSquadAudio |

VoiceAssistant is the contract every voice session satisfies. RealtimeRuntime consumes it; your app does not call it directly — use RealtimeRuntime’s surface instead.

public protocol VoiceAssistant: Sendable {
    var modality: RealtimeModality { get }
    var events: AsyncStream<RealtimeEvent> { get }
    func start() async throws
    func sendAudio(_ pcm16: Data) async
    func sendText(_ text: String) async
    func interrupt() async
    func stop() async
}
  • sendAudio expects 24 kHz mono PCM16 chunks, which is exactly what AudioInput.frames produces.
  • sendText triggers a typed turn (no VAD); the reply is text-only in OpenAIVoiceAssistant.
  • interrupt is an explicit barge-in: flushes playback and cancels the in-flight server response without tearing down the connection.
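
Because the contract fixes the format at PCM16, 24 kHz, mono, chunk sizes follow from simple arithmetic. The sketch below works that out. PCM16Format and its helpers are hypothetical names used only for illustration, and the 20 ms chunk length is an assumption; the text above does not say how long each chunk is.

```swift
// Hypothetical helper, not part of the library.
// Format constants come from the sendAudio contract: PCM16, 24 kHz, mono.
enum PCM16Format {
    static let sampleRate = 24_000   // samples per second
    static let bytesPerSample = 2    // 16-bit samples
    static let channels = 1          // mono

    /// Bytes produced by one second of audio in this format.
    static var bytesPerSecond: Int {
        sampleRate * bytesPerSample * channels
    }

    /// Byte count of a chunk lasting `milliseconds`.
    static func byteCount(milliseconds: Int) -> Int {
        bytesPerSecond * milliseconds / 1000
    }

    /// Playback duration, in milliseconds, of `byteCount` bytes.
    static func durationMs(byteCount: Int) -> Double {
        Double(byteCount) / Double(bytesPerSecond) * 1000
    }
}

// Assumed chunk length of 20 ms, purely for illustration.
let chunkBytes = PCM16Format.byteCount(milliseconds: 20)
```

A helper like this is handy in tests that feed synthetic audio, for example to check that a fake capture source emits chunks of a plausible size.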

RealtimeRuntime is the object an app instantiates. It wires the session to AudioInput/AudioOutput, re-broadcasts the session’s events, and handles audio routing including barge-in flushes.

public actor RealtimeRuntime {
    public nonisolated let events: AsyncStream<RealtimeEvent>
    public nonisolated var modality: RealtimeModality { get }
    public init(
        session: any VoiceAssistant,
        input: any AudioInput,
        output: any AudioOutput
    )
    public func start() async throws
    public func sendText(_ text: String) async
    public func interrupt() async
    public func stop() async
}

start() starts the output engine, then the session, then the mic — in that order — and launches two internal tasks: one pumping the session’s events to playback and re-broadcasting them, and one forwarding mic frames to the session. stop() tears down in producer-first order to avoid in-flight sends against an already-stopped session.

interrupt() flushes AudioOutput immediately for minimum latency before awaiting the server round-trip; the session’s own interrupt() then emits .audioDone(interrupted: true), which causes a second flush — harmless and idempotent.

let transport = URLSessionWebSocketTransport(
    apiKey: "sk-..."
)
let assistant = OpenAIVoiceAssistant(
    name: "voice-assistant",
    transport: transport,
    tools: myToolProvider,
    userId: "u1",
    sessionId: UUID().uuidString
)
// AudioInput / AudioOutput from AgentSquadAudio:
let runtime = RealtimeRuntime(
    session: assistant,
    input: MicCapture(),
    output: AudioPlayback()
)
try await runtime.start()
// Observe events until the stream finishes.
for await event in runtime.events {
    switch event {
    case .state(let phase): updateUI(phase)
    case .userTranscript(let text, final: true): showTranscript(text)
    case .presenterText(let text, final: true): showReply(text)
    case .error(let msg): print("error:", msg)
    default: break
    }
}

RealtimeModality controls what the session produces and consumes.

public struct RealtimeModality: Sendable, Equatable {
    public enum Input: Sendable { case speech, text }
    public enum Output: Sendable { case audio, text, audioAndText }
    public let input: Input
    public let output: Output
    public init(input: Input = .speech, output: Output = .audio)
}

The default RealtimeModality() is speech in / audio out. output: .audioAndText makes the session emit both .audio frames and .presenterText deltas in parallel.
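
As a rough illustration of the combinations described above, the sketch below builds three modalities. The struct is a local copy of the declaration so the snippet compiles on its own; in an app you would use the library's type instead. showsReplyText is a hypothetical helper, not a library function.

```swift
// Local copy of the RealtimeModality declaration, with an init body filled in
// so the snippet is self-contained.
struct RealtimeModality: Sendable, Equatable {
    enum Input: Sendable { case speech, text }
    enum Output: Sendable { case audio, text, audioAndText }
    let input: Input
    let output: Output
    init(input: Input = .speech, output: Output = .audio) {
        self.input = input
        self.output = output
    }
}

// Default: speech in, audio out.
let voiceOnly = RealtimeModality()
// Spoken reply plus .presenterText deltas, e.g. for live captions.
let captioned = RealtimeModality(output: .audioAndText)
// Typed input with a text reply.
let typedChat = RealtimeModality(input: .text, output: .text)

// Hypothetical helper: should the UI reserve space for reply text?
func showsReplyText(_ modality: RealtimeModality) -> Bool {
    modality.output == .text || modality.output == .audioAndText
}
```

Because the type is Equatable, a UI layer can compare modalities directly when deciding which controls to show.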


RealtimeEvent is the element type of the non-throwing event stream. All events arrive on RealtimeRuntime.events (which re-broadcasts the session’s own stream).

public enum RealtimeEvent: Sendable {
    case state(RealtimePhase)
    case userTranscript(String, final: Bool)  // STT delta or final
    case presenterText(String, final: Bool)   // spoken reply as text (audio+text or text mode)
    case widget(UIPayload)                    // MCP Apps UI payload for this turn
    case audio(Data)                          // PCM16 @ 24 kHz; drain promptly
    case audioDone(interrupted: Bool)         // flush playback: barge-in or natural end
    case error(String)
}
  • userTranscript streams incrementally (final: false) then fires once more with final: true. presenterText mirrors that pattern.
  • audio frames arrive continuously while the model speaks — RealtimeRuntime routes them to AudioOutput automatically; an app observing events directly must drain them promptly.
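
The delta-then-final pattern described above can be folded into a live caption roughly as follows. This is a sketch under two assumptions that the text does not confirm: non-final events carry incremental pieces to append, and the final event carries the complete text for the turn. TextEvent is a trimmed stand-in for RealtimeEvent, and Captions is a hypothetical type.

```swift
// Trimmed stand-in for RealtimeEvent: only the two text-bearing cases.
enum TextEvent {
    case userTranscript(String, final: Bool)
    case presenterText(String, final: Bool)
}

// Hypothetical accumulator that turns deltas into display strings.
struct Captions {
    private(set) var user = ""
    private(set) var reply = ""
    private var userStreaming = false   // true between the first delta and the final
    private var replyStreaming = false

    mutating func apply(_ event: TextEvent) {
        switch event {
        case .userTranscript(let text, final: let isFinal):
            user = Self.merge(user, streaming: userStreaming, text: text, isFinal: isFinal)
            userStreaming = !isFinal
        case .presenterText(let text, final: let isFinal):
            reply = Self.merge(reply, streaming: replyStreaming, text: text, isFinal: isFinal)
            replyStreaming = !isFinal
        }
    }

    private static func merge(_ current: String, streaming: Bool,
                              text: String, isFinal: Bool) -> String {
        if isFinal { return text }               // assumption: final holds the full text
        return streaming ? current + text : text // first delta of a turn starts fresh
    }
}
```

If the final event turns out to carry only the last piece rather than the whole text, the isFinal branch would need to append instead of replace.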

public enum RealtimePhase: String, Sendable {
    case idle        // not started
    case ready       // connected, awaiting typed input (text-input mode)
    case listening   // capturing the user's voice
    case thinking    // agent turn: calling tools
    case presenting  // grounded presenter speaking from curated data
    case speaking    // direct (no-tool) reply
}
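
One way an app might consume .state events is to map each phase to a status label and decide when to offer a barge-in control. The sketch below is illustrative only: the enum is a local copy of the declaration above, and statusLabel and isAssistantTalking are hypothetical helpers with made-up label strings.

```swift
// Local copy of RealtimePhase so the snippet compiles on its own.
enum RealtimePhase: String, Sendable {
    case idle, ready, listening, thinking, presenting, speaking
}

// Hypothetical mapping from phase to a short status line for the UI.
func statusLabel(for phase: RealtimePhase) -> String {
    switch phase {
    case .idle: return "Tap to start"
    case .ready: return "Type a message"
    case .listening: return "Listening…"
    case .thinking: return "Working on it…"
    case .presenting, .speaking: return "Speaking…"
    }
}

// Hypothetical check: an interrupt button only makes sense while audio is playing.
func isAssistantTalking(_ phase: RealtimePhase) -> Bool {
    phase == .presenting || phase == .speaking
}
```

The exhaustive switch means the compiler flags this function if a phase is added later.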

When store is provided at init, start() replays prior turns from ChatStorage as conversation items before the pump starts handling inbound frames. Each completed turn (user transcript + spoken reply) is saved under slugify(name). See Storage overview.


sendText (called on RealtimeRuntime) marks the turn as text-only: the assistant’s tool→continue loop stays in text, and .state(.speaking) is never emitted. Useful for non-voice UIs sharing the same session.


Each session is one trace: a voice.session root span opened on the first turn, with a voice.turn child per turn. Tool calls appear as tool.<name> children of the turn. Token usage (including per-modality audio token breakdown) is attached to the generation span inside each turn. Set traceTranscripts: false to keep spoken content off the trace while span structure and token counts still flow. See Tracing overview.


AudioInput and AudioOutput are protocol seams so the runtime is testable without hardware. The concrete AVFoundation implementations (MicCapture, AudioPlayback) live in the AgentSquadAudio module.

public protocol AudioInput: Sendable {
    var frames: AsyncStream<Data> { get }  // PCM16 @ 24 kHz mono, bounded drop-oldest
    func start() async throws
    func stop() async
}
public protocol AudioOutput: Sendable {
    func start() async throws
    func enqueue(_ pcm16: Data) async  // queue one PCM16 frame for playback
    func flush() async                 // drop all queued/playing audio (barge-in cut)
    func stop() async
}

frames is a bounded drop-oldest stream so a slow consumer never blocks the audio capture thread. flush() is the barge-in cut — it drops both queued and currently-playing audio immediately.
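
To make the two ideas above concrete, here is a minimal sketch of a hardware-free test double and of a bounded drop-oldest stream. The protocol is a local copy of AudioOutput so the snippet compiles alone. RecordingOutput and demoDropOldest are hypothetical names, and using AsyncStream's .bufferingNewest policy is an assumption about how a drop-oldest stream could be built, not a statement about how MicCapture does it. AsyncStream.makeStream requires Swift 5.9 or later.

```swift
import Foundation

// Local copy of the AudioOutput protocol.
protocol AudioOutput: Sendable {
    func start() async throws
    func enqueue(_ pcm16: Data) async
    func flush() async
    func stop() async
}

// Hypothetical test double: records frames instead of playing them.
actor RecordingOutput: AudioOutput {
    private(set) var queued: [Data] = []
    private(set) var flushCount = 0
    private(set) var isRunning = false

    func start() async throws { isRunning = true }

    func enqueue(_ pcm16: Data) async {
        // Ignore frames that arrive before start() or after stop().
        if isRunning { queued.append(pcm16) }
    }

    func flush() async {
        // Dropping an already-empty queue is a no-op, so repeated flushes are harmless.
        queued.removeAll()
        flushCount += 1
    }

    func stop() async {
        queued.removeAll()
        isRunning = false
    }
}

// Demonstrates drop-oldest buffering: five frames are yielded into a
// three-slot buffer before any consumer runs, so the two oldest are discarded.
func demoDropOldest() async -> [UInt8] {
    let (frames, continuation) = AsyncStream.makeStream(
        of: Data.self,
        bufferingPolicy: .bufferingNewest(3)
    )
    for i in UInt8(1)...5 {
        continuation.yield(Data([i]))  // never blocks the producer
    }
    continuation.finish()

    var kept: [UInt8] = []
    for await frame in frames {
        kept.append(frame[0])
    }
    return kept
}
```

A double like RecordingOutput could be passed as the output argument of RealtimeRuntime in tests, which would let assertions check that a barge-in results in a flush without touching audio hardware.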


Grounded vs. standard — when to use each

| | OpenAIVoiceAssistant | OpenAIGroundedVoiceAssistant |
| --- | --- | --- |
| Turn structure | Single response: tools then speak | Gatherer (tools, silent) → presenter (speaks) |
| Hallucination risk | Standard: model can mix tool data with priors | Low: presenter sees only curated tool output |
| Latency | Lower (one response) | Higher (two responses per tool-using turn) |
| Direct (no-tool) turns | Model speaks directly | Model speaks directly (directInstructions) |
| Phase events | .thinking, .speaking | .thinking, .presenting, .speaking |

Use OpenAIGroundedVoiceAssistant when factual accuracy matters and the answer derives from tool data. Use OpenAIVoiceAssistant for low-latency assistants where the model’s parametric knowledge is acceptable.