Voice Pipeline¶
The voice pipeline is the core interaction flow -- from microphone to speaker.
Sequence¶
sequenceDiagram
participant Node as Pi Zero Node
participant CC as Command Center
participant W as Whisper API
participant LLM as LLM Proxy
participant CMD as Command (Node)
participant TTS as TTS Service
Node->>Node: Wake word detected (Porcupine)
Node->>Node: Record audio until silence
Node->>CC: POST /api/v0/command (audio)
CC->>W: POST /transcribe (audio)
W-->>CC: {text, speaker: {user_id, confidence}}
CC->>CC: Resolve speaker → display name
CC->>CC: Load user memories
CC->>CC: Build system prompt + tool schemas
CC->>LLM: POST /v1/chat/completions
LLM-->>CC: Tool call: calculate(num1=5, num2=3, op="add")
CC->>CMD: execute(request_info, num1=5, num2=3, operation="add")
CMD-->>CC: CommandResponse {result: 8}
CC->>LLM: Tool result → generate spoken response
LLM-->>CC: "5 plus 3 equals 8"
CC->>TTS: POST /synthesize (text)
TTS-->>Node: Audio (via MQTT or direct)
Node->>Node: Play audio
Pipeline Stages¶
1. Wake Word Detection¶
Local on the Pi Zero using Porcupine. No audio leaves the device until the wake word is detected. This is a core privacy guarantee -- the node only starts recording after hearing the configured wake word.
2. Speech-to-Text (Whisper)¶
Audio is sent to jarvis-whisper-api which runs whisper.cpp. Returns transcription text plus optional speaker identification with a confidence score.
Key files:
- Node:
stt_providers/jarvis_whisper_client.py(TranscriptionResultwith speaker data) - Whisper:
app/api/voice_profiles.py(enrollment endpoints)
3. Speaker Resolution¶
Command Center resolves the speaker's user_id to a display name via jarvis-auth. Names are cached for 5 minutes to avoid repeated lookups.
Key file: jarvis-command-center/app/core/utils/speaker_resolver.py
4. Memory Injection¶
User-specific memories are loaded from PostgreSQL and injected into the system prompt. The LLM sees context like:
About Alex: likes black coffee, morning person
Users can manage memories through voice commands ("remember that I like black coffee") or the REST API.
Key files:
jarvis-command-center/app/services/memory_service.py(memory CRUD + prompt formatting)jarvis-command-center/app/core/tools/remember_tool.py(remember tool)jarvis-command-center/app/core/tools/forget_tool.py(forget tool)jarvis-command-center/app/api/memories.py(REST CRUD API)
5. Intent Classification (LLM)¶
The LLM receives the transcribed text along with all registered command schemas (tool definitions). It selects the appropriate command and extracts parameters.
The command center builds a system prompt containing:
- Current date/time context
- Speaker identity and memories
- All available tool schemas (command definitions with parameter types)
The LLM responds with a tool call specifying the command name and extracted arguments.
6. Command Execution¶
The selected command's execute() method runs. Commands implement the IJarvisCommand interface:
Commands validate secrets and parameters, then call run() with the extracted arguments.
7. Response Generation¶
The command's result (CommandResponse) is sent back to the LLM to generate a natural language spoken response. This second LLM call turns structured data into conversational speech.
8. Text-to-Speech¶
The spoken response text is sent to jarvis-tts which uses Piper TTS. Audio is delivered to the node via MQTT (Mosquitto broker) or direct HTTP response.
Pre-Routing (Fast Path)¶
Commands can implement pre_route() to claim short, unambiguous utterances without LLM inference:
def pre_route(self, voice_command: str) -> PreRouteResult | None:
if voice_command.strip().lower() == "pause":
return PreRouteResult(arguments={}, spoken_response="Paused.")
return None
This skips steps 5-7 entirely, reducing latency to near-zero for simple commands like "pause", "stop", or "nevermind".
Speaker ID and Memory Flow¶
Node (mic) --> Whisper --> {text, speaker: {user_id, confidence}}
|
v
Command Center receives transcription
|
+-- Extracts speaker_user_id from whisper response
+-- Resolves user_id -> display name via jarvis-auth (cached 5min)
+-- Loads user memories from PostgreSQL (MemoryService)
+-- Injects speaker name + memories into system prompt
|
+--> LLM sees: "About Alex: - Likes black coffee - Morning person"
LLM can call: remember({content: "..."}) / forget({content_match: "..."})
Performance Target¶
Total end-to-end latency target: < 5 seconds including:
- Whisper transcription
- Date context extraction
- Command inference (tool routing)
- Command execution and response