TTS

The TTS (text-to-speech) service converts text responses into audio and streams them to nodes over HTTP. It supports multiple synthesis backends through a provider abstraction: Piper (default, baked-in, robotic but fast) and Kokoro (82M-param Apache 2.0 model, natural prosody, weights downloaded on first use). The active provider is selected at runtime via a setting.

The service can also generate contextual wake word responses by calling the LLM proxy.

Quick Reference

Port: 7707
Health endpoint: GET /health
Source: jarvis-tts/
Framework: FastAPI + Uvicorn
Providers: Piper (default), Kokoro
Tier: 3 (Specialized)

API Endpoints

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /ping | | Simple liveness probe |
| GET | /health | | Health check |
| GET | /audio/format | app | Current provider's audio format (sample rate, width, channels) |
| POST | /speak | app | Synthesize text, return full WAV audio |
| POST | /speak/stream | app | Stream raw 16-bit PCM as it is synthesized (low latency). Format metadata in X-Audio-* response headers |
| POST | /generate-wake-response | app | Generate a charming wake-word greeting via the LLM proxy |
| * | /settings/* | | Settings CRUD (see Settings Server) |

/speak/stream is the preferred endpoint for nodes — the node's play_pcm_stream() reads the X-Audio-Sample-Rate, X-Audio-Channels, and X-Audio-Sample-Width headers and pipes the raw PCM to aplay, so the node works with any provider regardless of sample rate.
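As an illustration of the header handling described above, here is a minimal sketch of how a client might turn the X-Audio-* headers into an aplay command line. It is not the node's actual play_pcm_stream() implementation: the function name aplay_command is hypothetical, and the sketch assumes X-Audio-Sample-Width is expressed in bytes per sample (2 for 16-bit PCM), which the source does not state.

```python
from typing import List, Mapping

# Assumption: X-Audio-Sample-Width is given in bytes per sample.
# The values are standard ALSA sample format names accepted by aplay -f.
_ALSA_FORMATS = {1: "U8", 2: "S16_LE", 3: "S24_3LE", 4: "S32_LE"}


def aplay_command(headers: Mapping[str, str]) -> List[str]:
    """Build an aplay argv that plays raw PCM from stdin in the advertised format.

    Hypothetical helper for illustration only.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    rate = int(lowered["x-audio-sample-rate"])
    channels = int(lowered["x-audio-channels"])
    width = int(lowered["x-audio-sample-width"])
    if width not in _ALSA_FORMATS:
        raise ValueError(f"unsupported sample width: {width}")
    # With no file argument, aplay reads the audio data from stdin.
    return [
        "aplay", "-q",
        "-t", "raw",
        "-f", _ALSA_FORMATS[width],
        "-r", str(rate),
        "-c", str(channels),
    ]
```

A client would start this command with subprocess.Popen(..., stdin=subprocess.PIPE) and write each chunk of the streamed response body to the process's stdin. Because the format comes from the headers, nothing on the client side needs to change when the provider does.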

Provider Selection

| Setting | Type | Default | Description |
|---|---|---|---|
| tts.provider | string | kokoro | Active backend: piper or kokoro |
| tts.default_voice | string | en_GB-alan-low | Piper ONNX voice file name (looked up in app/models/) |
| tts.kokoro_voice | string | bm_george | Kokoro voice ID. Notable British male options: bm_george, bm_fable, bm_daniel, bm_lewis |
| tts.kokoro_speed | float | 1.25 | Kokoro speech speed multiplier (validated natural-sounding default) |

Changes take effect within ~60 seconds (the settings service has a 60s cache). No container restart is required — the provider is rebuilt lazily on the next synthesis request. If the newly selected provider fails to load, the service logs a warning and falls back to Piper so voice responses never break.
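The lazy rebuild and Piper fallback described above can be sketched roughly as follows. This is an illustrative pattern, not the service's code: ProviderManager, its factories argument, and the logger name are all hypothetical, and the sketch assumes the fallback stays cached until the setting changes again.

```python
import logging
from typing import Callable, Dict, Optional

log = logging.getLogger("tts.providers")  # hypothetical logger name


class ProviderManager:
    """Rebuild the active provider lazily; fall back if it fails to load."""

    def __init__(self, factories: Dict[str, Callable[[], object]],
                 fallback: str = "piper") -> None:
        self._factories = factories
        self._fallback = fallback
        self._requested: Optional[str] = None
        self._provider: Optional[object] = None

    def get(self, requested: str) -> object:
        """Return the provider for `requested`, rebuilding only on change."""
        if self._provider is not None and requested == self._requested:
            return self._provider  # setting unchanged: reuse the loaded backend
        try:
            provider = self._factories[requested]()
        except Exception as exc:
            # Assumption: any load failure (or unknown name) triggers the fallback.
            log.warning("provider %r failed to load (%s); falling back to %r",
                        requested, exc, self._fallback)
            provider = self._factories[self._fallback]()
        # Remember what was asked for, so a failed load is not retried on
        # every request; it is retried when the setting changes again.
        self._requested = requested
        self._provider = provider
        return provider
```

Each synthesis request would call get() with the current value of tts.provider. The backend is only reloaded when that value differs from the previous request, which is why no container restart is needed.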

Comparing providers

| | Piper | Kokoro |
|---|---|---|
| Install | Baked into image | Optional (pip install .[kokoro]) |
| Model weights | ~15 MB, in image | ~300 MB, downloaded to HF_HOME on first use |
| Hardware | CPU only | CPU (~2–3× realtime) or GPU (fast) |
| Latency | Very low | ~550 ms time-to-first-audio |
| Quality | Robotic but intelligible | Natural prosody, especially on long text |
| Voice selection | Per-model files | Built-in multilingual catalog (see VOICES.md in hexgrad/kokoro) |

Model Caching

Kokoro weights download lazily via huggingface_hub on first use. The Docker image mounts a named volume at HF_HOME=/app/models/hf_cache so weights persist across container restarts — otherwise each cold start re-downloads ~300 MB. The installer (jarvis-admin) and the docker-compose.*.yaml files in the service declare this volume (jarvis-tts-hf-cache).

Environment Variables

| Variable | Description |
|---|---|
| TTS_PORT | API port (default 7707) |
| TTS_PROVIDER | Initial provider selection (also settable via tts.provider) |
| HF_HOME | Cache dir for Kokoro voice weights (default /app/models/hf_cache) |
| JARVIS_AUTH_BASE_URL | Auth service URL |
| JARVIS_APP_ID | App identity for app-to-app auth (default jarvis-tts) |
| JARVIS_APP_KEY | App key for app-to-app auth |
| JARVIS_LLM_PROXY_API_URL | LLM proxy URL (for wake responses) |
| JARVIS_CONFIG_URL | Config service URL (for discovery) |
| NODE_AUTH_CACHE_TTL | Auth validation cache TTL (seconds) |
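To make the defaults above concrete, here is a hedged sketch of how these variables could be read into a typed config object. TTSConfig and from_env are hypothetical names, not part of the service. Only the three defaults documented above (7707, /app/models/hf_cache, jarvis-tts) are filled in, and every other variable is left as None when unset because the source gives no default for it.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TTSConfig:
    """Hypothetical typed view of the environment variables listed above."""

    port: int
    provider: Optional[str]
    hf_home: str
    auth_base_url: Optional[str]
    app_id: str
    app_key: Optional[str]
    llm_proxy_url: Optional[str]
    config_url: Optional[str]
    node_auth_cache_ttl: Optional[int]

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "TTSConfig":
        # No default is documented for the TTL, so it stays None when unset.
        ttl = env.get("NODE_AUTH_CACHE_TTL")
        return cls(
            port=int(env.get("TTS_PORT", "7707")),
            provider=env.get("TTS_PROVIDER"),
            hf_home=env.get("HF_HOME", "/app/models/hf_cache"),
            auth_base_url=env.get("JARVIS_AUTH_BASE_URL"),
            app_id=env.get("JARVIS_APP_ID", "jarvis-tts"),
            app_key=env.get("JARVIS_APP_KEY"),
            llm_proxy_url=env.get("JARVIS_LLM_PROXY_API_URL"),
            config_url=env.get("JARVIS_CONFIG_URL"),
            node_auth_cache_ttl=int(ttl) if ttl is not None else None,
        )
```

Passing the mapping as a parameter rather than reading os.environ directly keeps the loader easy to exercise in tests, for example TTSConfig.from_env({"TTS_PORT": "8000"}).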

Dependencies

  • Piper TTS — default backend, baked into the image
  • Kokoro TTS — optional backend, installed via the kokoro extra
  • jarvis-auth — validates node and app credentials
  • jarvis-logs — structured logging
  • jarvis-llm-proxy-api — wake-response generation (optional)
  • jarvis-config-service — service discovery (optional)

Dependents

  • jarvis-node-setup — Pi Zero nodes call /speak/stream for voice responses
  • jarvis-command-center — requests TTS for voice responses

Impact if Down

No voice responses from Jarvis. Nodes receive text-only responses (if the client supports it). Wake-word acknowledgment audio is unavailable. The service is not on the critical path — command processing continues to work; only audible output is lost.