TTS
The TTS (text-to-speech) service converts text responses into audio and streams them to nodes over HTTP. It supports multiple synthesis backends through a provider abstraction: Piper (default, baked-in, robotic but fast) and Kokoro (82M-param Apache 2.0 model, natural prosody, weights downloaded on first use). The active provider is selected at runtime via a setting.
The service can also generate contextual wake word responses by calling the LLM proxy.
Quick Reference
| Port | 7707 |
| Health endpoint | GET /health |
| Source | jarvis-tts/ |
| Framework | FastAPI + Uvicorn |
| Providers | Piper (default), Kokoro |
| Tier | 3 — Specialized |
API Endpoints
| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/ping` | — | Simple liveness probe |
| `GET` | `/health` | — | Health check |
| `GET` | `/audio/format` | app | Current provider's audio format (sample rate, width, channels) |
| `POST` | `/speak` | app | Synthesize text, return full WAV audio |
| `POST` | `/speak/stream` | app | Stream raw 16-bit PCM as it is synthesized (low latency). Format metadata in X-Audio-* response headers |
| `POST` | `/generate-wake-response` | app | Generate a charming wake-word greeting via the LLM proxy |
| `*` | `/settings/*` | — | Settings CRUD (see Settings Server) |
/speak/stream is the preferred endpoint for nodes — the node's play_pcm_stream() reads the X-Audio-Sample-Rate, X-Audio-Channels, and X-Audio-Sample-Width headers and pipes the raw PCM to aplay, so the node works with any provider regardless of sample rate.
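The header-driven playback described above can be sketched roughly as follows. This is an illustrative client, not the node's actual play_pcm_stream() implementation. It assumes X-Audio-Sample-Width is expressed in bytes per sample, that the request body is JSON with a `text` field, and that any auth headers are supplied by the caller; none of these details are confirmed by this page.

```python
import json
import subprocess
import urllib.request
from dataclasses import dataclass
from typing import Mapping

# aplay sample-format names keyed by sample width in bytes (assumed unit).
_APLAY_FORMATS = {1: "U8", 2: "S16_LE", 3: "S24_3LE", 4: "S32_LE"}


@dataclass(frozen=True)
class AudioFormat:
    sample_rate: int
    channels: int
    sample_width: int  # bytes per sample (assumption)


def parse_audio_headers(headers: Mapping[str, str]) -> AudioFormat:
    """Read the X-Audio-* response headers into an AudioFormat."""
    return AudioFormat(
        sample_rate=int(headers["X-Audio-Sample-Rate"]),
        channels=int(headers["X-Audio-Channels"]),
        sample_width=int(headers["X-Audio-Sample-Width"]),
    )


def aplay_command(fmt: AudioFormat) -> list[str]:
    """Build an aplay invocation that plays raw PCM from stdin."""
    return [
        "aplay", "-q", "-t", "raw",
        "-f", _APLAY_FORMATS[fmt.sample_width],
        "-r", str(fmt.sample_rate),
        "-c", str(fmt.channels),
        "-",
    ]


def play_stream(url: str, text: str, auth_headers: Mapping[str, str]) -> None:
    """POST text to /speak/stream and pipe the PCM response into aplay.

    The JSON body shape and the auth headers are hypothetical.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json", **auth_headers},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        fmt = parse_audio_headers(response.headers)
        player = subprocess.Popen(aplay_command(fmt), stdin=subprocess.PIPE)
        try:
            # Forward audio chunk by chunk so playback starts before synthesis ends.
            while chunk := response.read(4096):
                player.stdin.write(chunk)
        finally:
            player.stdin.close()
            player.wait()
```

Because the playback parameters come from the response rather than from constants, the same client would keep working if the active provider changes and the sample rate changes with it.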
Provider Selection
| Setting | Type | Default | Description |
|---|---|---|---|
| `tts.provider` | string | `kokoro` | Active backend: piper or kokoro |
| `tts.default_voice` | string | `en_GB-alan-low` | Piper ONNX voice file name (looked up in app/models/) |
| `tts.kokoro_voice` | string | `bm_george` | Kokoro voice ID. Notable British male options: bm_george, bm_fable, bm_daniel, bm_lewis |
| `tts.kokoro_speed` | float | `1.25` | Kokoro speech speed multiplier (validated natural-sounding default) |
Changes take effect within ~60 seconds (the settings service has a 60s cache). No container restart is required — the provider is rebuilt lazily on the next synthesis request. If the newly selected provider fails to load, the service logs a warning and falls back to Piper so voice responses never break.
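The lazy rebuild and Piper fallback described above might look something like the sketch below. The `ProviderManager` class, the `read_setting` callable, and the factory mapping are hypothetical names chosen for illustration; the service's real provider abstraction may be organized differently.

```python
import logging
from typing import Any, Callable, Mapping, Optional

log = logging.getLogger("tts.providers")


class ProviderManager:
    """Rebuilds the provider only when the selected name changes."""

    def __init__(
        self,
        read_setting: Callable[[], str],
        factories: Mapping[str, Callable[[], Any]],
        fallback: str = "piper",
    ) -> None:
        self._read_setting = read_setting  # assumed to be cached upstream (~60s)
        self._factories = factories
        self._fallback = fallback
        self._requested: Optional[str] = None
        self._active: Any = None

    def get(self) -> Any:
        """Return the provider to use for the current synthesis request."""
        wanted = self._read_setting()
        if wanted != self._requested:
            # Remember the requested name even if loading fails, so a broken
            # provider is not retried on every request.
            self._requested = wanted
            self._active = self._build(wanted)
        return self._active

    def _build(self, name: str) -> Any:
        try:
            return self._factories[name]()
        except Exception:
            log.warning(
                "provider %r failed to load; falling back to %r",
                name, self._fallback, exc_info=True,
            )
            return self._factories[self._fallback]()
```

In this sketch nothing is built until `get()` is called, which mirrors the idea that a settings change costs nothing until the next synthesis request arrives.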
Comparing providers
| | Piper | Kokoro |
|---|---|---|
| Install | Baked into image | Optional (pip install .[kokoro]) |
| Model weights | ~15 MB, in image | ~300 MB, downloaded to HF_HOME on first use |
| Hardware | CPU only | CPU (~2–3× realtime) or GPU (fast) |
| Latency | Very low | ~550 ms time-to-first-audio |
| Quality | Robotic but intelligible | Natural prosody, especially on long text |
| Voice selection | Per-model files | Built-in multilingual catalog (see VOICES.md in hexgrad/kokoro) |
Model Caching
Kokoro weights download lazily via huggingface_hub on first use. The Docker image mounts a named volume at HF_HOME=/app/models/hf_cache so weights persist across container restarts — otherwise each cold start re-downloads ~300 MB. The installer (jarvis-admin) and the docker-compose.*.yaml files in the service declare this volume (jarvis-tts-hf-cache).
Environment Variables
| Variable | Description |
|---|---|
| `TTS_PORT` | API port (default 7707) |
| `TTS_PROVIDER` | Initial provider selection (also settable via tts.provider) |
| `HF_HOME` | Cache dir for Kokoro voice weights (default /app/models/hf_cache) |
| `JARVIS_AUTH_BASE_URL` | Auth service URL |
| `JARVIS_APP_ID` | App identity for app-to-app auth (default jarvis-tts) |
| `JARVIS_APP_KEY` | App key for app-to-app auth |
| `JARVIS_LLM_PROXY_API_URL` | LLM proxy URL (for wake responses) |
| `JARVIS_CONFIG_URL` | Config service URL (for discovery) |
| `NODE_AUTH_CACHE_TTL` | Auth validation cache TTL (seconds) |
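As a rough illustration of how these variables could be consumed, the sketch below reads them into a typed config object. The `TTSConfig` dataclass and `load_config` function are hypothetical; only the variable names and the three documented defaults come from the table, and variables without a documented default are left as `None` rather than guessed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TTSConfig:
    port: int
    provider: Optional[str]
    hf_home: str
    auth_base_url: Optional[str]
    app_id: str
    app_key: Optional[str]
    llm_proxy_url: Optional[str]
    config_url: Optional[str]
    auth_cache_ttl: Optional[int]  # seconds


def load_config(env: Mapping[str, str] = os.environ) -> TTSConfig:
    """Build a TTSConfig from environment variables."""
    ttl = env.get("NODE_AUTH_CACHE_TTL")
    return TTSConfig(
        port=int(env.get("TTS_PORT", "7707")),
        provider=env.get("TTS_PROVIDER"),
        hf_home=env.get("HF_HOME", "/app/models/hf_cache"),
        auth_base_url=env.get("JARVIS_AUTH_BASE_URL"),
        app_id=env.get("JARVIS_APP_ID", "jarvis-tts"),
        app_key=env.get("JARVIS_APP_KEY"),
        llm_proxy_url=env.get("JARVIS_LLM_PROXY_API_URL"),
        config_url=env.get("JARVIS_CONFIG_URL"),
        auth_cache_ttl=int(ttl) if ttl is not None else None,
    )
```

Passing the environment in as a mapping keeps the loader easy to exercise in tests without touching the real process environment.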
Dependencies
- Piper TTS — default backend, baked into the image
- Kokoro TTS — optional backend, installed via the `kokoro` extra
- jarvis-auth — validates node and app credentials
- jarvis-logs — structured logging
- jarvis-llm-proxy-api — wake-response generation (optional)
- jarvis-config-service — service discovery (optional)
Dependents
- jarvis-node-setup — Pi Zero nodes call `/speak/stream` for voice responses
- jarvis-command-center — requests TTS for voice responses
Impact if Down
No voice responses from Jarvis. Nodes receive text-only responses (if the client supports it). Wake-word acknowledgment audio is unavailable. The service is not on the critical path — command processing continues to work; only audible output is lost.