Skip to content

Whisper API

The Whisper API provides speech-to-text transcription via whisper.cpp. It supports speaker identification through voice profile enrollment, returning both the transcribed text and a speaker confidence score.

Quick Reference

Port 7706
Health endpoint GET /health
Source jarvis-whisper-api/
Framework FastAPI + Uvicorn
Backend whisper.cpp
Tier 3 - Specialized

API Endpoints

Method Path Description
GET /ping Simple liveness probe
GET /health Health check
POST /transcribe Transcribe audio to text
POST /api/v0/voice-profiles/enroll Enroll a voice profile for speaker ID
GET /api/v0/voice-profiles List enrolled voice profiles
DELETE /api/v0/voice-profiles/{id} Delete a voice profile
* /settings/* Settings CRUD (see Settings Server)

Transcription

The /transcribe endpoint accepts a WAV file upload. Optional query parameters tune accuracy and preprocessing:

Query Parameter Default Description
prompt Initial text to guide transcription style
preprocess false Apply audio normalization and silence trimming before transcription
temperature 0.0 Initial sampling temperature (0.0–1.0). 0 = greedy decoding
temperature_inc 0.2 Temperature increment on decode failure
beam_size 5 Beam size for beam search (1–16)

The response always includes a speaker field. Speaker identification is only populated when USE_VOICE_RECOGNITION=true:

{
  "text": "what's the weather like",
  "speaker": {
    "user_id": 1,
    "confidence": 0.87
  }
}

When speaker recognition is disabled, user_id is null and confidence is 0.0.

The command center uses the speaker result to resolve display names and load user-specific memories.

Speaker Identification

Voice profiles are enrolled as WAV samples. The service uses resemblyzer for speaker embedding comparison.

Latency

Resemblyzer on CPU adds 5–9 s per transcription request. On CUDA it is under 1 s. Use the GPU Docker image (docker-compose.gpu.yaml) if speaker ID latency matters.

Environment Variables

Variable Description
PORT API port (default 7706)
WHISPER_MODEL Path to the whisper.cpp GGML model file (e.g. ~/whisper.cpp/models/ggml-base.en.bin)
WHISPER_CLI Path to the whisper-cli binary — auto-detected from PATH if unset
WHISPER_ENABLE_CUDA Enable CUDA acceleration for the whisper.cpp build (false)
USE_VOICE_RECOGNITION Enable speaker identification via resemblyzer (false)
JARVIS_VOICE_DEVICE Device for the speaker-recognition encoder: auto, cpu, or cuda (default auto)
JARVIS_APP_ID App identity for service-to-service auth (default jarvis-whisper)
JARVIS_APP_KEY App key for service-to-service auth
JARVIS_AUTH_BASE_URL Auth service URL for validating incoming requests
JARVIS_AUTH_CACHE_SUCCESS_TTL Auth validation success cache TTL in seconds (default 300)
JARVIS_AUTH_CACHE_FAILURE_TTL Auth validation failure cache TTL in seconds (default 60)
JARVIS_LOG_CONSOLE_LEVEL Console log level (default INFO)
JARVIS_LOG_REMOTE_LEVEL Remote log level sent to jarvis-logs (default DEBUG)

Dependencies

  • whisper.cpp -- speech-to-text engine
  • resemblyzer -- speaker identification (optional, install via pip install .[speaker])
  • jarvis-auth -- validates node and app credentials
  • jarvis-logs -- structured logging

Dependents

  • jarvis-command-center -- sends audio for transcription

Impact if Down

Speech-to-text transcription is unavailable. If the command center uses server-side transcription (rather than on-device), voice input cannot be processed. Text-based commands still work.