Benchmarks

LLM benchmarks for Jarvis are tracked in the docs/benchmarks/ directory of the main repository. These benchmarks measure command parsing accuracy, latency, and memory usage across different models and backends.

What Is Measured

  • Command parsing accuracy -- Does the LLM correctly identify the intended command and extract the right parameters from a voice transcription?
  • Latency -- Inference time, measured from receiving the transcription to returning a tool call.
  • Memory usage -- VRAM and RAM consumption for different model sizes and quantization levels.
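As a rough illustration of the first two metrics, the sketch below scores a single test case for command match, parameter match, and wall-clock latency. The names `ParseResult`, `score_case`, and `fake_parse` are hypothetical and are not taken from the Jarvis codebase; the stand-in parser simply returns a fixed result where a real run would call the model.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ParseResult:
    """Hypothetical shape of a parsed command: a name plus its parameters."""
    command: str
    params: dict


def score_case(parse: Callable[[str], ParseResult],
               transcription: str,
               expected: ParseResult) -> dict:
    """Run one transcription through the parser and score the outcome."""
    start = time.perf_counter()
    actual = parse(transcription)
    latency_s = time.perf_counter() - start
    return {
        "command_ok": actual.command == expected.command,
        "params_ok": actual.params == expected.params,
        "latency_s": latency_s,
    }


def fake_parse(text: str) -> ParseResult:
    """Stand-in for the LLM call (assumption: the real parser calls a model)."""
    return ParseResult("get_weather", {"city": "Paris"})


outcome = score_case(
    fake_parse,
    "what's the weather in Paris",
    ParseResult("get_weather", {"city": "Paris"}),
)
print(outcome["command_ok"], outcome["params_ok"])
```

Scoring the command and its parameters separately makes it possible to tell a wrong-command failure apart from a right-command, wrong-parameter failure.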

Tested Configurations

Benchmarks cover multiple dimensions:

Dimension      Variants
Models         Qwen 2.5 (7B, 14B), Qwen 3 (14B, 32B), Llama 3.1 8B, Hermes 3 8B, others
Backends       MLX (macOS Metal), GGUF (llama.cpp), vLLM (CUDA)
Quantization   Q4_K_M, Q6_K, FP16
Adapters       Base model vs LoRA fine-tuned

Where to Find Them

Benchmark results and comparison tables are maintained in:

docs/benchmarks/

Each benchmark run records the model, backend, quantization level, hardware, test suite version, and per-command accuracy breakdown.
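One way to picture such a record is the minimal sketch below. The class name `BenchmarkRun` and its field names are assumptions made for illustration, not the schema actually used in docs/benchmarks/, and every value shown is a placeholder rather than a measured result.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BenchmarkRun:
    """Hypothetical record mirroring the fields listed above."""
    model: str
    backend: str
    quantization: str
    hardware: str
    suite_version: str
    per_command_accuracy: dict[str, float] = field(default_factory=dict)


# Placeholder values only; these are not real benchmark results.
run = BenchmarkRun(
    model="Qwen 2.5 7B",
    backend="GGUF (llama.cpp)",
    quantization="Q4_K_M",
    hardware="<hardware description>",
    suite_version="<suite version>",
    per_command_accuracy={"calculate": 0.5, "get_weather": 0.5},
)

serialized = json.dumps(asdict(run), indent=2)
print(serialized)
```

Keeping the hardware and test suite version alongside the scores matters because results from different machines or suite revisions are not directly comparable.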

Running Benchmarks

Use the end-to-end (E2E) command parsing test suite to generate benchmark data:

cd jarvis-node-setup

# Run all command parsing tests
python test_command_parsing.py -o benchmark_results.json

# Run for specific commands
python test_command_parsing.py -c calculate get_weather send_email -o benchmark_results.json

Results include per-command success rates, average response times, and a confusion matrix showing which commands get misclassified.
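To show how per-command success rates and a confusion matrix relate, here is a small sketch that derives both from (expected, predicted) command pairs. The function `summarize` and the sample pairs are illustrative assumptions; the actual layout of `benchmark_results.json` is not documented here and may differ.

```python
from collections import Counter, defaultdict


def summarize(pairs):
    """Build a confusion matrix and per-command success rates.

    pairs: iterable of (expected_command, predicted_command) tuples.
    """
    confusion = defaultdict(Counter)
    for expected, predicted in pairs:
        confusion[expected][predicted] += 1

    # Success rate = correct predictions / all cases for that expected command.
    success_rates = {
        expected: row[expected] / sum(row.values())
        for expected, row in confusion.items()
    }
    return success_rates, {cmd: dict(row) for cmd, row in confusion.items()}


# Made-up pairs for illustration, not real test output.
pairs = [
    ("calculate", "calculate"),
    ("calculate", "get_weather"),
    ("get_weather", "get_weather"),
    ("send_email", "send_email"),
]

rates, confusion = summarize(pairs)
print(rates)
print(confusion)
```

Each row of the matrix is keyed by the expected command, so off-diagonal entries show which command the model chose instead.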