Testing Commands¶
Jarvis provides three levels of testing for commands: E2E command parsing tests, multi-turn conversation tests, and direct unit tests. Use all three for comprehensive coverage.
E2E Command Parsing Tests¶
test_command_parsing.py tests the "front half" of the pipeline: given a voice command, does the LLM select the right command and extract the right parameters?
Prerequisites¶
- Register a dev node (see jarvis-node-setup CLAUDE.md)
- Start required services:
# Command center (port 7703)
cd jarvis-command-center && ./run-docker-dev.sh
# LLM proxy (port 7704)
cd jarvis-llm-proxy-api && ./run.sh
Running Tests¶
cd jarvis-node-setup
# Run all tests
python test_command_parsing.py
# List all available tests with indices
python test_command_parsing.py -l
# Run specific tests by index
python test_command_parsing.py -t 5 7 11
# Run all tests for specific commands
python test_command_parsing.py -c calculate get_weather
# Custom output file
python test_command_parsing.py -o my_results.json
Test Structure¶
Each test is a CommandTest with a voice command, expected command name, and expected parameters:
class CommandTest:
def __init__(
self,
voice_command: str, # "What's 5 plus 3?"
expected_command: str, # "calculate"
expected_params: dict, # {"num1": 5, "num2": 3, "operation": "add"}
description: str, # Human-readable test description
ha_context: dict | None, # Optional Home Assistant context
):
Adding Tests for Your Command¶
Add test cases to the test list in test_command_parsing.py:
CommandTest(
voice_command="Roll a d20",
expected_command="roll_dice",
expected_params={"sides": 20},
description="Roll a 20-sided die",
),
CommandTest(
voice_command="Roll 3 twelve-sided dice",
expected_command="roll_dice",
expected_params={"sides": 12, "count": 3},
description="Roll multiple dice with custom sides",
),
Understanding Results¶
Results are written to test_results.json (or your custom path). The output includes:
{
"summary": {
"total": 50,
"passed": 47,
"failed": 3,
"success_rate": 94.0,
"avg_response_time_ms": 850
},
"results": [
{
"voice_command": "Roll a d20",
"expected_command": "roll_dice",
"actual_command": "roll_dice",
"expected_params": {"sides": 20},
"actual_params": {"sides": 20},
"passed": true,
"response_time_ms": 720
}
],
"analysis": {
"command_success_rates": {
"calculate": {"total": 5, "passed": 5, "rate": 100.0},
"get_weather": {"total": 8, "passed": 7, "rate": 87.5}
},
"confusion_matrix": {
"get_weather -> search_web": 1
}
}
}
Key sections:
- summary -- overall pass/fail counts and success rate
- results -- per-test details with expected vs. actual
- analysis -- command-level success rates and confusion matrix showing which commands get confused with which
Interpreting Failures¶
| Failure Type | Meaning | Fix |
|---|---|---|
| Wrong command selected | LLM confused this with another command | Add antipatterns, improve description, add more examples |
| Wrong parameters | Right command, wrong param values | Add more prompt_examples for this pattern, add critical_rules |
| Missing parameters | Required param not extracted | Check description on the parameter, add examples showing this param |
| Extra parameters | LLM hallucinated a parameter | Add rules telling the LLM what NOT to include |
Multi-Turn Conversation Tests¶
test_multi_turn_conversation.py tests the "back half": tool execution, validation flows, context preservation, and result incorporation.
Prerequisites¶
Same as E2E parsing tests, plus optionally:
# For full audio pipeline mode:
cd jarvis-tts && ./run-docker-dev.sh # TTS (port 7707)
cd jarvis-whisper-api && ./run-dev.sh # Whisper (port 7706)
Running Tests¶
cd jarvis-node-setup
# Fast mode (text-based, no audio)
python test_multi_turn_conversation.py
# Full mode (TTS -> Whisper -> Command Center)
python test_multi_turn_conversation.py --full
# List all tests
python test_multi_turn_conversation.py -l
# Run specific category
python test_multi_turn_conversation.py -c validation
# Run specific tests by index
python test_multi_turn_conversation.py -t 0 1 2
# Save audio artifacts (full mode only)
python test_multi_turn_conversation.py --full -t 0 1 2 --save-audio ./audio_artifacts/
Test Categories¶
| Category | Tests |
|---|---|
tool_execution |
Single-turn tool execution (happy path) |
validation |
Validation and clarification flows |
result_incorporation |
Tool results appear in final spoken response |
context |
Context preservation across turns |
error_handling |
Graceful error handling |
complex |
Complex queries (knowledge, unit conversions) |
Test Structure¶
Multi-turn tests define a sequence of conversation turns:
@dataclass
class Turn:
voice_command: str | None # What the user says (None for continuation)
expected_stop_reason: StopReason # tool_calls, validation_required, or complete
expected_tool: str | None # Expected tool to be called
expected_params: dict | None # Expected parameters (subset match)
validation_response: str | None # Response if validation is needed
@dataclass
class MultiTurnTest:
description: str
turns: list[Turn]
category: str
verify_response_contains: str | None # Check final response text
Example Multi-Turn Test¶
MultiTurnTest(
description="Calculator with follow-up",
category="context",
turns=[
Turn(
voice_command="What's 5 plus 3?",
expected_stop_reason=StopReason.TOOL_CALLS,
expected_tool="calculate",
expected_params={"num1": 5, "num2": 3, "operation": "add"},
),
Turn(
voice_command="Now multiply that by 2",
expected_stop_reason=StopReason.TOOL_CALLS,
expected_tool="calculate",
expected_params={"num1": 8, "num2": 2, "operation": "multiply"},
),
],
verify_response_contains="16",
)
Fast vs. Full Mode¶
| Aspect | Fast Mode | Full Mode |
|---|---|---|
| Input | Text sent directly | TTS generates audio, Whisper transcribes |
| Speed | ~1s per turn | ~5-10s per turn |
| Coverage | Command parsing + execution | Full audio pipeline |
| Services needed | CC + LLM proxy | CC + LLM proxy + TTS + Whisper |
| Use when | Developing/iterating | Final verification before deploy |
Unit Testing run() Directly¶
For fast iteration on command logic, test run() directly without involving the command center or LLM.
Basic Pattern¶
import pytest
from commands.dice_command import DiceCommand
from core.request_information import RequestInformation
@pytest.fixture
def cmd():
return DiceCommand()
@pytest.fixture
def request_info():
return RequestInformation(
voice_command="test",
conversation_id="test-conv-001",
)
def test_basic_roll(cmd, request_info):
response = cmd.run(request_info, sides=6, count=1)
assert response.success
assert len(response.context_data["rolls"]) == 1
assert 1 <= response.context_data["rolls"][0] <= 6
def test_multiple_dice(cmd, request_info):
response = cmd.run(request_info, sides=20, count=3)
assert response.success
assert len(response.context_data["rolls"]) == 3
assert response.context_data["total"] == sum(response.context_data["rolls"])
def test_invalid_sides(cmd, request_info):
response = cmd.run(request_info, sides=1, count=1)
assert not response.success
assert "at least 2 sides" in response.error_details
def test_too_many_dice(cmd, request_info):
response = cmd.run(request_info, sides=6, count=101)
assert not response.success
assert "between 1 and 100" in response.error_details
Testing the Full Execute Pipeline¶
To test validation, use execute() instead of run():
def test_execute_with_missing_required_param(cmd, request_info):
"""execute() should raise ValueError for missing required params"""
# If your command has required params, omitting them should fail
with pytest.raises(ValueError, match="Missing required params"):
cmd.execute(request_info) # No kwargs provided
def test_execute_with_invalid_enum(cmd, request_info):
"""execute() should return validation_error for invalid enum values"""
# For a command with enum parameters like calculator
calc = CalculatorCommand()
response = calc.execute(request_info, num1=5, num2=3, operation="invalid_op")
assert not response.success
assert response.context_data.get("_validation_error")
Testing Pre-Route¶
def test_pre_route_matches(cmd):
result = cmd.pre_route("pause")
assert result is not None
assert result.arguments == {"action": "pause"}
def test_pre_route_falls_through(cmd):
result = cmd.pre_route("play some jazz music in the living room")
assert result is None # Too complex, falls through to LLM
Testing Post-Process¶
def test_post_process_fixes_missing_query(cmd):
args = {"action": "play"}
result = cmd.post_process_tool_call(args, "Play some jazz")
assert result["query"] == "jazz"
def test_post_process_preserves_existing_query(cmd):
args = {"action": "play", "query": "Beatles"}
result = cmd.post_process_tool_call(args, "Play Beatles")
assert result["query"] == "Beatles" # Unchanged
Testing Handle Action¶
def test_handle_send_action(cmd, request_info):
context = {
"draft": {"to": "[email protected]", "subject": "Hi", "body": "Hello!"}
}
response = cmd.handle_action("send_click", context)
assert response.success
def test_handle_cancel_action(cmd, request_info):
response = cmd.handle_action("cancel_click", {})
assert response.success
assert response.context_data["cancelled"]
def test_handle_unknown_action(cmd, request_info):
response = cmd.handle_action("unknown_action", {})
assert not response.success
Mocking Secrets¶
For commands that depend on secrets, mock get_secret_value:
from unittest.mock import patch
@patch("services.secret_service.get_secret_value")
def test_run_with_api_key(mock_secret, cmd, request_info):
mock_secret.side_effect = lambda key, scope: {
("FINANCE_API_KEY", "integration"): "test-key-123",
("FINANCE_DEFAULT_CURRENCY", "integration"): "USD",
}.get((key, scope))
response = cmd.run(request_info, ticker="AAPL")
assert response.success
Mocking HTTP Requests¶
from unittest.mock import patch, MagicMock
@patch("httpx.get")
@patch("services.secret_service.get_secret_value", return_value="test-key")
def test_api_timeout(mock_secret, mock_get, cmd, request_info):
mock_get.side_effect = httpx.TimeoutException("timeout")
response = cmd.run(request_info, ticker="AAPL")
assert not response.success
assert "not responding" in response.error_details
Running Tests¶
cd jarvis-node-setup
# Run all unit tests
pytest
# Run tests for a specific command
pytest tests/test_dice_command.py
# Run with coverage
pytest --cov=commands --cov-report=html
# Run with verbose output
pytest -v tests/test_dice_command.py
Test Strategy Summary¶
| Level | What It Tests | Speed | Services Needed |
|---|---|---|---|
Unit tests (pytest) |
run() logic, validation, pre_route |
Fast (ms) | None |
E2E parsing (test_command_parsing.py) |
LLM command selection + parameter extraction | Medium (1-2s/test) | CC + LLM proxy |
Multi-turn fast (test_multi_turn_conversation.py) |
Execution flow, validation, context | Medium (1-2s/turn) | CC + LLM proxy |
Multi-turn full (--full) |
Complete audio pipeline | Slow (5-10s/turn) | CC + LLM + TTS + Whisper |
Recommended workflow:
- Write unit tests first (TDD)
- Add E2E parsing tests for your command
- Add multi-turn tests for complex flows
- Run full mode before deploying to production nodes