Provider Protocol

How external STT and TTS engines plug into Vox via JSON-RPC over stdin/stdout.

Vox separates the runtime (mic capture, sessions, routing, telemetry, playback handoff) from the speech engine. Engines are called providers. They can be external processes or built-in bridges that speak JSON-RPC over stdin/stdout.

Provider configuration plugs directly into the companion runtime’s install, preload, and route-dispatch flow. Read Runtime alongside this spec if you want the full daemon-side picture.

Providers can serve either:

  • ASR / STT: accept audio and return text
  • TTS: accept text and return audio

Built-in providers include:

  • parakeet for ASR
  • avspeech for system TTS
  • openai-tts for remote TTS
  • elevenlabs for ElevenLabs remote TTS
  • minimax for MiniMax remote TTS
  • mlx-audio, a built-in bridge to an external mlx-audio Python environment, for both ASR and TTS

Provider Config

Providers are registered in ~/.vox/providers.json:

{
  "providers": [
    {
      "id": "parakeet",
      "kind": "asr",
      "builtin": true,
      "models": ["parakeet:v3"]
    },
    {
      "id": "avspeech",
      "kind": "tts",
      "builtin": true,
      "models": ["avspeech:system"]
    },
    {
      "id": "openai-tts",
      "kind": "tts",
      "builtin": true,
      "models": ["gpt-4o-mini-tts"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "VOX_OPENAI_TTS_TIMEOUT_SECONDS": "12"
      }
    },
    {
      "id": "elevenlabs",
      "kind": "tts",
      "builtin": true,
      "models": ["eleven_multilingual_v2"],
      "env": {
        "ELEVENLABS_API_KEY": "..."
      }
    },
    {
      "id": "minimax",
      "kind": "tts",
      "builtin": true,
      "models": ["speech-2.8-hd"],
      "env": {
        "MINIMAX_API_KEY": "..."
      }
    },
    {
      "id": "mlx-audio",
      "kind": "asr",
      "builtin": true,
      "env": {
        "VOX_MLX_AUDIO_PYTHON": "/path/to/venv/bin/python",
        "VOX_MLX_AUDIO_ASR_MODELS": "mlx-community/whisper-large-v3-turbo-asr-fp16,mlx-community/Qwen3-ASR-0.6B-8bit"
      }
    },
    {
      "id": "mlx-audio",
      "kind": "tts",
      "builtin": true,
      "env": {
        "VOX_MLX_AUDIO_PYTHON": "/path/to/venv/bin/python",
        "VOX_MLX_AUDIO_TTS_MODELS": "mlx-community/Soprano-1.1-80M-bf16,mlx-community/Kokoro-82M-4bit",
        "VOX_MLX_AUDIO_TTS_DEFAULT_VOICE": "af_heart"
      }
    }
  ]
}

Field    Type                    Required  Description
id       string                  Yes       Unique identifier for this provider.
kind     "asr" | "tts"           No        Provider kind. Defaults to asr if omitted.
builtin  boolean                 No        If true, Vox uses its bundled implementation for the given id.
command  string[]                No        Executable and arguments Vox will spawn for an external provider.
models   string[]                No        Model IDs this provider serves. Optional when the provider reports models dynamically.
env      Record<string, string>  No        Extra environment variables passed to the provider process.

Notes:

  • Register ASR and TTS as separate entries even when they share the same id.
  • models is optional for external providers. When it is omitted, Vox can call the models method and route dynamically from the returned list.
  • Built-in remote TTS providers read their API keys from the provider's env block first, then from the daemon's process environment. ELEVENLABS_BASE_URL, ELEVENLABS_OUTPUT_FORMAT, and MINIMAX_BASE_URL can override vendor defaults.
  • If providers.json contains only ASR entries, Vox falls back to default TTS providers. The inverse is also true.
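
As a reading aid, the field table can be written down as a TypeScript type. This is a sketch transcribed from the table, not Vox's actual source: the names ProviderEntry, resolveKind, and findDuplicates are hypothetical, and treating the (id, kind) pair as the uniqueness key is an assumption inferred from the note that ASR and TTS entries may share an id.

```typescript
// Shape of one entry in ~/.vox/providers.json, transcribed from the field table.
type ProviderKind = "asr" | "tts";

interface ProviderEntry {
  id: string;
  kind?: ProviderKind; // documented default: "asr"
  builtin?: boolean;
  command?: string[];
  models?: string[];
  env?: Record<string, string>;
}

// Hypothetical helper: apply the documented default for `kind`.
function resolveKind(entry: ProviderEntry): ProviderKind {
  return entry.kind ?? "asr";
}

// Hypothetical helper: report entries that collide on the same (id, kind) pair.
// Assumption: kind is part of the key, because ASR and TTS entries may share an id.
function findDuplicates(entries: ProviderEntry[]): string[] {
  const seen = new Set<string>();
  const dupes: string[] = [];
  for (const entry of entries) {
    const key = `${entry.id}:${resolveKind(entry)}`;
    if (seen.has(key)) dupes.push(key);
    seen.add(key);
  }
  return dupes;
}
```

With this key, the two mlx-audio entries in the sample config do not collide, while two parakeet entries of kind asr would.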

OpenAI TTS timeout

openai-tts uses a hard wall-clock request timeout so stalled remote TTS calls do not block the caller for minutes.

  • default: 12 seconds
  • env override: VOX_OPENAI_TTS_TIMEOUT_SECONDS
  • compatibility alias: OPENAI_TTS_TIMEOUT_SECONDS
  • maximum accepted value: 30 seconds

The timeout can be set in the provider env block or the daemon process environment.
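
The rules above could be resolved roughly as follows. This is a sketch under stated assumptions, not the bundled implementation: the function name is hypothetical, the provider env block is assumed to win over the process environment (mirroring how API keys are read), the primary variable is assumed to win over the compatibility alias, values above the maximum are assumed to be clamped rather than rejected, and unparsable values are assumed to be skipped.

```typescript
const DEFAULT_TIMEOUT_SECONDS = 12;
const MAX_TIMEOUT_SECONDS = 30;

type Env = Record<string, string | undefined>;

// Hypothetical resolver. Lookup order and clamping are assumptions (see lead-in).
function resolveOpenAiTtsTimeout(providerEnv: Env, processEnv: Env): number {
  const candidates = [
    providerEnv["VOX_OPENAI_TTS_TIMEOUT_SECONDS"],
    providerEnv["OPENAI_TTS_TIMEOUT_SECONDS"],
    processEnv["VOX_OPENAI_TTS_TIMEOUT_SECONDS"],
    processEnv["OPENAI_TTS_TIMEOUT_SECONDS"],
  ];
  for (const raw of candidates) {
    if (raw === undefined || raw.trim() === "") continue;
    const seconds = Number(raw);
    // Skip values that are not positive finite numbers.
    if (Number.isFinite(seconds) && seconds > 0) {
      return Math.min(seconds, MAX_TIMEOUT_SECONDS);
    }
  }
  return DEFAULT_TIMEOUT_SECONDS;
}
```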

Protocol Methods

All communication uses newline-delimited JSON-RPC 2.0 over stdin (requests from Vox) and stdout (responses from the provider).
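
Newline-delimited framing means one complete JSON object per line, but a reader may receive those lines split across arbitrary chunks. The sketch below shows one way to buffer chunks until a full line arrives; LineDecoder and encodeRequest are hypothetical names used for illustration, not part of Vox.

```typescript
// Accumulates text chunks and returns one parsed JSON-RPC message per complete line.
class LineDecoder {
  private buffer = "";

  // Feed a chunk of text; returns every message completed by this chunk.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or "" if the chunk ended on "\n").
    this.buffer = lines.pop() ?? "";
    const messages: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue; // tolerate blank lines
      messages.push(JSON.parse(trimmed));
    }
    return messages;
  }
}

// Serialize a request as a single line. JSON.stringify without an indent
// argument never emits raw newlines, so the framing stays intact.
function encodeRequest(id: number, method: string, params?: unknown): string {
  return JSON.stringify({ jsonrpc: "2.0", id, method, params }) + "\n";
}
```

A message split across two chunks is held back until its terminating newline arrives, so callers never see partial JSON.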

models

List available models for the provider kind.

Request:

{ "jsonrpc": "2.0", "id": 1, "method": "models" }

Response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "models": [
      {
        "id": "mlx-community/whisper-large-v3-turbo-asr-fp16",
        "name": "whisper-large-v3-turbo-asr-fp16",
        "backend": "mlx-audio",
        "installed": true,
        "preloaded": false,
        "available": true
      }
    ]
  }
}

install

Download or prepare model files.

Request:

{ "jsonrpc": "2.0", "id": 2, "method": "install", "params": { "modelId": "mlx-community/Kokoro-82M-4bit" } }

The provider can emit progress notifications on stdout during installation or preload:

{ "jsonrpc": "2.0", "method": "progress", "params": { "modelId": "mlx-community/Kokoro-82M-4bit", "progress": 0.5, "status": "loading" } }

Response: a model info object matching the shape returned by models.
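
Because progress notifications and the final response share the same stdout stream, a reader has to tell them apart. In JSON-RPC 2.0 a notification carries no id, which the sketch below uses as the discriminator. The names Incoming and classify are hypothetical, and the ProgressParams shape is taken only from the example line above.

```typescript
interface ProgressParams {
  modelId: string;
  progress: number;
  status: string;
}

type Incoming =
  | { type: "progress"; params: ProgressParams }
  | { type: "response"; id: number | string; result?: unknown; error?: unknown }
  | { type: "unknown" };

// Hypothetical classifier for one decoded line from the provider's stdout.
function classify(message: unknown): Incoming {
  if (typeof message !== "object" || message === null) return { type: "unknown" };
  const msg = message as Record<string, unknown>;
  const id = msg["id"];
  const hasId = typeof id === "number" || typeof id === "string";
  // Notifications have a method but no id.
  if (!hasId && msg["method"] === "progress") {
    return { type: "progress", params: msg["params"] as ProgressParams };
  }
  // Responses echo the request id and carry either result or error.
  if (hasId && ("result" in msg || "error" in msg)) {
    return { type: "response", id: id as number | string, result: msg["result"], error: msg["error"] };
  }
  return { type: "unknown" };
}
```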

preload

Load a model into memory so subsequent requests start faster.

Request:

{ "jsonrpc": "2.0", "id": 3, "method": "preload", "params": { "modelId": "mlx-community/Soprano-1.1-80M-bf16" } }

Response: a model info object with preloaded: true.

ASR Methods

transcribe

Transcribe an audio file.

Request:

{ "jsonrpc": "2.0", "id": 4, "method": "transcribe", "params": { "modelId": "mlx-community/whisper-large-v3-turbo-asr-fp16", "path": "/tmp/audio.wav" } }

Response:

{
  "jsonrpc": "2.0",
  "id": 4,
  "result": {
    "modelId": "mlx-community/whisper-large-v3-turbo-asr-fp16",
    "text": "Hello world",
    "elapsedMs": 142,
    "metrics": {
      "inferenceMs": 130,
      "modelLoadMs": 0,
      "audioLoadMs": 5,
      "audioPrepareMs": 2,
      "fileCheckMs": 1,
      "modelCheckMs": 1,
      "totalMs": 142
    },
    "words": [
      { "word": "Hello", "start": 0.12, "end": 0.44, "confidence": 0.99 },
      { "word": "world", "start": 0.45, "end": 0.71, "confidence": 0.98 }
    ]
  }
}

TTS Methods

voices

List available voices for a model. If modelId is omitted, Vox may call voices across multiple models and merge the results.

Request:

{ "jsonrpc": "2.0", "id": 5, "method": "voices", "params": { "modelId": "mlx-community/Kokoro-82M-4bit" } }

Response:

{
  "jsonrpc": "2.0",
  "id": 5,
  "result": {
    "voices": [
      {
        "id": "af_heart",
        "name": "af_heart",
        "language": "en-US",
        "backend": "mlx-audio",
        "modelId": "mlx-community/Kokoro-82M-4bit",
        "available": true,
        "default": true
      }
    ]
  }
}
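
When voices is called across several models, the per-model lists have to be combined. How Vox merges them is not specified here; the sketch below assumes de-duplication on the (modelId, id) pair with the first occurrence winning. The Voice interface mirrors the response above, with language treated as optional because the minimal provider example omits it, and mergeVoices is a hypothetical name.

```typescript
interface Voice {
  id: string;
  name: string;
  language?: string;
  backend: string;
  modelId: string;
  available: boolean;
  default?: boolean;
}

// Hypothetical merge: keep the first voice seen for each (modelId, id) pair.
function mergeVoices(perModel: Voice[][]): Voice[] {
  const merged = new Map<string, Voice>();
  for (const voices of perModel) {
    for (const voice of voices) {
      const key = `${voice.modelId}/${voice.id}`;
      if (!merged.has(key)) merged.set(key, voice);
    }
  }
  return Array.from(merged.values());
}
```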

synthesize

Generate audio from text.

Request:

{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "synthesize",
  "params": {
    "modelId": "mlx-community/Soprano-1.1-80M-bf16",
    "input": "Hello from Vox",
    "voiceId": "af_heart",
    "format": "wav",
    "speed": 1.0
  }
}

Response:

{
  "jsonrpc": "2.0",
  "id": 6,
  "result": {
    "modelId": "mlx-community/Soprano-1.1-80M-bf16",
    "voiceId": "af_heart",
    "format": "wav",
    "contentType": "audio/wav",
    "audioBase64": "<base64 wav data>",
    "elapsedMs": 418,
    "metrics": {
      "audioDurationMs": 1024,
      "characterCount": 14,
      "modelCheckMs": 0,
      "modelLoadMs": 0,
      "voiceResolveMs": 1,
      "synthesisMs": 363,
      "totalMs": 418
    }
  }
}

Metrics Contract

Providers must return stage timings in the metrics object of every transcribe or synthesize response. These feed into Vox telemetry tagged with modelId, route, and, for TTS, voiceId.

ASR metrics

Required fields:

Field        Type    Description
inferenceMs  number  Time spent running the model.
totalMs      number  Wall-clock time for the entire request.

Optional but recommended:

Field            Type    Description
modelLoadMs      number  Time loading the model (0 if already preloaded).
audioLoadMs      number  Time reading the audio file from disk.
audioPrepareMs   number  Time resampling or converting the audio.
fileCheckMs      number  Time validating the audio file exists and is readable.
modelCheckMs     number  Time checking the model is installed and ready.
audioDurationMs  number  Duration of the input audio.

TTS metrics

Required fields:

Field    Type    Description
totalMs  number  Wall-clock time for the entire request.

Optional but recommended:

Field            Type    Description
synthesisMs      number  Time spent generating audio once the model is running.
inferenceMs      number  Accepted as a fallback alias for synthesisMs.
modelLoadMs      number  Time loading the model (0 if already preloaded).
modelCheckMs     number  Time checking the model is installed and ready.
voiceResolveMs   number  Time resolving the requested voice.
audioDurationMs  number  Duration of the synthesized audio.
outputBytes      number  Number of encoded output bytes.
characterCount   number  Length of the input text.
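
A provider author may want to check a metrics object against this contract before sending it. The sketch below encodes only the required fields listed above plus the synthesisMs fallback; missingMetrics and ttsSynthesisMs are hypothetical helper names, and how Vox itself reacts to a missing field is not covered here.

```typescript
type Kind = "asr" | "tts";
type Metrics = Record<string, unknown>;

// Required fields per provider kind, as listed in the contract above.
const REQUIRED: Record<Kind, string[]> = {
  asr: ["inferenceMs", "totalMs"],
  tts: ["totalMs"],
};

// Returns the names of required fields that are absent or not finite numbers.
function missingMetrics(kind: Kind, metrics: Metrics): string[] {
  return REQUIRED[kind].filter((field) => {
    const value = metrics[field];
    return typeof value !== "number" || !Number.isFinite(value);
  });
}

// Reads the TTS synthesis time, accepting inferenceMs as the fallback alias.
function ttsSynthesisMs(metrics: Metrics): number | undefined {
  const primary = metrics["synthesisMs"];
  if (typeof primary === "number") return primary;
  const fallback = metrics["inferenceMs"];
  return typeof fallback === "number" ? fallback : undefined;
}
```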

What Vox handles

Providers only deal with models plus transcription or synthesis. The runtime handles everything else:

  • Mic permissions and capture — ASR providers receive a WAV file path
  • Audio format normalization for ASR input
  • Playback handoff — TTS providers return audio bytes and Vox hands them back to the caller
  • Session lifecycle — start, stop, cancel coordinated by the daemon
  • Warm-up scheduling and state
  • Client identity routing (clientId)
  • Performance telemetry collection
  • Provider execution capacity and backpressure — requests are not globally serialized by default

Provider execution model

Provider calls are asynchronous work items. Vox's correctness must not depend on only one provider request being in flight at a time.

TTS providers, especially remote API-backed providers, should support concurrent synthesize calls. A client may submit many independent utterances and await their results independently. Playback ordering is a caller concern, not a provider-execution constraint.

ASR has more physical-resource constraints because microphone capture may involve one input device, permissions, and ownership. That constraint belongs to capture/session coordination, not to the provider protocol itself. File transcription and provider inference can still be concurrent when the selected backend has capacity.

Capacity should be explicit:

  • providers may advertise or be configured with max concurrency
  • Vox may apply per-provider or per-model backpressure when capacity is exhausted
  • backpressure should return a typed busy/capacity error or queue metadata, not silently impose a global mutex
  • telemetry should distinguish provider execution time from queue/wait time when queueing exists
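
One way to make capacity explicit is a per-provider gate that rejects with a typed error once its limit is reached, instead of serializing everything behind a global lock. The sketch below is illustrative only: CapacityGate, ProviderBusyError, and the "provider_busy" code are hypothetical names, and a real implementation might queue instead of rejecting.

```typescript
// Typed error returned when a provider has no free execution slot.
class ProviderBusyError extends Error {
  readonly code = "provider_busy"; // hypothetical error code
  readonly providerId: string;
  readonly maxConcurrency: number;

  constructor(providerId: string, maxConcurrency: number) {
    super(`provider ${providerId} is at capacity (${maxConcurrency} in flight)`);
    this.name = "ProviderBusyError";
    this.providerId = providerId;
    this.maxConcurrency = maxConcurrency;
  }
}

// Per-provider concurrency limit; each provider gets its own gate.
class CapacityGate {
  private inFlight = 0;
  private readonly providerId: string;
  private readonly maxConcurrency: number;

  constructor(providerId: string, maxConcurrency: number) {
    this.providerId = providerId;
    this.maxConcurrency = maxConcurrency;
  }

  // Claim a slot if one is free; returns false when capacity is exhausted.
  tryAcquire(): boolean {
    if (this.inFlight >= this.maxConcurrency) return false;
    this.inFlight += 1;
    return true;
  }

  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }

  // Run work inside a slot, rejecting with a typed error when none is free.
  async run<T>(work: () => Promise<T>): Promise<T> {
    if (!this.tryAcquire()) {
      throw new ProviderBusyError(this.providerId, this.maxConcurrency);
    }
    try {
      return await work();
    } finally {
      this.release();
    }
  }
}
```

Because each gate is independent, a saturated local ASR model would not delay concurrent synthesize calls to a remote TTS provider.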

Writing a provider

A provider is any executable that reads newline-delimited JSON-RPC from stdin and writes responses to stdout. Minimal TypeScript example:

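// minimal-provider.ts
// Run as an ES module (for example with `bun run`), since it uses top-level await.
import { createInterface } from "node:readline";

// Placeholder engines: replace these with real transcription and synthesis.
async function myTranscribe(path: string): Promise<string> {
  return `transcript of ${path}`;
}

async function mySynthesize(input: string): Promise<string> {
  // Must return base64-encoded audio. These are placeholder bytes, not real WAV data.
  return Buffer.from(input).toString("base64");
}

// Write exactly one JSON-RPC message per line on stdout.
function send(message: object) {
  process.stdout.write(JSON.stringify(message) + "\n");
}

function respond(id: number | string, result: unknown) {
  send({ jsonrpc: "2.0", id, result });
}

function respondError(id: number | string | null, code: number, message: string) {
  send({ jsonrpc: "2.0", id, error: { code, message } });
}

const rl = createInterface({ input: process.stdin });

for await (const line of rl) {
  if (line.trim() === "") continue; // ignore blank lines

  let req: any;
  try {
    req = JSON.parse(line);
  } catch {
    respondError(null, -32700, "Parse error"); // standard JSON-RPC 2.0 code
    continue;
  }

  const started = Date.now();
  try {
    if (req.method === "models") {
      respond(req.id, {
        models: [
          {
            id: "my-model:v1",
            name: "My Model",
            backend: "custom",
            installed: true,
            preloaded: false,
            available: true,
          },
        ],
      });
    } else if (req.method === "transcribe") {
      const text = await myTranscribe(req.params.path);
      const totalMs = Date.now() - started;
      respond(req.id, {
        modelId: req.params.modelId,
        text,
        elapsedMs: totalMs,
        // inferenceMs and totalMs are the required ASR metrics.
        metrics: { inferenceMs: totalMs, totalMs },
      });
    } else if (req.method === "voices") {
      respond(req.id, {
        voices: [
          {
            id: "default",
            name: "Default",
            backend: "custom-tts",
            modelId: req.params?.modelId ?? "my-tts:v1",
            available: true,
            default: true,
          },
        ],
      });
    } else if (req.method === "synthesize") {
      const audioBase64 = await mySynthesize(req.params.input);
      const totalMs = Date.now() - started;
      respond(req.id, {
        modelId: req.params.modelId,
        voiceId: req.params.voiceId ?? "default",
        format: "wav",
        contentType: "audio/wav",
        audioBase64,
        elapsedMs: totalMs,
        // totalMs is the required TTS metric; synthesisMs is recommended.
        metrics: { synthesisMs: totalMs, totalMs },
      });
    } else if (req.id !== undefined) {
      // Answer unknown methods so the caller is never left waiting.
      respondError(req.id, -32601, `Method not found: ${req.method}`);
    }
  } catch (err) {
    // Report handler failures instead of crashing the provider process.
    if (req.id !== undefined) {
      respondError(req.id, -32603, err instanceof Error ? err.message : String(err));
    }
  }
}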

Register it in ~/.vox/providers.json:

{
  "providers": [
    {
      "id": "my-provider",
      "kind": "asr",
      "command": ["bun", "run", "minimal-provider.ts"],
      "models": ["my-model:v1"]
    },
    {
      "id": "my-tts",
      "kind": "tts",
      "command": ["bun", "run", "minimal-provider.ts"],
      "models": ["my-tts:v1"]
    }
  ]
}

Then select it via CLI or SDK by specifying the target model ID.

Provider lifecycle

Vox spawns the provider process on first use. It stays alive for the daemon’s lifetime. If it crashes, Vox restarts it on the next request.

Providers should be stateless between requests. The provider process can keep model weights in memory, but Vox assumes nothing about that state — a crash and restart must not break anything.
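
The spawn-on-first-use and restart-on-next-request behavior can be pictured as a lazy handle around the process. This is a sketch of the idea, not Vox's daemon code: LazyProvider and ProviderProcess are hypothetical names, and the spawn function is injected so the example stays independent of any particular process API.

```typescript
// Minimal view of a running provider; `alive` flips to false when it exits.
interface ProviderProcess {
  alive: boolean;
}

// Spawns the provider on first use and again on the next request after a crash.
class LazyProvider<P extends ProviderProcess> {
  spawnCount = 0;
  private current: P | undefined;
  private readonly spawn: () => P;

  constructor(spawn: () => P) {
    this.spawn = spawn;
  }

  // Return a live process, starting one only when none is running.
  acquire(): P {
    if (this.current === undefined || !this.current.alive) {
      this.current = this.spawn();
      this.spawnCount += 1;
    }
    return this.current;
  }
}
```

In a real daemon the injected function would wrap something like child_process.spawn and set alive to false from the child's exit event. Nothing is restarted eagerly, so a crashed provider costs nothing until the next request arrives.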
