Provider Protocol

How external STT and TTS engines plug into Vox via JSON-RPC over stdin/stdout.

Vox separates the runtime (mic capture, sessions, routing, telemetry, playback handoff) from the speech engine. Engines are called providers. They can be external processes or built-in bridges that speak JSON-RPC over stdin/stdout.

Provider configuration plugs directly into the companion runtime’s install, preload, and route-dispatch flow. Read Runtime alongside this spec if you want the full daemon-side picture.

Providers can serve either:

  • ASR / STT: accept audio and return text
  • TTS: accept text and return audio

Built-in providers include:

  • parakeet for ASR
  • avspeech for system TTS
  • openai-tts for remote TTS
  • mlx-audio, a built-in bridge to the external mlx-audio engine, for both ASR and TTS

Provider Config

Providers are registered in ~/.vox/providers.json:

{
  "providers": [
    {
      "id": "parakeet",
      "kind": "asr",
      "builtin": true,
      "models": ["parakeet:v3"]
    },
    {
      "id": "avspeech",
      "kind": "tts",
      "builtin": true,
      "models": ["avspeech:system"]
    },
    {
      "id": "mlx-audio",
      "kind": "asr",
      "builtin": true,
      "env": {
        "VOX_MLX_AUDIO_PYTHON": "/path/to/venv/bin/python",
        "VOX_MLX_AUDIO_ASR_MODELS": "mlx-community/whisper-large-v3-turbo-asr-fp16,mlx-community/Qwen3-ASR-0.6B-8bit"
      }
    },
    {
      "id": "mlx-audio",
      "kind": "tts",
      "builtin": true,
      "env": {
        "VOX_MLX_AUDIO_PYTHON": "/path/to/venv/bin/python",
        "VOX_MLX_AUDIO_TTS_MODELS": "mlx-community/Soprano-1.1-80M-bf16,mlx-community/Kokoro-82M-4bit",
        "VOX_MLX_AUDIO_TTS_DEFAULT_VOICE": "af_heart"
      }
    }
  ]
}

Field    Type                    Required  Description
id       string                  Yes       Unique identifier for this provider.
kind     "asr" | "tts"           No        Provider kind. Defaults to asr if omitted.
builtin  boolean                 No        If true, Vox uses its bundled implementation for the given id.
command  string[]                No        Executable and arguments Vox will spawn for an external provider.
models   string[]                No        Model IDs this provider serves. Optional when the provider reports models dynamically.
env      Record<string, string>  No        Extra environment variables passed to the provider process.

Notes:

  • Register ASR and TTS as separate entries even when they share the same id.
  • models is optional for external providers. When it is omitted, Vox can call models() and route dynamically from the returned list.
  • If providers.json contains only ASR entries, Vox falls back to the default TTS providers. Likewise, if it contains only TTS entries, Vox falls back to the default ASR providers.
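
The field table and notes above can be sketched as a TypeScript type plus a small normalizer. This is an illustration only: normalizeProviders is a hypothetical helper, not part of Vox, and the two checks it performs (one entry per id/kind pair, and every non-builtin entry needing a command) are assumptions inferred from the table rather than documented validation rules.

```typescript
// Shape of one entry in providers.json, following the field table above.
type ProviderKind = "asr" | "tts";

interface ProviderEntry {
  id: string;
  kind?: ProviderKind;
  builtin?: boolean;
  command?: string[];
  models?: string[];
  env?: Record<string, string>;
}

interface NormalizedEntry extends ProviderEntry {
  kind: ProviderKind;
}

// Hypothetical helper: applies the documented default for kind and
// collects problems based on ASSUMED rules (see the lead-in).
function normalizeProviders(entries: ProviderEntry[]): {
  entries: NormalizedEntry[];
  problems: string[];
} {
  const problems: string[] = [];
  const seen = new Set<string>();

  const normalized = entries.map((entry) => {
    const kind: ProviderKind = entry.kind ?? "asr"; // documented default
    const key = `${entry.id}/${kind}`;

    // Assumption: the same id may appear once per kind, but not twice for one kind.
    if (seen.has(key)) problems.push(`${key}: duplicate entry`);
    seen.add(key);

    // Assumption: an entry that is not builtin needs a command to spawn.
    if (!entry.builtin && (!entry.command || entry.command.length === 0)) {
      problems.push(`${key}: neither builtin nor command is set`);
    }
    return { ...entry, kind };
  });

  return { entries: normalized, problems };
}
```

Running it over a parsed providers.json before starting the daemon would surface a missing command early instead of at first use.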

Protocol Methods

All communication uses newline-delimited JSON-RPC 2.0 over stdin (requests from Vox) and stdout (responses from the provider).
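
As a rough sketch of what newline-delimited framing means in practice: each message is one JSON document followed by a single "\n", and a reader has to buffer partial chunks until a full line arrives. LineDecoder and encodeMessage are hypothetical names used for illustration, not part of any Vox SDK.

```typescript
// Serialize one message as a single line. JSON.stringify without an indent
// argument never emits raw newlines, so the trailing "\n" is the only one.
function encodeMessage(message: object): string {
  return JSON.stringify(message) + "\n";
}

// Accumulates arbitrary chunks and returns every complete message seen so far.
class LineDecoder {
  private buffer = "";

  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or "" if the chunk ended on "\n").
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line));
  }
}
```

In a provider, readline already does this buffering for stdin. A decoder like this is mostly useful on the side that reads a child process's stdout in raw chunks.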

models

List available models for the provider kind.

Request:

{ "jsonrpc": "2.0", "id": 1, "method": "models" }

Response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "models": [
      {
        "id": "mlx-community/whisper-large-v3-turbo-asr-fp16",
        "name": "whisper-large-v3-turbo-asr-fp16",
        "backend": "mlx-audio",
        "installed": true,
        "preloaded": false,
        "available": true
      }
    ]
  }
}

install

Download or prepare model files.

Request:

{ "jsonrpc": "2.0", "id": 2, "method": "install", "params": { "modelId": "mlx-community/Kokoro-82M-4bit" } }

The provider can emit progress notifications on stdout during installation or preload:

{ "jsonrpc": "2.0", "method": "progress", "params": { "modelId": "mlx-community/Kokoro-82M-4bit", "progress": 0.5, "status": "loading" } }

Response: a model info object matching the shape returned by models.
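
Because progress notifications and the final response share stdout, a reader has to tell them apart. In JSON-RPC 2.0 a notification carries a method but no id, while a response carries the id of the request it answers. The sketch below shows one way to classify incoming lines; classify and the Incoming type are hypothetical names, and the ProgressParams shape is copied from the example above.

```typescript
type RpcId = number | string;

interface ProgressParams {
  modelId: string;
  progress: number;
  status: string;
}

type Incoming =
  | { kind: "progress"; params: ProgressParams }
  | { kind: "response"; id: RpcId; result: unknown }
  | { kind: "unknown"; raw: unknown };

// Classify one already-parsed message read from the provider's stdout.
function classify(raw: unknown): Incoming {
  if (typeof raw !== "object" || raw === null) return { kind: "unknown", raw };
  const msg = raw as Record<string, unknown>;
  const hasId = typeof msg.id === "number" || typeof msg.id === "string";

  // Notification: a method and no id.
  if (!hasId && msg.method === "progress") {
    return { kind: "progress", params: msg.params as ProgressParams };
  }
  // Response: an id plus a result member.
  if (hasId && "result" in msg) {
    return { kind: "response", id: msg.id as RpcId, result: msg.result };
  }
  return { kind: "unknown", raw };
}
```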

preload

Load a model into memory so subsequent requests start faster.

Request:

{ "jsonrpc": "2.0", "id": 3, "method": "preload", "params": { "modelId": "mlx-community/Soprano-1.1-80M-bf16" } }

Response: a model info object with preloaded: true.

ASR Methods

transcribe

Transcribe an audio file.

Request:

{ "jsonrpc": "2.0", "id": 4, "method": "transcribe", "params": { "modelId": "mlx-community/whisper-large-v3-turbo-asr-fp16", "path": "/tmp/audio.wav" } }

Response:

{
  "jsonrpc": "2.0",
  "id": 4,
  "result": {
    "modelId": "mlx-community/whisper-large-v3-turbo-asr-fp16",
    "text": "Hello world",
    "elapsedMs": 142,
    "metrics": {
      "inferenceMs": 130,
      "modelLoadMs": 0,
      "audioLoadMs": 5,
      "audioPrepareMs": 2,
      "fileCheckMs": 1,
      "modelCheckMs": 1,
      "totalMs": 142
    },
    "words": [
      { "word": "Hello", "start": 0.12, "end": 0.44, "confidence": 0.99 },
      { "word": "world", "start": 0.45, "end": 0.71, "confidence": 0.98 }
    ]
  }
}

TTS Methods

voices

List available voices for a model. If modelId is omitted, Vox may call voices across multiple models and merge the results.

Request:

{ "jsonrpc": "2.0", "id": 5, "method": "voices", "params": { "modelId": "mlx-community/Kokoro-82M-4bit" } }

Response:

{
  "jsonrpc": "2.0",
  "id": 5,
  "result": {
    "voices": [
      {
        "id": "af_heart",
        "name": "af_heart",
        "language": "en-US",
        "backend": "mlx-audio",
        "modelId": "mlx-community/Kokoro-82M-4bit",
        "available": true,
        "default": true
      }
    ]
  }
}

synthesize

Generate audio from text.

Request:

{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "synthesize",
  "params": {
    "modelId": "mlx-community/Soprano-1.1-80M-bf16",
    "input": "Hello from Vox",
    "voiceId": "af_heart",
    "format": "wav",
    "speed": 1.0
  }
}

Response:

{
  "jsonrpc": "2.0",
  "id": 6,
  "result": {
    "modelId": "mlx-community/Soprano-1.1-80M-bf16",
    "voiceId": "af_heart",
    "format": "wav",
    "contentType": "audio/wav",
    "audioBase64": "<base64 wav data>",
    "elapsedMs": 418,
    "metrics": {
      "audioDurationMs": 1024,
      "characterCount": 14,
      "modelCheckMs": 0,
      "modelLoadMs": 0,
      "voiceResolveMs": 1,
      "synthesisMs": 363,
      "totalMs": 418
    }
  }
}
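
To illustrate how a caller might consume audioBase64, here is a small sketch that decodes the payload into bytes and runs a cheap sanity check for WAV output. decodeAudio and looksLikeWav are hypothetical helpers. The check only recognizes a plain RIFF/WAVE header and says nothing about how Vox itself validates audio.

```typescript
// Fields taken from the synthesize response shown above.
interface SynthesizeAudio {
  contentType: string;
  audioBase64: string;
}

// Decode the base64 payload into raw bytes. atob is a global in browsers,
// Bun, and Node 16+; Buffer.from(x, "base64") is a Node-only alternative.
function decodeAudio(result: SynthesizeAudio): Uint8Array {
  const binary = atob(result.audioBase64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// A plain WAV file starts with "RIFF" and has "WAVE" at byte offset 8.
function looksLikeWav(bytes: Uint8Array): boolean {
  if (bytes.length < 12) return false;
  const tag = (offset: number): string =>
    String.fromCharCode(
      bytes[offset],
      bytes[offset + 1],
      bytes[offset + 2],
      bytes[offset + 3],
    );
  return tag(0) === "RIFF" && tag(8) === "WAVE";
}
```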

Metrics Contract

Providers must return stage timings in the metrics object of every transcribe or synthesize response. These feed into Vox telemetry tagged with modelId, route, and, for TTS, voiceId.

ASR metrics

Required fields:

Field        Type    Description
inferenceMs  number  Time spent running the model.
totalMs      number  Wall-clock time for the entire request.

Optional but recommended:

Field            Type    Description
modelLoadMs      number  Time loading the model (0 if already preloaded).
audioLoadMs      number  Time reading the audio file from disk.
audioPrepareMs   number  Time resampling or converting the audio.
fileCheckMs      number  Time validating the audio file exists and is readable.
modelCheckMs     number  Time checking the model is installed and ready.
audioDurationMs  number  Duration of the input audio.

TTS metrics

Required fields:

Field    Type    Description
totalMs  number  Wall-clock time for the entire request.

Optional but recommended:

Field            Type    Description
synthesisMs      number  Time spent generating audio once the model is running.
inferenceMs      number  Accepted as a fallback alias for synthesisMs.
modelLoadMs      number  Time loading the model (0 if already preloaded).
modelCheckMs     number  Time checking the model is installed and ready.
voiceResolveMs   number  Time resolving the requested voice.
audioDurationMs  number  Duration of the synthesized audio.
outputBytes      number  Number of encoded output bytes.
characterCount   number  Length of the input text.
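
A provider author may want to check a metrics object against this contract before sending it. The sketch below encodes the required fields listed above and the synthesisMs / inferenceMs fallback for TTS. missingMetrics and synthesisTime are hypothetical helpers, and how Vox reacts to a missing required field is not specified here.

```typescript
type Metrics = Record<string, unknown>;

// Required fields per provider kind, as listed in the tables above.
const REQUIRED_METRICS: Record<"asr" | "tts", string[]> = {
  asr: ["inferenceMs", "totalMs"],
  tts: ["totalMs"],
};

// Returns the names of required fields that are absent or not finite numbers.
function missingMetrics(kind: "asr" | "tts", metrics: Metrics): string[] {
  return REQUIRED_METRICS[kind].filter((field) => {
    const value = metrics[field];
    return !(typeof value === "number" && isFinite(value));
  });
}

// TTS: prefer synthesisMs, fall back to inferenceMs as the documented alias.
function synthesisTime(metrics: Metrics): number | undefined {
  const primary = metrics["synthesisMs"];
  if (typeof primary === "number") return primary;
  const alias = metrics["inferenceMs"];
  return typeof alias === "number" ? alias : undefined;
}
```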

What Vox handles

Providers only deal with models plus transcription or synthesis. The runtime handles everything else:

  • Mic permissions and capture — ASR providers receive a WAV file path
  • Audio format normalization for ASR input
  • Playback handoff — TTS providers return audio bytes and Vox hands them back to the caller
  • Session lifecycle — start, stop, cancel coordinated by the daemon
  • Warm-up scheduling and state
  • Client identity routing (clientId)
  • Performance telemetry collection
  • Multi-client serialization — providers see one request at a time

Writing a provider

A provider is any executable that reads newline-delimited JSON-RPC from stdin and writes responses to stdout. Minimal TypeScript example:

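// minimal-provider.ts
import { createInterface } from "readline";

// myTranscribe and mySynthesize are placeholders for your engine's own calls.
const rl = createInterface({ input: process.stdin });

for await (const line of rl) {
  if (!line.trim()) continue; // skip blank lines instead of failing in JSON.parse
  const req = JSON.parse(line);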

  if (req.method === "models") {
    respond(req.id, {
      models: [
        {
          id: "my-model:v1",
          name: "My Model",
          backend: "custom",
          installed: true,
          preloaded: false,
          available: true,
        },
      ],
    });
  }

  if (req.method === "transcribe") {
    const text = await myTranscribe(req.params.path);
    respond(req.id, {
      modelId: req.params.modelId,
      text,
      elapsedMs: 100,
      metrics: { inferenceMs: 95, totalMs: 100 },
    });
  }

  if (req.method === "voices") {
    respond(req.id, {
      voices: [
        {
          id: "default",
          name: "Default",
          backend: "custom-tts",
          modelId: req.params?.modelId ?? "my-tts:v1",
          available: true,
          default: true,
        },
      ],
    });
  }

  if (req.method === "synthesize") {
    const audioBase64 = await mySynthesize(req.params.input);
    respond(req.id, {
      modelId: req.params.modelId,
      voiceId: req.params.voiceId ?? "default",
      format: "wav",
      contentType: "audio/wav",
      audioBase64,
      elapsedMs: 120,
      metrics: { synthesisMs: 110, totalMs: 120 },
    });
  }
}

// JSON-RPC 2.0 allows both numeric and string ids, so accept either and echo it back.
function respond(id: number | string, result: unknown) {
  process.stdout.write(JSON.stringify({ jsonrpc: "2.0", id, result }) + "\n");
}

Register it in ~/.vox/providers.json:

{
  "providers": [
    {
      "id": "my-provider",
      "kind": "asr",
      "command": ["bun", "run", "minimal-provider.ts"],
      "models": ["my-model:v1"]
    },
    {
      "id": "my-tts",
      "kind": "tts",
      "command": ["bun", "run", "minimal-provider.ts"],
      "models": ["my-tts:v1"]
    }
  ]
}

Then select it via CLI or SDK by specifying the target model ID.
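
The minimal example silently ignores methods it does not know (install and preload, for instance) and crashes on malformed input. JSON-RPC 2.0 defines an error object and standard codes for these cases. The sketch below shows a dispatcher that always answers with exactly one line. dispatch, errorLine, and Handler are hypothetical names, and how Vox surfaces such errors is not covered by this spec, so treat that part as an assumption.

```typescript
// Standard JSON-RPC 2.0 error codes.
const PARSE_ERROR = -32700;
const INVALID_REQUEST = -32600;
const METHOD_NOT_FOUND = -32601;
const INTERNAL_ERROR = -32603;

type RpcId = number | string | null;
type Handler = (params: unknown) => unknown;

function errorLine(id: RpcId, code: number, message: string): string {
  return JSON.stringify({ jsonrpc: "2.0", id, error: { code, message } }) + "\n";
}

// Turns one input line into one output line. Handlers are synchronous here
// to keep the sketch short; a real provider would likely await them.
function dispatch(line: string, handlers: Map<string, Handler>): string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return errorLine(null, PARSE_ERROR, "Parse error");
  }
  if (typeof parsed !== "object" || parsed === null) {
    return errorLine(null, INVALID_REQUEST, "Invalid Request");
  }
  const req = parsed as { id?: unknown; method?: unknown; params?: unknown };

  // Every request shown in this spec carries an id, so a missing id is
  // treated as invalid here rather than as a notification.
  if (typeof req.id !== "number" && typeof req.id !== "string") {
    return errorLine(null, INVALID_REQUEST, "Invalid Request");
  }
  const id: number | string = req.id;

  const handler =
    typeof req.method === "string" ? handlers.get(req.method) : undefined;
  if (!handler) {
    return errorLine(id, METHOD_NOT_FOUND, "Method not found");
  }
  try {
    return JSON.stringify({ jsonrpc: "2.0", id, result: handler(req.params) }) + "\n";
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return errorLine(id, INTERNAL_ERROR, message);
  }
}
```

Using a Map instead of a plain object for the handler table avoids accidentally matching inherited keys such as "toString".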

Provider lifecycle

Vox spawns the provider process on first use. It stays alive for the daemon’s lifetime. If it crashes, Vox restarts it on the next request.

Providers should be stateless between requests. The provider process can keep model weights in memory, but Vox assumes nothing about that state — a crash and restart must not break anything.
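
One way to keep weights in memory while staying safe across restarts is to treat the in-memory model as a cache that is rebuilt on demand, so a freshly restarted process behaves the same as a warm one, only slower on the first request. A minimal sketch, assuming a hypothetical load function supplied by the engine:

```typescript
// Lazy in-memory model cache. Nothing here survives a crash, and nothing
// needs to: the first request after a restart simply loads again.
function createModelCache<M>(load: (modelId: string) => M) {
  const loaded = new Map<string, M>();

  return {
    // Used by transcribe/synthesize and by preload alike.
    get(modelId: string): M {
      if (loaded.has(modelId)) return loaded.get(modelId) as M;
      const model = load(modelId);
      loaded.set(modelId, model);
      return model;
    },
    // Could back the preloaded flag in a model info object.
    isPreloaded(modelId: string): boolean {
      return loaded.has(modelId);
    },
  };
}
```

With this shape, preload is just a call to get whose result is discarded. Real model loading is usually asynchronous, in which case the map would hold promises rather than loaded models.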
