Why I Built Vox

I was building a macOS app that needed voice input. Mic permissions, audio capture, model loading, keeping things warm so the first command doesn't lag. I got it working. Then I needed the same thing in a Raycast extension. Then a browser extension. Each time I was solving the same problems — and none of them had anything to do with what I was actually trying to build.

So I pulled the runtime out into its own thing. That's Vox.

The interesting part isn't the runtime

There's already great transcription out there. Whisper, Parakeet, Deepgram — the model layer is genuinely good. What's less fun is everything around it. Getting mic permissions right on macOS. Keeping a model loaded so you don't pay cold-start on every request. Making sure two apps don't load separate copies of the same 600MB model. Measuring where time actually goes.

These are solved problems individually, but everyone solves them again from scratch. Vox just bundles them into a daemon that stays running. A Swift service handles the audio engine and model lifecycle. A TypeScript SDK connects over local WebSocket. A Bun CLI gives you tools to measure and poke at things. It ships with Parakeet running on-device via CoreML.
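To make the shape of that concrete, here's a hypothetical sketch of what a client request to the daemon might look like. The message envelope, field names, and port below are illustrative assumptions, not Vox's actual wire protocol.

```typescript
// Hypothetical request envelope a client might send to the daemon over
// the local WebSocket. All field names here are assumptions.
type TranscribeRequest = {
  type: "transcribe";
  clientId: string; // the client identity that flows into telemetry
  audio: string;    // base64-encoded audio (assumed encoding)
};

// Build the request envelope from raw audio bytes.
function buildTranscribeRequest(
  clientId: string,
  audioBytes: Uint8Array
): TranscribeRequest {
  return {
    type: "transcribe",
    clientId,
    audio: Buffer.from(audioBytes).toString("base64"),
  };
}

// Sending it is then a one-liner against the daemon's local socket
// (the port is a guess):
//   const ws = new WebSocket("ws://127.0.0.1:7770");
//   ws.addEventListener("open", () =>
//     ws.send(JSON.stringify(buildTranscribeRequest("my-app", audio))));
```

The point of the shared daemon is that every client sends small messages like this while the model stays loaded in one place.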

Why open source it

Honestly, I'd just use it myself either way. But it seemed like the kind of thing that might save someone else a few weekends. If you're building a voice feature on macOS — a dictation tool, an editor plugin, something with Raycast — the runtime part shouldn't be the hard part. The hard part should be whatever you're actually making.

The internals are intentionally visible. Warm-up is a public API. Stage timings come back with every transcription. Client identity flows through the telemetry so you can tell which integration is slow. I built it this way because I needed to debug my own stuff, and it turns out that's useful for anyone building on top of it.
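As an illustration of why per-stage timings matter, here's a small sketch of the kind of analysis they enable. The stage names and payload shape are assumptions for the example, not the SDK's real types.

```typescript
// Hypothetical shape of per-request stage timings: stage name -> ms.
type StageTimings = Record<string, number>;

// Given the timings returned with a transcription, report total latency
// and the single slowest stage -- the first thing to check when one
// integration feels sluggish.
function slowestStage(
  timings: StageTimings
): { total: number; stage: string; ms: number } {
  let total = 0;
  let stage = "";
  let ms = -Infinity;
  for (const [name, value] of Object.entries(timings)) {
    total += value;
    if (value > ms) {
      stage = name;
      ms = value;
    }
  }
  return { total, stage, ms };
}
```

For example, `slowestStage({ capture: 12, inference: 180, decode: 8 })` points straight at inference, 180 of 200 ms total.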

Where it's going

Vox ships with one model today, but the architecture supports plugging in others. There's a provider protocol — any executable that reads audio and writes text over stdin/stdout can be a transcription engine. So if Parakeet isn't right for your use case, you can bring Whisper or whatever else without touching the runtime.
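A provider along those lines could be sketched like this. I'm assuming a line-delimited JSON framing (one request per stdin line, one response per stdout line); the actual protocol may differ, and the field names are made up for the example.

```typescript
// Assumed request/response shapes for a line-delimited JSON protocol.
type ProviderRequest = { id: string; audio: string }; // base64 audio
type ProviderResponse = { id: string; text: string };

// Handle one request line: decode the audio, run the engine, and return
// the serialized response. `engine` stands in for whatever model you
// wire up (Whisper or anything else).
function handleLine(
  line: string,
  engine: (audio: Uint8Array) => string
): string {
  const req: ProviderRequest = JSON.parse(line);
  const audio = Uint8Array.from(Buffer.from(req.audio, "base64"));
  const res: ProviderResponse = { id: req.id, text: engine(audio) };
  return JSON.stringify(res);
}

// The executable itself is then just a read loop, e.g. in Bun:
//   for await (const line of console) {
//     process.stdout.write(handleLine(line, myEngine) + "\n");
//   }
```

Keeping the engine behind a plain function like this means the runtime never has to know which model is on the other side.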

I don't have grand plans for this. If people find it useful, great. If not, I'll keep using it for my own projects. The code is open source and the docs are up.

Get started

```bash
git clone https://github.com/arach/vox.git && cd vox
bun install && bun run build
vox daemon start && vox doctor
```

Three lines, working transcription. The docs cover the rest.