Voice Mode for Molly

Phase Б.3 — release 0.7.7. Talk to Molly hands-free. Push-to-talk in the chat panel, the Cmd+Space hotkey for instant capture, and an optional on-prem speech engine for nodes that want zero-cloud voice.

What you get

Push-to-talk button in the Molly chat panel, with a state-coloured ring (red while listening, blue while transcribing, green when the transcript is ready to send).
Cmd+Space hotkey (configurable) toggles capture without moving your hands off the keyboard.
Browser-side STT/TTS by default — uses the Web Speech API in Chrome / Edge / Safari. Free, zero install, zero server resources.
Optional on-prem engines — install whisper.cpp (≈75 MB, English) and Piper (≈60 MB, en_US-amy-medium) for fully local-only voice. Lazy download; engines are NOT installed unless the user opts in.
Mobile path — when a paired phone is connected through the §0.3 pair tunnel, voice frames travel as opaque AEAD frames and the OS-side transcription engine returns the text inline.

Privacy

No always-listening / wake-word. Push-to-talk only. Browsers show their normal mic indicator while capture is active.
Recordings are dropped after transcription. The OS processes the WAV bytes in memory and never persists them.
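Browser STT is performed by the browser's own Web Speech API implementation, not by the OS. Some browsers (Chrome, for example) send audio to the vendor's speech service for recognition, so speech is not guaranteed to stay on the device on the default path; pick the native engine when that matters. Native engines run locally on the OS host. The OS does not integrate any cloud STT service in this release.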
Mobile path is end-to-end encrypted through the existing X25519 + XChaCha20-Poly1305 pair tunnel; the relay only sees ciphertext.

Settings

Settings → Molly → Voice exposes:

STT engine — Auto (default), Browser (Web Speech API), or Native (whisper.cpp).
TTS engine — Off, Browser (SpeechSynthesis), or Piper (on-prem).
PTT hotkey — defaults to Cmd+Space. Edit the spec inline; modifiers are Cmd, Ctrl, Alt, Shift, joined with +.
Input device — picker populated from navigator.mediaDevices.enumerateDevices().
Install Whisper / Install Piper buttons. 🔒 Pro+ — see Orbit Pro.
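
The hotkey spec format described above (modifiers Cmd, Ctrl, Alt, Shift joined with +) could be parsed and matched along these lines. This is a minimal sketch, not Molly's actual parser: the names `parseHotkey`, `matchesHotkey`, `Hotkey` and `KeyLike` are hypothetical, and it assumes Cmd maps to the DOM `metaKey` flag and that a spec names exactly one non-modifier key.

```typescript
// Hypothetical helpers illustrating the "Cmd+Space" spec format.
type Modifier = "Cmd" | "Ctrl" | "Alt" | "Shift";

interface Hotkey {
  modifiers: Set<Modifier>;
  key: string; // normalised to lower case, e.g. "space" or "m"
}

// The subset of DOM KeyboardEvent fields needed for matching.
interface KeyLike {
  key: string;
  metaKey: boolean;
  ctrlKey: boolean;
  altKey: boolean;
  shiftKey: boolean;
}

const MODIFIERS: readonly Modifier[] = ["Cmd", "Ctrl", "Alt", "Shift"];

function parseHotkey(spec: string): Hotkey {
  const parts = spec.split("+").map((p) => p.trim()).filter((p) => p.length > 0);
  const modifiers = new Set<Modifier>();
  let key: string | null = null;
  for (const part of parts) {
    const mod = MODIFIERS.find((m) => m.toLowerCase() === part.toLowerCase());
    if (mod !== undefined) {
      modifiers.add(mod);
    } else if (key === null) {
      key = part.toLowerCase();
    } else {
      throw new Error(`hotkey "${spec}" names more than one non-modifier key`);
    }
  }
  if (key === null) {
    throw new Error(`hotkey "${spec}" has no non-modifier key`);
  }
  return { modifiers, key };
}

function matchesHotkey(hotkey: Hotkey, ev: KeyLike): boolean {
  // KeyboardEvent.key is " " for the space bar; map it to the spec name.
  const pressed = ev.key === " " ? "space" : ev.key.toLowerCase();
  return (
    pressed === hotkey.key &&
    ev.metaKey === hotkey.modifiers.has("Cmd") && // assumption: Cmd is the meta key
    ev.ctrlKey === hotkey.modifiers.has("Ctrl") &&
    ev.altKey === hotkey.modifiers.has("Alt") &&
    ev.shiftKey === hotkey.modifiers.has("Shift")
  );
}
```

In a browser, `matchesHotkey(parseHotkey("Ctrl+Shift+M"), event)` could be called from a `keydown` listener, since a DOM `KeyboardEvent` already has every field of `KeyLike`. Requiring an exact modifier match keeps Ctrl+Shift+M from firing on Ctrl+M.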

How it works

Frontend state machine

useVoice runs the push-to-talk machine: idle → listening → captured → transcribing → ready_to_send. It owns the MediaRecorder, an AnalyserNode for the level meter, and either a SpeechRecognition instance (browser path) or a fetch() to /api/voice/transcribe (native path).
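
The push-to-talk sequence above can be sketched as a small transition table. Only the five state names and the ring colours come from this page; the event names (`start`, `stop`, `transcribe`, `transcript`, `send`, `error`), the fall-back to `idle` on error, and the helper names `nextState` and `ringColour` are assumptions made for illustration and may not match what useVoice does internally.

```typescript
type VoiceState = "idle" | "listening" | "captured" | "transcribing" | "ready_to_send";

// Hypothetical event names; the page only documents the states.
type VoiceEvent = "start" | "stop" | "transcribe" | "transcript" | "send" | "error";

const TRANSITIONS: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  idle: { start: "listening" },
  listening: { stop: "captured", error: "idle" },
  captured: { transcribe: "transcribing", error: "idle" },
  transcribing: { transcript: "ready_to_send", error: "idle" },
  // Assumption: sending (or starting a new capture) leaves ready_to_send.
  ready_to_send: { send: "idle", start: "listening" },
};

// Events that are not valid in the current state are ignored.
function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  return TRANSITIONS[state][event] ?? state;
}

// Ring colours as listed under "What you get"; other states have no
// documented colour, so this sketch returns null for them.
function ringColour(state: VoiceState): "red" | "blue" | "green" | null {
  switch (state) {
    case "listening":
      return "red";
    case "transcribing":
      return "blue";
    case "ready_to_send":
      return "green";
    default:
      return null;
  }
}
```

Keeping the transitions in a plain table makes it easy to see that, for example, a `transcript` event arriving while the machine is `idle` changes nothing.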

The toolbar mounts under the chat input. The mic button toggles capture; the inline preview lets the user edit the transcript before pressing Send. Errors (mic denial, engine missing, network failure) show in the transcript placeholder.

Backend engines

internal/ai/voice is a thin shell wrapper. Each engine is two files:

whisper.go resolves a whisper-cli binary on $PATH, owns the ~/.quazzar/models/whisper/ggml-tiny.en.bin model, and shells out for each Transcribe(ctx, wav) call. The audio is written to a temporary file created with os.CreateTemp and passed to the binary, which returns plain text on stdout.
piper.go is the same shape for the en_US-amy-medium ONNX voice + its .onnx.json config, with Synthesize(ctx, text) piping text in via stdin and reading WAV bytes back from stdout.

Both engines report status via GetStatus():


{
  "available": true,
  "downloading": false,
  "progress": 100,
  "model_size_mb": 75,
  "binary_path": "/usr/local/bin/whisper-cli",
  "model_path": "/home/user/.quazzar/models/whisper/ggml-tiny.en.bin",
  "last_installed": "2026-04-27T15:20:00Z"
}
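
On the frontend, a status payload of this shape could be typed and summarised roughly as follows. The field names mirror the sample above; the `EngineStatus` and `VoiceStatus` type names, the `describeEngine` helper, and the reading of `progress` as a percentage are assumptions.

```typescript
// Mirrors the sample GetStatus() payload shown above.
interface EngineStatus {
  available: boolean;
  downloading: boolean;
  progress: number; // assumed to be a percentage, 0 to 100
  model_size_mb: number;
  binary_path: string;
  model_path: string;
  last_installed: string; // timestamp string, as in the sample
}

// Shape of GET /api/voice/status: one status per engine.
interface VoiceStatus {
  whisper: EngineStatus;
  piper: EngineStatus;
}

// Hypothetical helper: a one-line label for a settings panel.
function describeEngine(name: string, s: EngineStatus): string {
  if (s.downloading) return `${name}: downloading (${s.progress}%)`;
  if (s.available) return `${name}: installed (${s.model_size_mb} MB model)`;
  return `${name}: not installed`;
}
```

Checking `downloading` before `available` means an in-progress reinstall is shown as a download rather than as an installed engine; that ordering is a design choice of this sketch.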

REST surface

| Method | Path | Behaviour |
| --- | --- | --- |
| `GET` | `/api/voice/status` | `{whisper, piper}` engine status. |
| `POST` | `/api/voice/transcribe` | Multipart `audio` (16-bit mono PCM WAV) → `{text}`. 503 `engine_not_available` when whisper isn’t installed. |
| `POST` | `/api/voice/synthesize` | JSON `{text}` → `audio/wav`. 503 `engine_not_available` when Piper isn’t installed. |
| `POST` | `/api/voice/install/{engine}` | Pro+ only. Returns `202 Accepted` immediately; the download runs in a goroutine. |
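
As an illustration of the transcribe endpoint, the sketch below wraps raw samples in a 16-bit mono PCM WAV container and posts them as the multipart `audio` field. The function names, the `baseUrl` parameter and the `capture.wav` file name are hypothetical. The page does not document the 503 response body, so the sketch only checks the status code.

```typescript
// Wrap raw 16-bit mono PCM samples in a standard 44-byte RIFF/WAVE header.
function encodeWavPcm16Mono(samples: Int16Array, sampleRate: number): ArrayBuffer {
  const dataBytes = samples.length * 2;
  const buffer = new ArrayBuffer(44 + dataBytes);
  const view = new DataView(buffer);
  const ascii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };
  ascii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // RIFF chunk size
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = rate * 1 channel * 2 bytes
  view.setUint16(32, 2, true); // block align = 1 channel * 2 bytes
  view.setUint16(34, 16, true); // bits per sample
  ascii(36, "data");
  view.setUint32(40, dataBytes, true);
  samples.forEach((s, i) => view.setInt16(44 + i * 2, s, true));
  return buffer;
}

// Hypothetical client for POST /api/voice/transcribe.
async function transcribe(baseUrl: string, wav: ArrayBuffer): Promise<string> {
  const form = new FormData();
  form.append("audio", new Blob([wav], { type: "audio/wav" }), "capture.wav");
  const res = await fetch(`${baseUrl}/api/voice/transcribe`, { method: "POST", body: form });
  if (res.status === 503) {
    throw new Error("engine_not_available"); // whisper is not installed
  }
  if (!res.ok) {
    throw new Error(`transcribe failed: HTTP ${res.status}`);
  }
  const body = (await res.json()) as { text: string };
  return body.text;
}
```

A caller would typically catch the `engine_not_available` error and fall back to the browser engine. The sample rate to pass is not stated on this page; whisper.cpp tooling commonly works with 16 kHz input, so 16000 is a reasonable starting assumption.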

Air-gapped installs

If the node has no outbound internet, set QUAZZAR_VOICE_MIRROR to your internal CDN. The installer rewrites every https://github.com/... download URL by replacing the prefix with the mirror value. Other hosts are left untouched.
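
The rewrite rule can be illustrated with a short sketch. It shows the prefix replacement only and is not the installer's code. It assumes the replaced prefix is exactly `https://github.com`, that a trailing slash on the mirror value is ignored, and that an unset mirror leaves URLs unchanged; `applyVoiceMirror` and the example URLs are hypothetical.

```typescript
// Assumed exact prefix; the page only says "https://github.com/..." URLs are rewritten.
const GITHUB_PREFIX = "https://github.com";

// Hypothetical helper illustrating the QUAZZAR_VOICE_MIRROR rewrite rule.
function applyVoiceMirror(url: string, mirror: string | undefined): string {
  if (!mirror) return url; // no mirror configured: leave the URL alone
  // Require "/" right after the host so hosts such as
  // "github.com.example.test" are not treated as GitHub.
  if (!url.startsWith(GITHUB_PREFIX + "/")) return url;
  const base = mirror.replace(/\/+$/, ""); // drop trailing slashes
  return base + url.slice(GITHUB_PREFIX.length);
}
```

With a mirror of `https://cdn.internal.example/voice`, the made-up URL `https://github.com/example/repo/releases/download/v1/model.bin` would become `https://cdn.internal.example/voice/example/repo/releases/download/v1/model.bin`, so the mirror needs to reproduce the GitHub path layout beneath that base.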

Pricing

| Capability | Community | Pro+ |
| --- | --- | --- |
| Browser STT/TTS (Web Speech API) | ✅ | ✅ |
| Voice toolbar in Molly | ✅ | ✅ |
| Install whisper.cpp / Piper on the node | 🔒 | ✅ |

The browser path costs the OS nothing (it runs in the user’s browser), so it stays free on every plan. The native engines run as subprocesses on the OS host and consume CPU + ≈135 MB of disk per node — that’s why the install step itself is paid.

Troubleshooting

Mic permission denied: grant the browser microphone access for the OS hostname; the toolbar will refresh on the next start.
engine_not_available: the user picked the native engine in Settings but neither the whisper-cli binary nor the model is installed. Either flip the engine to “Browser” or click Install in the Pro+ Settings panel.
Hotkey doesn’t fire: open the Hotkeys overlay (Cmd+/) and check that “Toggle Molly voice” is bound. On macOS the system reserves Cmd+Space for Spotlight by default, so the keypress may never reach the browser; pick a different combo (e.g. Ctrl+Shift+M) if the default conflicts.