Voice Mode for Molly
Phase B.3 — release 0.7.7. Talk to Molly hands-free. Push-to-talk in the chat panel, the Cmd+Space hotkey for instant capture, and an optional on-prem speech engine for nodes that want zero-cloud voice.
What you get
- Push-to-talk button in the Molly chat panel, with a state-coloured ring (red while listening, blue while transcribing, green when the transcript is ready to send).
- Cmd+Space hotkey (configurable) toggles capture without moving your hands off the keyboard.
- Browser-side STT/TTS by default — uses the Web Speech API in Chrome / Edge / Safari. Free, zero install, zero server resources.
- Optional on-prem engines — install whisper.cpp (≈75 MB, English) and Piper (≈60 MB, en_US-amy-medium) for fully local-only voice. Lazy download; engines are NOT installed unless the user opts in.
- Mobile path — when a paired phone is connected through the §0.3 pair tunnel, voice frames travel as opaque AEAD frames and the OS-side transcription engine returns the text inline.
Privacy
- No always-listening / wake-word. Push-to-talk only. Browsers show their normal mic indicator while capture is active.
- Recordings are dropped after transcription. The OS processes the WAV bytes in memory and never persists them.
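- Browser STT is handled by the browser’s Web Speech API, so the OS never receives the audio on the default path. Some browsers (Chrome, for example) implement speech recognition with a vendor-hosted service, so audio may leave the device on that path; pick the native engine when speech must stay local. Native engines run locally on the OS host. The OS itself uses no cloud STT in this release.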
- Mobile path is end-to-end encrypted through the existing X25519 + XChaCha20-Poly1305 pair tunnel; the relay only sees ciphertext.
Settings
Settings → Molly → Voice exposes:
- STT engine — Auto (default), Browser (Web Speech API), or Native (whisper.cpp).
- TTS engine — Off, Browser (SpeechSynthesis), or Piper (on-prem).
- PTT hotkey — defaults to `Cmd+Space`. Edit the spec inline; modifiers are `Cmd`, `Ctrl`, `Alt`, `Shift`, joined with `+`.
- Input device — picker populated from `navigator.mediaDevices.enumerateDevices()`.
- Install Whisper / Install Piper buttons. 🔒 Pro+ — see Orbit Pro.
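The hotkey spec format described above (modifiers joined to a key with `+`) can be sketched as a small parser. This is only an illustration: the `Hotkey` type and `ParseHotkey` function are hypothetical names, and the rule that a spec must contain exactly one non-modifier key is an assumption rather than documented behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// Hotkey is a hypothetical parsed form of a spec such as "Cmd+Space".
type Hotkey struct {
	Mods []string // modifiers, in the order they were written
	Key  string   // the single non-modifier key
}

// validMods lists the four modifier names the Settings page documents.
var validMods = map[string]bool{"Cmd": true, "Ctrl": true, "Alt": true, "Shift": true}

// ParseHotkey splits a spec on "+" and separates modifiers from the key.
// Assumption: exactly one non-modifier token is required.
func ParseHotkey(spec string) (Hotkey, error) {
	var hk Hotkey
	for _, part := range strings.Split(spec, "+") {
		part = strings.TrimSpace(part)
		switch {
		case part == "":
			return Hotkey{}, fmt.Errorf("empty token in %q", spec)
		case validMods[part]:
			hk.Mods = append(hk.Mods, part)
		case hk.Key != "":
			return Hotkey{}, fmt.Errorf("more than one key in %q", spec)
		default:
			hk.Key = part
		}
	}
	if hk.Key == "" {
		return Hotkey{}, fmt.Errorf("no key in %q", spec)
	}
	return hk, nil
}

func main() {
	for _, spec := range []string{"Cmd+Space", "Ctrl+Shift+M", "Cmd+Shift"} {
		hk, err := ParseHotkey(spec)
		fmt.Println(spec, hk, err)
	}
}
```

A modifier-only spec such as `Cmd+Shift` is rejected in this sketch; whether the real settings field accepts it is not stated here.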
How it works
Frontend state machine
useVoice runs the push-to-talk machine: idle → listening → captured → transcribing → ready_to_send. It owns the MediaRecorder, an AnalyserNode for the level meter, and either a SpeechRecognition instance (browser path) or a fetch() to /api/voice/transcribe (native path).
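The state sequence above can be modelled as a lookup table. The sketch below is written in Go purely to illustrate the transitions; it is not the `useVoice` hook, which lives in the frontend. The `State` type, the `Advance` helper, and the step from `ready_to_send` back to `idle` after sending are assumptions.

```go
package main

import "fmt"

// State names follow the sequence listed in the text.
type State string

const (
	Idle         State = "idle"
	Listening    State = "listening"
	Captured     State = "captured"
	Transcribing State = "transcribing"
	ReadyToSend  State = "ready_to_send"
)

// next encodes the happy path. The final hop back to Idle (after the
// user presses Send) is an assumption, not something the text states.
var next = map[State]State{
	Idle:         Listening,
	Listening:    Captured,
	Captured:     Transcribing,
	Transcribing: ReadyToSend,
	ReadyToSend:  Idle,
}

// Advance returns the state that follows s, or an error for unknown states.
func Advance(s State) (State, error) {
	n, ok := next[s]
	if !ok {
		return Idle, fmt.Errorf("unknown state %q", s)
	}
	return n, nil
}

func main() {
	s := Idle
	for i := 0; i < len(next); i++ {
		n, _ := Advance(s)
		fmt.Printf("%s -> %s\n", s, n)
		s = n
	}
}
```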
The toolbar mounts under the chat input. The mic button toggles capture; the inline preview lets the user edit the transcript before pressing Send. Errors (mic denial, engine missing, network failure) show in the transcript placeholder.
Backend engines
internal/ai/voice is a thin shell wrapper. Each engine is two files:
- `whisper.go` resolves a `whisper-cli` binary on `$PATH`, owns the `~/.quazzar/models/whisper/ggml-tiny.en.bin` model, and shells out for each `Transcribe(ctx, wav)` call. The audio is handed to the binary via `os.CreateTemp`; the binary returns plain text on stdout.
- `piper.go` is the same shape for the en_US-amy-medium ONNX voice + its `.onnx.json` config, with `Synthesize(ctx, text)` piping text in via stdin and reading WAV bytes back from stdout.
Both engines report status via GetStatus():
```json
{
  "available": true,
  "downloading": false,
  "progress": 100,
  "model_size_mb": 75,
  "binary_path": "/usr/local/bin/whisper-cli",
  "model_path": "/home/user/.quazzar/models/whisper/ggml-tiny.en.bin",
  "last_installed": "2026-04-27T15:20:00Z"
}
```
REST surface
| Method | Path | Behaviour |
|---|---|---|
| GET | /api/voice/status | {whisper, piper} engine status. |
| POST | /api/voice/transcribe | Multipart audio (16-bit mono PCM WAV) → {text}. 503 engine_not_available when whisper isn’t installed. |
| POST | /api/voice/synthesize | JSON {text} → audio/wav. 503 engine_not_available when Piper isn’t installed. |
| POST | /api/voice/install/{engine} | Pro+ only. Returns 202 Accepted immediately; the download runs in a goroutine. |
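As an illustration of consuming the status endpoint, the sketch below decodes the response into Go structs and exercises the client against a stub server. The struct and function names are hypothetical, the top-level `whisper` and `piper` keys are inferred from the table, and the numeric field types are a guess based on the sample `GetStatus()` payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// EngineStatus mirrors the fields of the sample GetStatus() payload.
// float64 is used for numbers so integer and fractional values both decode.
type EngineStatus struct {
	Available     bool    `json:"available"`
	Downloading   bool    `json:"downloading"`
	Progress      float64 `json:"progress"`
	ModelSizeMB   float64 `json:"model_size_mb"`
	BinaryPath    string  `json:"binary_path"`
	ModelPath     string  `json:"model_path"`
	LastInstalled string  `json:"last_installed"` // kept as a string; format assumed RFC 3339
}

// VoiceStatus assumes the response is keyed by engine name.
type VoiceStatus struct {
	Whisper EngineStatus `json:"whisper"`
	Piper   EngineStatus `json:"piper"`
}

// FetchStatus calls GET /api/voice/status and decodes the JSON body.
func FetchStatus(client *http.Client, baseURL string) (VoiceStatus, error) {
	var vs VoiceStatus
	resp, err := client.Get(baseURL + "/api/voice/status")
	if err != nil {
		return vs, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return vs, fmt.Errorf("unexpected status %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&vs)
	return vs, err
}

// newStubServer stands in for a node so the sketch runs without one.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"whisper":{"available":true,"progress":100,"model_size_mb":75},"piper":{"available":false}}`)
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()
	vs, err := FetchStatus(srv.Client(), srv.URL)
	fmt.Println(vs.Whisper.Available, vs.Piper.Available, err)
}
```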
Air-gapped installs
If the node has no outbound internet, set QUAZZAR_VOICE_MIRROR to your internal CDN. The installer rewrites every https://github.com/... download URL by replacing the prefix with the mirror value. Other hosts are left untouched.
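The prefix rewrite can be sketched in a few lines. This is an assumption-laden illustration rather than the installer's code: `RewriteURL` is a hypothetical name, the matched prefix is taken to be `https://github.com/`, and trailing slashes on the mirror value are normalised here even though the text does not say whether the installer does so. The example URLs are placeholders.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// githubPrefix is the assumed prefix that gets swapped for the mirror.
const githubPrefix = "https://github.com/"

// RewriteURL replaces the GitHub prefix with the mirror value.
// URLs on other hosts, or any URL when no mirror is set, pass through.
func RewriteURL(rawURL, mirror string) string {
	if mirror == "" || !strings.HasPrefix(rawURL, githubPrefix) {
		return rawURL
	}
	return strings.TrimRight(mirror, "/") + "/" + strings.TrimPrefix(rawURL, githubPrefix)
}

func main() {
	mirror := os.Getenv("QUAZZAR_VOICE_MIRROR")
	// Placeholder download URL; the real release paths are not listed here.
	fmt.Println(RewriteURL("https://github.com/example/repo/releases/download/v1/model.bin", mirror))
}
```

With the mirror set to a value like `https://cdn.internal.example/voice`, the mirror would need to serve files under the same path layout as the original GitHub URLs for this scheme to resolve.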
Pricing
| Capability | Community | Pro+ |
|---|---|---|
| Browser STT/TTS (Web Speech API) | ✅ | ✅ |
| Voice toolbar in Molly | ✅ | ✅ |
| Install whisper.cpp / Piper on the node | 🔒 | ✅ |
The browser path costs the OS nothing (it runs in the user’s browser), so it stays free on every plan. The native engines run as subprocesses on the OS host and consume CPU + ≈135 MB of disk per node — that’s why the install step itself is paid.
Troubleshooting
- Mic permission denied — grant the browser microphone access for the OS hostname; the toolbar will refresh on the next start.
- `engine_not_available` — the user picked the native engine in Settings but neither the `whisper-cli` binary nor the model is installed. Either flip the engine to “Browser” or click Install on the Pro+ Settings panel.
- Hotkey doesn’t fire — open the Hotkeys overlay (Cmd+/) and check that “Toggle Molly voice” is bound. macOS reserves Cmd+Space for Spotlight by default, so the browser may never see the keypress; pick a different combo (e.g. `Ctrl+Shift+M`) if the default conflicts.