Windows STT Solution Analysis (2026-02-02)
Summary
We successfully tested microphone input (STT) on native Windows. The solution uses LiveKit transport, which bypasses the Windows temp file locking bug entirely.
Test Results
- Input: "Go ahead, I'm listening for 5 seconds."
- Output: "I am saying something with life-keyed transport. Does it work?"
- Timing: ttfa 0.9s, gen 0.9s, play 3.5s, record 5.1s, stt 0.3s, total 10.2s
- STT Provider: whisper-cpp
Key observation: We did NOT explicitly set transport="livekit". The default transport="auto" detected LiveKit running on port 7880 and used it automatically.
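For context, the auto-detection presumably amounts to checking whether anything is listening on the LiveKit port. A minimal sketch of that idea (our own illustration under that assumption, not VoiceMode's actual detection code):

```python
import socket

def livekit_reachable(host: str = "127.0.0.1", port: int = 7880, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on the LiveKit port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# If a probe like this succeeds, transport="auto" can select LiveKit; otherwise it
# falls back to local capture, which on Windows runs into the temp-file locking bug.
print("LiveKit detected" if livekit_reachable() else "LiveKit not running")
```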
Architecture Clarification
┌──────────────────────────────────────────────────────────────────┐
│ AUDIO CAPTURE (transport) │
├──────────────────────────────────────────────────────────────────┤
│ │
│ transport="local" transport="livekit" │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Microphone │ │ Microphone │ │
│ │ ↓ │ │ ↓ │ │
│ │ Temp WAV file │ ← BUG! │ WebRTC stream │ ← NO BUG │
│ │ ↓ │ │ ↓ │ │
│ │ WinError 32 │ │ LiveKit Server │ │
│ └─────────────────┘ │ ↓ │ │
│ │ Audio to MCP │ │
│ └─────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ STT PROCESSING (separate from transport) │
├──────────────────────────────────────────────────────────────────┤
│ │
│ VoiceMode MCP → Whisper Server (port 2022) │
│ │
│ Endpoint options: │
│ • /v1/audio/transcriptions (OpenAI-compatible) │
│ • /inference (native whisper.cpp) │
│ │
└──────────────────────────────────────────────────────────────────┘
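For reference, a transcription request to either endpoint is a multipart file upload; only the path differs (plus a model field for the OpenAI-compatible route). A hedged Python sketch using requests; field names follow the OpenAI transcription API and whisper.cpp's example server, so adjust to your build:

```python
import requests

WHISPER_URL = "http://localhost:2022"

with open("test.wav", "rb") as f:
    audio = f.read()

# OpenAI-compatible endpoint: multipart "file" plus a "model" field
openai_style = requests.post(
    f"{WHISPER_URL}/v1/audio/transcriptions",
    files={"file": ("test.wav", audio, "audio/wav")},
    data={"model": "whisper-1"},
)

# Native whisper.cpp endpoint: multipart "file" only
native_style = requests.post(
    f"{WHISPER_URL}/inference",
    files={"file": ("test.wav", audio, "audio/wav")},
)

print(openai_style.status_code, openai_style.text[:200])
print(native_style.status_code, native_style.text[:200])
```

A 404 from one of the two tells you which API style your whisper build exposes (the same idea as the curl probes further down).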
Questions Answered
1. Do we need the Whisper endpoint changes from PR #233?
Answer: It depends on your Whisper server build.
| Whisper Server Type | Exposes /v1/audio/transcriptions | Needs PR #233 whisper changes |
|---|---|---|
| whisper.cpp (vanilla) | No (only /inference) | YES |
| whisper.cpp with OpenAI API | Yes | No |
| Faster-Whisper | Yes | No |
| OpenAI API | Yes | No |
Our setup: The STT worked, showing (STT: whisper-cpp). This means either:
- We're using a whisper.cpp build that exposes OpenAI-compatible endpoints, OR
- We have the fork with whisper endpoint fixes installed
Check your installation:
```bash
# If you installed from the fork with fixes:
pip show voice-mode | grep Location
# Check if it points to your fork directory

# Test which endpoint your whisper uses:
curl http://localhost:2022/v1/audio/transcriptions -F file=@test.wav
curl http://localhost:2022/inference -F file=@test.wav
```
2. Do we need to explicitly use LiveKit instead of local transport?
Answer: On Windows, YES - but transport="auto" handles this automatically.
| Scenario | Recommendation |
|---|---|
| LiveKit running (port 7880) | transport="auto" (default) - auto-selects LiveKit |
| LiveKit NOT running | transport="local" - will hit WinError 32 bug |
| Force LiveKit | transport="livekit" - explicit, fails if not running |
Best practice: Just ensure LiveKit is running, and the default transport="auto" will use it.
PR #233 Component Analysis
| Component | What it fixes | Needed with LiveKit? |
|---|---|---|
| fcntl → msvcrt | File locking in conch.py | Maybe - Conch is used for multi-agent coordination, not audio capture. If you use wait_for_conch=true, you need this fix. |
| whisper /inference endpoint | STT to whisper.cpp | Depends - Only if your whisper.cpp doesn't expose OpenAI-compatible endpoints |
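For illustration, the fcntl → msvcrt change comes down to the fact that the fcntl module only exists on Unix, so the file-locking call needs a Windows branch. A simplified sketch of the pattern (not the actual conch.py code):

```python
import sys

if sys.platform == "win32":
    import msvcrt

    def lock_file(f):
        # Lock the first byte of the file; LK_LOCK retries for ~10 seconds
        # before raising OSError if another process already holds the lock.
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)

    def unlock_file(f):
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
else:
    import fcntl

    def lock_file(f):
        # Block until an exclusive lock on the whole file is acquired.
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)

    def unlock_file(f):
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```

This is independent of audio capture, which is why it still matters even when LiveKit handles the microphone.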
Recommendations
Minimum for Windows STT to work:
- Run LiveKit server (`C:\voicemode\start-livekit.bat`)
- Run Whisper server (`C:\voicemode\start-whisper.bat`)
- Use default `transport="auto"`
For complete Windows support (future-proofing):
- Support PR feat: Add native Windows support #233 fcntl changes (needed for `wait_for_conch` multi-agent coordination)
- Consider whisper endpoint changes as optional (config-based, not sequential probing; see the sketch below)
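To make "config option over sequential probing" concrete, here is a hedged sketch of configuration-driven endpoint selection; the environment variable names are hypothetical, not existing VoiceMode settings:

```python
import os

# Hypothetical config keys (illustrative only, not real VoiceMode settings):
#   VOICEMODE_STT_API_STYLE: "openai" or "whisper-cpp"
#   VOICEMODE_STT_BASE_URL:  base URL of the whisper server
STT_API_STYLE = os.environ.get("VOICEMODE_STT_API_STYLE", "openai")
STT_BASE_URL = os.environ.get("VOICEMODE_STT_BASE_URL", "http://localhost:2022")

def stt_endpoint() -> str:
    """Pick the transcription URL once from config instead of probing both endpoints per request."""
    if STT_API_STYLE == "whisper-cpp":
        return f"{STT_BASE_URL}/inference"            # native whisper.cpp server
    return f"{STT_BASE_URL}/v1/audio/transcriptions"  # OpenAI-compatible servers
```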
Updated CLAUDE.md Recommendations
The current CLAUDE.md suggests using transport="livekit" explicitly. This can be simplified:
```markdown
## Voice on Windows

Services must be running:
- LiveKit: port 7880 (required for Windows mic input)
- Whisper: port 2022 (STT)
- Kokoro: port 8880 (TTS)

No special parameters needed - default `transport="auto"` detects LiveKit.
```

Related Issues/PRs
- PR feat: Add native Windows support #233: Native Windows Support (fcntl + whisper.cpp)
  - fcntl changes: Good to merge
  - whisper endpoint: Maintainers suggest config option over sequential probing
- Issue Windows 11 WSL Audio Streaming Issue - Choppy Playback After N% of Progress #98: WSL Audio Choppy - LiveKit bypasses this entirely
- Issue [BUG] WinError 32 on Windows - temp file not closed before STT #135: Windows temp file locking - LiveKit bypasses this entirely
Conclusion
LiveKit transport is the correct solution for Windows. The PR #233 changes are complementary:
- fcntl fix: Still useful for edge cases (multi-agent conch)
- whisper endpoint: Only needed if using vanilla whisper.cpp without OpenAI-compatible endpoints
The fact that our test worked without any special configuration suggests the current setup is correct.