[ Windows ] Windows STT Solution Analysis #239

@kindcreator

Windows STT Solution Analysis (2026-02-02)

Summary

We successfully tested microphone input (STT) on native Windows. The solution uses LiveKit transport, which bypasses the Windows temp file locking bug entirely.

Test Results

Input:  "Go ahead, I'm listening for 5 seconds."
Output: "I am saying something with life-keyed transport. Does it work?"
Timing: ttfa 0.9s, gen 0.9s, play 3.5s, record 5.1s, stt 0.3s, total 10.2s
STT Provider: whisper-cpp

Key observation: We did NOT explicitly set transport="livekit". The default transport="auto" detected LiveKit running on port 7880 and used it automatically.
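
For intuition, the auto-detection behaves like a port probe: if something answers on the LiveKit port, use LiveKit; otherwise fall back to local capture. The sketch below is illustrative only - livekit_reachable and resolve_transport are made-up names, not VoiceMode internals.

```python
# Illustrative sketch (not VoiceMode's actual code) of how transport="auto"
# can resolve to LiveKit: probe the LiveKit port, fall back to local capture.
import socket

def livekit_reachable(host: str = "localhost", port: int = 7880, timeout: float = 0.5) -> bool:
    """Return True if something is listening on the LiveKit port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def resolve_transport(requested: str = "auto") -> str:
    """Resolve 'auto' to 'livekit' when the server is up, otherwise 'local'."""
    if requested != "auto":
        return requested
    return "livekit" if livekit_reachable() else "local"

print(resolve_transport())  # prints 'livekit' when the server on :7880 is running
```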

Architecture Clarification

┌──────────────────────────────────────────────────────────────────┐
│                    AUDIO CAPTURE (transport)                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                    │
│  transport="local"              transport="livekit"               │
│  ┌─────────────────┐            ┌─────────────────┐               │
│  │ Microphone      │            │ Microphone      │               │
│  │      ↓          │            │      ↓          │               │
│  │ Temp WAV file   │ ← BUG!     │ WebRTC stream   │ ← NO BUG      │
│  │      ↓          │            │      ↓          │               │
│  │ WinError 32     │            │ LiveKit Server  │               │
│  └─────────────────┘            │      ↓          │               │
│                                 │ Audio to MCP    │               │
│                                 └─────────────────┘               │
└──────────────────────────────────────────────────────────────────┘
                                        ↓
┌──────────────────────────────────────────────────────────────────┐
│                    STT PROCESSING (separate from transport)       │
├──────────────────────────────────────────────────────────────────┤
│                                                                    │
│  VoiceMode MCP → Whisper Server (port 2022)                       │
│                                                                    │
│  Endpoint options:                                                 │
│  • /v1/audio/transcriptions  (OpenAI-compatible)                  │
│  • /inference                (native whisper.cpp)                 │
│                                                                    │
└──────────────────────────────────────────────────────────────────┘
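
As a concrete example of the STT leg, here is a minimal client that posts a WAV file to the OpenAI-compatible endpoint on port 2022. The file path and model name are placeholders, and the response handling assumes an OpenAI-style {"text": ...} payload.

```python
# Minimal STT client sketch: post recorded audio to the Whisper server on port 2022.
# Assumes an OpenAI-compatible endpoint; "test.wav" and the model name are placeholders.
import requests

def transcribe(path: str = "test.wav") -> str:
    with open(path, "rb") as audio:
        resp = requests.post(
            "http://localhost:2022/v1/audio/transcriptions",
            files={"file": audio},
            data={"model": "whisper-1"},  # placeholder; some local servers ignore it
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()["text"]  # OpenAI-style servers return {"text": "..."}

if __name__ == "__main__":
    print(transcribe())
```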

Questions Answered

1. Do we need the Whisper endpoint changes from PR #233?

Answer: It depends on your Whisper server build.

| Whisper Server Type | Exposes /v1/audio/transcriptions | Needs PR #233 whisper changes |
|---|---|---|
| whisper.cpp (vanilla) | No (only /inference) | YES |
| whisper.cpp with OpenAI API | Yes | No |
| Faster-Whisper | Yes | No |
| OpenAI API | Yes | No |

Our setup: The STT worked, showing (STT: whisper-cpp). This means either:

  1. We're using a whisper.cpp build that exposes OpenAI-compatible endpoints, OR
  2. We have the fork with whisper endpoint fixes installed

Check your installation:

```bash
# If you installed from the fork with fixes:
pip show voice-mode | grep Location
# Check if it points to your fork directory

# Test which endpoint your whisper uses:
curl http://localhost:2022/v1/audio/transcriptions -F file=@test.wav
curl http://localhost:2022/inference -F file=@test.wav
```

2. Do we need to explicitly use LiveKit instead of local transport?

Answer: On Windows, YES - but transport="auto" handles this automatically.

| Scenario | Recommendation |
|---|---|
| LiveKit running (port 7880) | transport="auto" (default) - auto-selects LiveKit |
| LiveKit NOT running | transport="local" - will hit the WinError 32 bug |
| Force LiveKit | transport="livekit" - explicit, fails if LiveKit is not running |

Best practice: Just ensure LiveKit is running, and the default transport="auto" will use it.

PR #233 Component Analysis

| Component | What it fixes | Needed with LiveKit? |
|---|---|---|
| fcntl → msvcrt | File locking in conch.py | Maybe - Conch is used for multi-agent coordination, not audio capture. If you use wait_for_conch=true, you need this fix. |
| whisper /inference endpoint | STT to whisper.cpp | Depends - only if your whisper.cpp build doesn't expose OpenAI-compatible endpoints |
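
For context, the fcntl → msvcrt change implies a cross-platform lock shim along the lines below. This is an illustrative pattern, not the actual conch.py code, and conch.lock is a made-up filename.

```python
# Cross-platform advisory file lock: fcntl on POSIX, msvcrt on Windows.
# Illustrates the fcntl -> msvcrt pattern; not the actual conch.py implementation.
import os

if os.name == "nt":
    import msvcrt

    def lock_file(f):
        f.seek(0)
        # Non-blocking lock on the first byte; raises OSError if already held.
        msvcrt.locking(f.fileno(), msvcrt.LK_NBLCK, 1)

    def unlock_file(f):
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
else:
    import fcntl

    def lock_file(f):
        # Non-blocking exclusive lock; raises BlockingIOError if already held.
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)

    def unlock_file(f):
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)

# Usage: hold the lock (the "conch") while coordinating agents.
with open("conch.lock", "a+b") as fh:
    lock_file(fh)
    try:
        pass  # critical section
    finally:
        unlock_file(fh)
```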

Recommendations

Minimum for Windows STT to work:

  1. Run LiveKit server (C:\voicemode\start-livekit.bat)
  2. Run Whisper server (C:\voicemode\start-whisper.bat)
  3. Use default transport="auto"

For complete Windows support (future-proofing):

  1. Adopt the fcntl → msvcrt changes from PR #233 (feat: Add native Windows support) - needed for wait_for_conch multi-agent coordination
  2. Treat the whisper endpoint changes as optional and config-based, rather than relying on sequential endpoint probing (see the sketch below)
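
To illustrate the config-based approach, a selection along these lines would avoid probing both endpoints at runtime. The environment variable name below is hypothetical, not an existing VoiceMode setting.

```python
# Config-based endpoint selection sketch: pick the Whisper endpoint from configuration
# instead of probing /v1/audio/transcriptions and /inference in sequence.
# VOICEMODE_WHISPER_API is a hypothetical setting, not an existing option.
import os

WHISPER_BASE = "http://localhost:2022"

ENDPOINTS = {
    "openai": f"{WHISPER_BASE}/v1/audio/transcriptions",  # OpenAI-compatible builds
    "native": f"{WHISPER_BASE}/inference",                # vanilla whisper.cpp
}

def stt_endpoint() -> str:
    style = os.environ.get("VOICEMODE_WHISPER_API", "openai")
    return ENDPOINTS[style]

print(stt_endpoint())
```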

Updated CLAUDE.md Recommendations

The current CLAUDE.md suggests using transport="livekit" explicitly. This can be simplified:

```markdown
## Voice on Windows

Services must be running:
- LiveKit: port 7880 (required for Windows mic input)
- Whisper: port 2022 (STT)
- Kokoro: port 8880 (TTS)

No special parameters needed - default `transport="auto"` detects LiveKit.
```
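
To confirm all three services are up before starting a session, a quick probe of the ports listed above is enough; this helper is a convenience sketch, not part of VoiceMode.

```python
# Readiness check for the services CLAUDE.md expects (ports from the list above).
import socket

SERVICES = {"LiveKit": 7880, "Whisper": 2022, "Kokoro": 8880}

for name, port in SERVICES.items():
    try:
        with socket.create_connection(("localhost", port), timeout=0.5):
            print(f"{name:8} port {port}: up")
    except OSError:
        print(f"{name:8} port {port}: NOT reachable")
```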

Related Issues/PRs

  • PR #233 - feat: Add native Windows support

Conclusion

LiveKit transport is the correct solution for Windows. The PR #233 changes are complementary:

  • fcntl fix: Still useful for edge cases (multi-agent conch)
  • whisper endpoint: Only needed if using vanilla whisper.cpp without OpenAI-compatible endpoints

The fact that our test worked without any special configuration suggests the current setup is correct.
