53 changes: 37 additions & 16 deletions README.md
@@ -6,20 +6,23 @@ Transcribe input from your microphone and turn it into key presses on a virtual
<img src="https://github.com/user-attachments/assets/bc1de2af-e07b-4460-a522-b140a041a3db" alt="VoxInput Robot Mascot" width="400">
</p>

VoxInput is meant to be used with [LocalAI](https://localai.io), but it will function with any OpenAI compatible API that provides the transcription endpoint.
VoxInput is meant to be used with [LocalAI](https://localai.io), but it will function with any OpenAI compatible API that provides the transcription endpoint or realtime API.

## Features

- **Speech-to-Text Daemon**: Runs as a background process to listen for signals to start or stop recording audio.
- **Audio Capture and Playback**: Records audio from the microphone and plays it back for verification.
- **Transcription**: Converts recorded audio into text using a local or remote transcription service.
- **Text Automation**: Simulates typing the transcribed text into an application using [`dotool`](https://git.sr.ht/~geb/dotool).
- **Voice Activity Detection**: In realtime mode VoxInput uses VAD to detect speech segments and automatically transcribe them.

## Requirements

- `dotool` (for simulating keyboard input)
- `OPENAI_API_KEY` or `VOXINPUT_API_KEY`: Your OpenAI API key for Whisper transcription. If you have a local instance with no key, then just leave it unset.
- `OPENAPI_BASE_URL` or `VOXINPUT_BASE_URL`: The base URL of the OpenAI Whisper API server: defaults to `http://localhost:8080/v1`
- `OPENAI_BASE_URL` or `VOXINPUT_BASE_URL`: The base URL of the OpenAI compatible API server: defaults to `http://localhost:8080/v1`
- `OPENAI_WS_BASE_URL` or `VOXINPUT_WS_BASE_URL`: The base URL of the realtime websocket API: defaults to `ws://localhost:8080/v1/realtime`
- OpenAI Realtime API support - VoxInput's realtime mode with VAD requires a [websocket endpoint that supports OpenAI's realtime API in transcription-only mode](https://github.com/mudler/LocalAI/pull/5392). You can disable realtime mode with `--no-realtime`.

Note that the VoxInput env vars take precedence over the OpenAI ones.
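
The precedence rule can be illustrated with a minimal Go sketch. This is not VoxInput's actual code: the `firstSet` helper is hypothetical, and treating an empty value as unset is an assumption made for the example. The variable names and defaults are the ones listed above.

```go
package main

import (
	"fmt"
	"os"
)

// firstSet returns the value of the first key that is set and non-empty,
// or fallback when none is. The lookup function is injected so the logic
// can be exercised without touching the real environment.
// Assumption: an empty value is treated the same as an unset one.
func firstSet(lookup func(string) (string, bool), fallback string, keys ...string) string {
	for _, k := range keys {
		if v, ok := lookup(k); ok && v != "" {
			return v
		}
	}
	return fallback
}

func main() {
	// VOXINPUT_* keys are listed first so they win over the OPENAI_* ones.
	baseURL := firstSet(os.LookupEnv, "http://localhost:8080/v1",
		"VOXINPUT_BASE_URL", "OPENAI_BASE_URL")
	wsURL := firstSet(os.LookupEnv, "ws://localhost:8080/v1/realtime",
		"VOXINPUT_WS_BASE_URL", "OPENAI_WS_BASE_URL")
	apiKey := firstSet(os.LookupEnv, "",
		"VOXINPUT_API_KEY", "OPENAI_API_KEY")

	fmt.Println("base URL:", baseURL)
	fmt.Println("websocket URL:", wsURL)
	fmt.Println("API key set:", apiKey != "")
}
```

Ordering the keys from most to least specific keeps the precedence visible at each call site.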

@@ -47,19 +50,14 @@ KERNEL=="uinput", GROUP="input", MODE="0620", OPTIONS+="static_node=uinput"
cd VoxInput
```

2. Install dependencies:
2. Build the project:
```bash
go mod tidy
go build -mod=vendor -o voxinput
```

3. Build the project:
```bash
go build -o voxinput
```

4. Ensure `dotool` is installed on your system and it can make key presses.
3. Ensure `dotool` is installed on your system and it can make key presses.

5. It makes sense to bind the `record` and `write` commands to keys using your window manager. For instance in my Sway config I have the following
4. It makes sense to bind the `record` and `write` commands to keys using your window manager. For instance in my Sway config I have the following

```
bindsym $mod+Shift+t exec voxinput record
@@ -70,19 +68,22 @@ Alternatively you can use the Nix flake.

## Usage

The `LANG` and `VOXINPUT_LANG` environment variables are used to tell the transcription service which language to use.
For multi-lingual use set `VOXINPUT_LANG` to an empty string.
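
One way the language selection described above could be resolved is sketched below. The parsing VoxInput actually performs is not shown in this README, so the `resolveLang` helper, the trimming of a locale such as `en_GB.UTF-8` down to `en`, and the handling of the `C` and `POSIX` locales are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveLang returns a language code for the transcription request, or ""
// to leave the language unspecified (multi-lingual use).
// Assumptions: VOXINPUT_LANG overrides LANG whenever it is set, even to an
// empty string, and only the part before '_', '.' or '@' is kept.
func resolveLang(lookup func(string) (string, bool)) string {
	raw, ok := lookup("VOXINPUT_LANG")
	if !ok {
		raw, _ = lookup("LANG")
	}
	// Trim territory, encoding and modifier: "en_GB.UTF-8" becomes "en".
	if i := strings.IndexAny(raw, "_.@"); i >= 0 {
		raw = raw[:i]
	}
	// The C and POSIX locales carry no language information.
	if raw == "C" || raw == "POSIX" {
		return ""
	}
	return raw
}

func main() {
	lang := resolveLang(os.LookupEnv)
	if lang == "" {
		fmt.Println("no language hint (multi-lingual)")
		return
	}
	fmt.Println("language hint:", lang)
}
```

Using `os.LookupEnv` rather than `os.Getenv` is what lets the sketch tell an unset `VOXINPUT_LANG` apart from one deliberately set to an empty string.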

### Commands

- **`listen`**: Starts the speech-to-text daemon.
```bash
./voxinput listen
```

- **`record`**: Sends a signal to the daemon to start recording audio then exits.
- **`record`**: Sends a signal to the daemon to start recording audio then exits. In realtime mode this will start transcription.
```bash
./voxinput record
```

- **`write`**: Sends a signal to the daemon to stop recording, transcribe the audio, and simulate typing the text.
- **`write`** or **`stop`**: Sends a signal to the daemon to stop recording. When not in realtime mode this triggers transcription.
```bash
./voxinput write
```
@@ -92,11 +93,30 @@ Alternatively you can use the Nix flake.
./voxinput help
```

### Example Realtime Workflow

1. Start the daemon in a terminal window:
```bash
OPENAI_BASE_URL=http://ai.local:8081/v1 OPENAI_WS_BASE_URL=ws://ai.local:8081/v1/realtime ./voxinput listen
```

2. Select a text box you want to speak into and use a global shortcut to run the following:
```bash
./voxinput record
```

3. Begin speaking. When you pause for a second or two, your speech will be transcribed and typed into the active application.

4. Send a signal to stop recording:
```bash
./voxinput stop
```

### Example Workflow

1. Start the daemon in a terminal window:
```bash
./voxinput listen
OPENAI_BASE_URL=http://ai.local:8081/v1 ./voxinput listen --no-realtime
```

2. Select a text box you want to speak into and use a global shortcut to run the following
@@ -115,9 +135,10 @@ Alternatively you can use the Nix flake.

- [x] Put playback behind a debug switch
- [x] Create a release
- [ ] Realtime Transcription
- [x] Realtime Transcription
- [ ] GUI and system tray
- [ ] Voice detection and activation
- [x] Voice detection and activation (partial, see below)
- [ ] Code words to start and stop transcription
- [ ] Allow user to describe a button they want to press (requires submitting screen shot and transcription to LocalAGI)

## Signals
4 changes: 2 additions & 2 deletions flake.nix
@@ -37,12 +37,12 @@
{
default = pkgs.buildGoModule {
pname = "voxinput";
version = "0.3.0";
version = "0.4.0";

# Path to the source code
src = ./.;

vendorHash = "sha256-OserWlRhKyTvLrYSikNCjdDdTATIcWTfqJi9n4mHVLE="; #nixpkgs.lib.fakeHash;
vendorHash = null; #nixpkgs.lib.fakeHash;

nativeBuildInputs = with pkgs; [
makeWrapper
7 changes: 6 additions & 1 deletion go.mod
@@ -3,6 +3,11 @@ module github.com/richiejp/VoxInput
go 1.24.2

require (
github.com/WqyJh/go-openai-realtime v0.5.0
github.com/gen2brain/malgo v0.11.23
github.com/sashabaranov/go-openai v1.39.1
github.com/sashabaranov/go-openai v1.32.0
)

replace github.com/WqyJh/go-openai-realtime => ../go-openai-realtime

require github.com/coder/websocket v1.8.12 // indirect
16 changes: 14 additions & 2 deletions go.sum
@@ -1,4 +1,16 @@
github.com/WqyJh/jsontools v0.3.1 h1:zKT+DvxUSTji06ZcjsbQzZ48PycFZDI0OGATmmFhJ+U=
github.com/WqyJh/jsontools v0.3.1/go.mod h1:Gk2OlyXjAJmYNZ0aUbEXGHq4I5ihGRjXxVuUprWtkss=
github.com/coder/websocket v1.8.12 h1:5bUXkEPPIbewrnkU8LTCLVaxi4N4J8ahufH2vlo4NAo=
github.com/coder/websocket v1.8.12/go.mod h1:LNVeNrXQZfe5qhS9ALED3uA+l5pPqvwXg3CKoDBB2gs=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/gen2brain/malgo v0.11.23 h1:3/VAI8DP9/Wyx1CUDNlUQJVdWUvGErhjHDqYcHVk9ME=
github.com/gen2brain/malgo v0.11.23/go.mod h1:f9TtuN7DVrXMiV/yIceMeWpvanyVzJQMlBecJFVMxww=
github.com/sashabaranov/go-openai v1.39.1 h1:TMD4w77Iy9WTFlgnjNaxbAASdsCJ9R/rMdzL+SN14oU=
github.com/sashabaranov/go-openai v1.39.1/go.mod h1:lj5b/K+zjTSFxVLijLSTDZuP7adOgerWeFyZLUhAKRg=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/sashabaranov/go-openai v1.32.0 h1:Yk3iE9moX3RBXxrof3OBtUBrE7qZR0zF9ebsoO4zVzI=
github.com/sashabaranov/go-openai v1.32.0/go.mod h1:lj5b/K+zjTSFxVLijLSTDZuP7adOgerWeFyZLUhAKRg=
github.com/stretchr/testify v1.9.0 h1:HtqpIVDClZ4nwg75+f6Lvsy/wHu+3BoSGCbBAcpTsTg=
github.com/stretchr/testify v1.9.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
18 changes: 9 additions & 9 deletions internal/audio/audio.go
@@ -12,11 +12,11 @@ import (
// StreamConfig describes the parameters for an audio stream.
// Default values will pick the defaults of the default device.
type StreamConfig struct {
Format malgo.FormatType
Channels int
SampleRate int
DeviceType malgo.DeviceType
MalgoContext malgo.Context
Format malgo.FormatType
Channels int
SampleRate int
DeviceType malgo.DeviceType
MalgoContext malgo.Context
}

func (config StreamConfig) asDeviceConfig(deviceType malgo.DeviceType) malgo.DeviceConfig {
@@ -33,13 +33,13 @@ func (config StreamConfig) asDeviceConfig(deviceType malgo.DeviceType) malgo.Dev
deviceConfig.SampleRate = uint32(config.SampleRate)
}
if config.DeviceType != 0 {
deviceConfig.DeviceType = config.DeviceType
deviceConfig.DeviceType = config.DeviceType
}
return deviceConfig
}

func stream(ctx context.Context, abortChan chan error, config StreamConfig, deviceCallbacks malgo.DeviceCallbacks) error {
deviceConfig := config.asDeviceConfig(malgo.Capture)
deviceConfig := config.asDeviceConfig(malgo.Capture)
device, err := malgo.InitDevice(config.MalgoContext, deviceConfig, deviceCallbacks)
if err != nil {
return err
@@ -71,7 +71,7 @@ func stream(ctx context.Context, abortChan chan error, config StreamConfig, devi
// Capturing will commence writing the samples to the writer until either the
// writer returns an error, or the context signals done.
func Capture(ctx context.Context, w io.Writer, config StreamConfig) error {
config.DeviceType = malgo.Capture
config.DeviceType = malgo.Capture
abortChan := make(chan error)
defer close(abortChan)
aborted := false
@@ -99,7 +99,7 @@ func Capture(ctx context.Context, w io.Writer, config StreamConfig) error {
// Playback will commence playing the samples provided from the reader until either the
// reader returns an error, or the context signals done.
func Playback(ctx context.Context, r io.Reader, config StreamConfig) error {
config.DeviceType = malgo.Playback
config.DeviceType = malgo.Playback
abortChan := make(chan error)
defer close(abortChan)
aborted := false
34 changes: 17 additions & 17 deletions internal/audio/wav.go
@@ -8,26 +8,26 @@ import (
// WAVHeader represents the WAV file header (44 bytes for PCM)
type WAVHeader struct {
// RIFF Chunk (12 bytes)
ChunkID [4]byte
ChunkSize uint32
Format [4]byte
ChunkID [4]byte
ChunkSize uint32
Format [4]byte

// fmt Subchunk (16 bytes)
Subchunk1ID [4]byte
Subchunk1Size uint32
AudioFormat uint16
NumChannels uint16
SampleRate uint32
ByteRate uint32
BlockAlign uint16
BitsPerSample uint16
Subchunk1ID [4]byte
Subchunk1Size uint32
AudioFormat uint16
NumChannels uint16
SampleRate uint32
ByteRate uint32
BlockAlign uint16
BitsPerSample uint16

// data Subchunk (8 bytes)
Subchunk2ID [4]byte
Subchunk2Size uint32
Subchunk2ID [4]byte
Subchunk2Size uint32
}

func NewWAVHeader(pcmData []byte) WAVHeader {
func NewWAVHeader(pcmLen uint32) WAVHeader {
header := WAVHeader{
ChunkID: [4]byte{'R', 'I', 'F', 'F'},
Format: [4]byte{'W', 'A', 'V', 'E'},
@@ -40,14 +40,14 @@ func NewWAVHeader(pcmData []byte) WAVHeader {
BlockAlign: 2, // 16-bit = 2 bytes per sample
BitsPerSample: 16,
Subchunk2ID: [4]byte{'d', 'a', 't', 'a'},
Subchunk2Size: uint32(len(pcmData)),
Subchunk2Size: pcmLen,
}

header.ChunkSize = 36 + header.Subchunk2Size
header.ChunkSize = 36 + header.Subchunk2Size

return header
}

func (h *WAVHeader) Write(writer io.Writer) error {
return binary.Write(writer, binary.LittleEndian, h)
return binary.Write(writer, binary.LittleEndian, h)
}