
Add TTSKit with Qwen3-TTS support#425

Open
ZachNagengast wants to merge 1 commit into main from ttskit

ZachNagengast (Contributor) commented Feb 19, 2026

WhisperKit is expanding into text-to-speech!

TTSKit adds a new library for on-device text-to-speech using Core ML-accelerated Qwen3-TTS models (CustomVoice 0.6B and 1.7B in this first release) with real-time streaming playback on Apple Silicon. In this first PR, we're introducing the library into the WhisperKit package as an optional import (the package will be renamed to reflect the new multi-Kit nature of the Argmax open-source SDK). It adds real-time TTS capabilities with a state-of-the-art open-source model, either on its own or as a complement to WhisperKit speech-to-text.

This PR is still in the final phases of development, but here are a few highlights:

TTSKit Library

  • Download, load, generate, and stream playback in ~3 lines of code.
  • Protocol-based component architecture (6 swappable Core ML components: TextProjecting, CodeEmbedding, MultiCodeEmbedding, CodeDecoding, MultiCodeDecoding, SpeechDecoding) for plugging in new model backends.
  • Qwen3-TTS implementation with 9 built-in voices, 10 languages, and style instruction support (1.7B variant only).
  • Automatic text chunking for long-form generations with concurrent chunk generation and cross-fade stitching.
  • Adaptive streaming playback (TTSPlaybackStrategy.auto) that measures first-step latency to pre-buffer just enough audio.
  • Seedable RNG for reproducible generation.
  • WAV and M4A (AAC) audio export.
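
The cross-fade stitching mentioned in the list above can be pictured with a small sketch. This is not TTSKit's implementation: the function name crossFade, the linear ramp, and the overlap parameter are assumptions made for illustration, and the PR does not say which fade curve or overlap length the library uses.

```swift
/// Hypothetical helper (not part of TTSKit): joins two audio chunks by
/// linearly blending the last `overlap` samples of `head` with the first
/// `overlap` samples of `tail`.
func crossFade(_ head: [Float], _ tail: [Float], overlap: Int) -> [Float] {
    // Clamp the overlap so it never exceeds either chunk.
    let n = max(0, min(overlap, head.count, tail.count))

    // Everything before the overlap region is copied unchanged.
    var output = Array(head.dropLast(n))
    output.reserveCapacity(head.count + tail.count - n)

    // Inside the overlap, fade `head` out while fading `tail` in.
    for i in 0..<n {
        let weight = Float(i + 1) / Float(n + 1)
        let fadingOut = head[head.count - n + i] * (1 - weight)
        let fadingIn = tail[i] * weight
        output.append(fadingOut + fadingIn)
    }

    // Everything after the overlap region is copied unchanged.
    output.append(contentsOf: tail.dropFirst(n))
    return output
}
```

The stitched result is shorter than the two chunks laid end to end by exactly the overlap length, so a caller that tracks timestamps would need to account for that.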

Example usage playing audio in real-time out of the default speaker:

    let ttsKit = try await TTSKit()
    try await ttsKit.playSpeech(text: "Hello from TTSKit!")
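
For the adaptive playback strategy listed earlier, one way to reason about "just enough" pre-buffering is via the real-time factor of generation. The sketch below is an assumption about how such a calculation could look, not a description of what TTSPlaybackStrategy.auto does; the function and parameter names are hypothetical. If one step takes stepLatency seconds to produce audioPerStep seconds of audio, and generation is slower than playback, buffering totalAudio × (1 − 1/r) seconds up front (with r the real-time factor) lets generation finish just as playback catches up.

```swift
/// Hypothetical helper (not part of TTSKit): seconds of audio to buffer
/// before starting playback so that playback never outruns generation.
func preBufferSeconds(stepLatency: Double, audioPerStep: Double, totalAudio: Double) -> Double {
    guard stepLatency > 0, audioPerStep > 0, totalAudio > 0 else { return 0 }

    // Wall-clock seconds needed per second of generated audio.
    let realTimeFactor = stepLatency / audioPerStep

    // Generation keeps up with playback: no extra buffering is needed.
    guard realTimeFactor > 1 else { return 0 }

    // Generation is slower than real time: buffer enough that the last
    // sample is generated no later than playback reaches it.
    return totalAudio * (1 - 1 / realTimeFactor)
}
```

In practice the total audio length is not known in advance for autoregressive generation, so a real implementation would have to estimate it, for example from the text length.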

New target: ArgmaxCore

  • Extracted a shared target with various utilities from WhisperKit so TTSKit can use them without depending on WhisperKit directly.

CLI

  • For now, we plan to ship this as a new tts subcommand of whisperkit-cli that can be used like this:
    • swift run whisperkit-cli tts --text "Hello from TTSKit" --play
    • Full control over speaker, language, model, style instruction, temperature, chunking, compute units, and seed.

TTSKit Example app

  • macOS and iOS example app with model management, real-time waveform visualization, generation history persisted as M4A files, and more. Use this as a quick way to try it out!
[Screenshot: ttskit-example-app]

Roadmap

We plan to keep adding support for state-of-the-art models and improving inference latency for TTSKit over the next few weeks. The immediate follow-ups are the voice cloning feature from Qwen3-TTS and a 2x reduction in time-to-first-byte (TTFB), so that this on-device project consistently achieves sub-100 ms TTFB, providing a latency edge over cloud deployments of the same model. In the meantime, we encourage anyone reading this to check out this PR, give it a spin, and let us know how it goes!


@MainActor
@Observable
final class ViewModel: @unchecked Sendable {
Contributor

I would break this down into smaller view models if it grows too long, e.g., DownloadViewModel vs. TTSViewModel.

@@ -1,5 +1,5 @@
{
Contributor

why do we need to change this?

Contributor Author

Had an old swift-transformers resolved

/// Thin wrapper around `os_unfair_lock` that exposes a Swift-friendly
/// `withLock` helper. This lock is non-reentrant and optimized for low
/// contention, matching the semantics of Core Foundation's unfair lock.
public final class UnfairLock: @unchecked Sendable {
Contributor

I think we should give this class a generic name to future-proof it for Swift 6; os_unfair_lock does not seem to be the recommended way to lock in Swift 6.
We could rename it to Mutex so we can reimplement it later on top of the actual Mutex from the Synchronization module.

Now:

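import os

// Requires macOS 13 / iOS 16 or later for OSAllocatedUnfairLock.
public final class Mutex: @unchecked Sendable {
    private let lock = OSAllocatedUnfairLock()

    public init() {}

    // Not @inlinable: an inlinable body cannot reference the private `lock`.
    public func withLock<T>(_ body: () throws -> T) rethrows -> T {
        // The unchecked variant accepts a non-Sendable closure and result.
        try lock.withLockUnchecked(body)
    }
}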


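Later:

import Synchronization

// Requires Swift 6 and macOS 15 / iOS 18 or later.
public final class Mutex<Value>: Sendable {
    // Module-qualified so the name does not resolve to this wrapper class.
    private let mutex: Synchronization.Mutex<Value>

    public init(_ value: sending Value) {
        self.mutex = Synchronization.Mutex(value)
    }

    // Mirrors the signature of Synchronization.Mutex.withLock so the body
    // can be forwarded unchanged.
    public func withLock<T, E: Error>(
        _ body: (inout sending Value) throws(E) -> sending T
    ) throws(E) -> sending T {
        try mutex.withLock(body)
    }
}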

@@ -0,0 +1,86 @@
// For licensing see accompanying LICENSE.md file.
// Copyright © 2024 Argmax, Inc. All rights reserved.
Contributor

This should be 2026; ditto for the other files.

Contributor

Should we consider adding another package under ArgmaxCore, like ArgmaxCore/CoreML?

///
/// Downloads only the files matching the configured component variants.
/// Files are cached locally by the Hub library.
open class func download(
Contributor

Should we decouple model download from TTSKit? ArgmaxCore could provide a downloader for this.

Contributor Author

Yep, I have some TODOs relating to this.

// Copyright © 2026 Argmax, Inc. All rights reserved.

import Accelerate
@_exported import ArgmaxCore
Contributor

why @_exported?

)

XCTAssertGreaterThan(result.audio.count, 0, "Audio samples should be non-empty")
XCTAssertGreaterThan(result.audioDuration, 1.0, "Expect at least 1s of speech")
Contributor

Will a fixed seed guarantee that the audio length is always deterministic?
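
As background for this question, the sketch below shows what a seedable generator guarantees on its own: the same seed yields the same sequence of sampled indices. It is an illustration only. SplitMix64 and sampleIndices are stand-ins chosen here, not TTSKit's sampler. Whether that carries over to identical audio length also depends on the model outputs being identical across runs and compute units, which this sketch does not address.

```swift
/// SplitMix64: a small deterministic generator used here as a stand-in
/// for whatever seedable RNG the library actually uses.
struct SplitMix64: RandomNumberGenerator {
    private var state: UInt64

    init(seed: UInt64) {
        state = seed
    }

    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

/// Hypothetical sampling loop: draws `count` indices uniformly from
/// `0..<vocabularySize` using a generator created from `seed`.
func sampleIndices(seed: UInt64, count: Int, vocabularySize: Int) -> [Int] {
    var rng = SplitMix64(seed: seed)
    return (0..<count).map { _ in
        Int.random(in: 0..<vocabularySize, using: &rng)
    }
}
```

A test that asserts an exact audio duration would therefore be safe only if both the sampler and the model predictions are reproducible on the test machine; a lower bound, as in the assertion under review, is the more forgiving check.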

// For licensing see accompanying LICENSE.md file.
// Copyright © 2024 Argmax, Inc. All rights reserved.

import ArgmaxCore
Contributor

I think we should break these tests down into isolated, per-class tests.

E.g. 1: TTSKitTest.swift injects a Config with mocked components and verifies that
TTSKitTest.generateSpeech interacts with the components correctly, creates the expected tasks, etc.

E.g. 2: Qwen3TTSGenerateTaskTest.swift injects mocked components and verifies that run interacts with them correctly.
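
A minimal sketch of the mocked-component approach suggested above, under stated assumptions: the two protocols are simplified stand-ins named after the TextProjecting and SpeechDecoding roles from the PR description, and their method signatures, the pipeline type, and the mocks are all hypothetical. The real protocol requirements in TTSKit may look quite different.

```swift
// Stand-ins for two component roles; real TTSKit requirements may differ.
protocol TextProjectingSketch {
    func project(_ text: String) -> [Float]
}

protocol SpeechDecodingSketch {
    func decode(_ embedding: [Float]) -> [Float]
}

// Mock that records every input it receives.
final class MockTextProjector: TextProjectingSketch {
    private(set) var receivedTexts: [String] = []

    func project(_ text: String) -> [Float] {
        receivedTexts.append(text)
        return [Float(text.count)]
    }
}

// Mock that counts calls and returns a predictable transformation.
final class MockSpeechDecoder: SpeechDecodingSketch {
    private(set) var callCount = 0

    func decode(_ embedding: [Float]) -> [Float] {
        callCount += 1
        return embedding.map { $0 * 2 }
    }
}

// Hypothetical unit under test: wires the injected components together.
struct SpeechPipelineSketch {
    let projector: any TextProjectingSketch
    let decoder: any SpeechDecodingSketch

    func generate(_ text: String) -> [Float] {
        decoder.decode(projector.project(text))
    }
}

// Inject the mocks and run the unit under test once.
let projector = MockTextProjector()
let decoder = MockSpeechDecoder()
let pipeline = SpeechPipelineSketch(projector: projector, decoder: decoder)
let audio = pipeline.generate("hello")
```

Because the mocks need no model files, a test built this way can assert on the recorded calls (projector.receivedTexts, decoder.callCount) without downloading or loading Core ML models.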

/// owns its own sampler (derived seed) so concurrent tasks don't share RNG state.
/// Model components are shared read-only references - `MLModel.prediction()` is
/// thread-safe. The class is `@unchecked Sendable` to permit `open` subclassing.
open class TTSGenerateTask: @unchecked Sendable, TTSGenerating {
Contributor

Should the class be renamed to Qwen3TTSGenerateTask? Ditto for the other files under Qwen3TTS.

argmaxinc deleted a comment from chen-argmax on Feb 19, 2026
