50 commits
839539c
Update voice_encoder.py
BBC-Esq Jun 7, 2025
3ab1a93
Update tts.py
BBC-Esq Jun 7, 2025
cf207d7
Update s3tokenizer.py
BBC-Esq Jun 7, 2025
c475c4a
Update vc.py
BBC-Esq Jun 7, 2025
003daa8
Update melspec.py
BBC-Esq Jun 7, 2025
1e37b35
Update s3gen.py
BBC-Esq Jun 7, 2025
65b718d
Update flow.py
BBC-Esq Jun 7, 2025
530e9d0
Update flow_matching.py
BBC-Esq Jun 7, 2025
6a48028
Add files via upload
BBC-Esq Jun 7, 2025
7c637d3
Update decoder.py
BBC-Esq Jun 7, 2025
7c0f81f
Update mel.py
BBC-Esq Jun 7, 2025
a1127fa
Update s3tokenizer.py
BBC-Esq Jun 7, 2025
6f8ec3d
Update mel.py
BBC-Esq Jun 7, 2025
f31505c
Update melspec.py
BBC-Esq Jun 7, 2025
ce3328d
Merge pull request #1 from BBC-Esq/remove_librosa
BBC-Esq Jun 7, 2025
9ea12b2
Merge pull request #2 from BBC-Esq/remove_omegaconf
BBC-Esq Jun 7, 2025
9579d3d
Merge pull request #3 from BBC-Esq/remove_conformer
BBC-Esq Jun 7, 2025
5a0cd8e
Update tts.py
BBC-Esq Jun 7, 2025
af26feb
Update vc.py
BBC-Esq Jun 7, 2025
8dda84d
Update pyproject.toml
BBC-Esq Jun 7, 2025
740477a
Update pyproject.toml
BBC-Esq Jun 11, 2025
f3a7591
Update pyproject.toml
BBC-Esq Jun 11, 2025
8b5ff4f
Update README.md
BBC-Esq Jun 11, 2025
331f1c7
Add files via upload
BBC-Esq Dec 25, 2025
937d58b
Add files via upload
BBC-Esq Dec 25, 2025
f06c013
Delete src/chatterbox/models/s3gen/matcha/text_encoder.py
BBC-Esq Dec 25, 2025
420d370
Add files via upload
BBC-Esq Dec 25, 2025
dd05c5c
Add files via upload
BBC-Esq Dec 25, 2025
4d675c0
Add files via upload
BBC-Esq Dec 25, 2025
eaab947
Add files via upload
BBC-Esq Dec 25, 2025
be8d54f
Add files via upload
BBC-Esq Dec 25, 2025
d359d0f
Add files via upload
BBC-Esq Dec 25, 2025
ed48bfc
Add files via upload
BBC-Esq Dec 25, 2025
6789be7
Add files via upload
BBC-Esq Dec 25, 2025
b636efc
Add files via upload
BBC-Esq Dec 25, 2025
7bc80a6
Add files via upload
BBC-Esq Dec 25, 2025
27fe97f
Bump version to 0.1.6 and update dependencies
BBC-Esq Dec 25, 2025
777c377
Ensure .wav files are ignored in all directories
BBC-Esq Dec 25, 2025
aeb070b
Enhance code readability with comments and formatting
BBC-Esq Dec 25, 2025
acc231a
Refactor audio handling and punctuation normalization
BBC-Esq Dec 25, 2025
f41d8ec
Add files via upload
BBC-Esq Dec 25, 2025
66511f7
Add files via upload
BBC-Esq Dec 25, 2025
d47f98f
Add files via upload
BBC-Esq Dec 25, 2025
ad14cb2
Add files via upload
BBC-Esq Dec 25, 2025
c330f52
Add files via upload
BBC-Esq Dec 25, 2025
6bd6832
Revise README with soundfile addition and usage details
BBC-Esq Dec 25, 2025
0e6a227
Add files via upload
BBC-Esq Dec 25, 2025
930c9b4
Clean up pyproject.toml to reflect actual dependencies
BBC-Esq Mar 25, 2026
ee66bb8
Pin transformers to >=4.46.0,<5.0.0 (tested range)
BBC-Esq Mar 25, 2026
4c454a5
Widen transformers version range (tested with 4.46.3 through 5.3.0)
BBC-Esq Mar 25, 2026
112 changes: 23 additions & 89 deletions README.md
@@ -1,103 +1,37 @@
## This is a light version of chatterbox that removes:
* ```librosa``` and all its required dependencies, using ```torchaudio``` and ```scipy``` instead
* ```perth``` because who the heck needs watermarking anyways
* ```omegaconf```, replaced with standard Python
* others I forget, but everything works

<img width="1200" alt="cb-big2" src="https://github.com/user-attachments/assets/bd8c5f03-e91d-4ee5-b680-57355da204d1" />

# Chatterbox TTS

[![Alt Text](https://img.shields.io/badge/listen-demo_samples-blue)](https://resemble-ai.github.io/chatterbox_demopage/)
[![Alt Text](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/ResembleAI/Chatterbox)
[![Alt Text](https://static-public.podonos.com/badges/insight-on-pdns-sm-dark.svg)](https://podonos.com/resembleai/chatterbox)
[![Discord](https://img.shields.io/discord/1377773249798344776?label=join%20discord&logo=discord&style=flat)](https://discord.gg/rJq9cRJBJ6)

_Made with ♥️ by <a href="https://resemble.ai" target="_blank"><img width="100" alt="resemble-logo-horizontal" src="https://github.com/user-attachments/assets/35cf756b-3506-4943-9c72-c05ddfa4e525" /></a>

We're excited to introduce Chatterbox, [Resemble AI's](https://resemble.ai) first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.

Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support **emotion exaggeration control**, a powerful feature that makes your voices stand out. Try it now on our [Hugging Face Gradio app.](https://huggingface.co/spaces/ResembleAI/Chatterbox)

If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service (<a href="https://resemble.ai">link</a>). It delivers reliable performance with ultra-low latency of sub 200ms—ideal for production use in agents, applications, or interactive media.

# Key Details
- SoTA zeroshot TTS
- 0.5B Llama backbone
- Unique exaggeration/intensity control
- Ultra-stable with alignment-informed inference
- Trained on 0.5M hours of cleaned data
- Watermarked outputs
- Easy voice conversion script
- [Outperforms ElevenLabs](https://podonos.com/resembleai/chatterbox)

# Tips
- **General Use (TTS and Voice Agents):**
- The default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts.
- If the reference speaker has a fast speaking style, lowering `cfg_weight` to around `0.3` can improve pacing.

- **Expressive or Dramatic Speech:**
- Try lower `cfg_weight` values (e.g. `~0.3`) and increase `exaggeration` to around `0.7` or higher.
- Higher `exaggeration` tends to speed up speech; reducing `cfg_weight` helps compensate with slower, more deliberate pacing.

However, I did add ```soundfile``` because I like it.

# Installation
>Go through these steps in order.
```
python -m venv .
```
```
.\Scripts\activate
```
```
python.exe -m pip install --upgrade pip
```
```
pip install uv
```

Next, make sure appropriate versions of torch, torchaudio, and CUDA are installed.

```
uv pip install -r requirements.txt
```
```
pip install git+https://github.com/BBC-Esq/chatterbox-light.git --no-deps
```

# Usage
```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("test-1.wav", wav, model.sr)

# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)
```
See `example_tts.py` and `example_vc.py` for more examples.

# Disclaimer
Don't use this model to do bad things. Prompts are sourced from freely available data on the internet.
Enjoy!

Removed from the original README:

# Installation
```
pip install chatterbox-tts
```

# Acknowledgements
- [Cosyvoice](https://github.com/FunAudioLLM/CosyVoice)
- [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)
- [HiFT-GAN](https://github.com/yl4579/HiFTNet)
- [Llama 3](https://github.com/meta-llama/llama3)
- [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)

# Built-in PerTh Watermarking for Responsible AI

Every audio file generated by Chatterbox includes [Resemble AI's Perth (Perceptual Threshold) Watermarker](https://github.com/resemble-ai/perth) - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

## Watermark extraction

You can look for the watermark using the following script.

```python
import perth
import librosa

AUDIO_PATH = "YOUR_FILE.wav"

# Load the watermarked audio
watermarked_audio, sr = librosa.load(AUDIO_PATH, sr=None)

# Initialize watermarker (same as used for embedding)
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")
# Output: 0.0 (no watermark) or 1.0 (watermarked)
```

# Official Discord

👋 Join us on [Discord](https://discord.gg/rJq9cRJBJ6) and let's build something awesome together!
31 changes: 16 additions & 15 deletions pyproject.toml
@@ -1,31 +1,32 @@
[project]
name = "chatterbox-tts"
version = "0.1.2"
version = "0.1.6"
description = "Chatterbox: Open Source TTS and Voice Conversion by Resemble AI"
readme = "README.md"
requires-python = ">=3.8"
requires-python = ">=3.10"
license = {file = "LICENSE"}
authors = [
{name = "resemble-ai", email = "engineering@resemble.ai"}
]
dependencies = [
"numpy~=1.26.0",
"resampy==0.4.3",
"librosa==0.11.0",
"torch==2.6.0",
"torchaudio==2.6.0",
"transformers==4.46.3",
"diffusers==0.29.0",
"resemble-perth==1.0.1",
"omegaconf==2.3.0",
"conformer==0.3.2",
"safetensors==0.5.3"
"numpy",
"torch",
"torchaudio",
"transformers>=4.46.0",
"diffusers",
"safetensors",
"s3tokenizer",
"einops",
"scipy",
"pyloudnorm",
"soundfile",
"tokenizers",
"tqdm",
]

[project.urls]
Homepage = "https://github.com/resemble-ai/chatterbox"
Repository = "https://github.com/resemble-ai/chatterbox"
Homepage = "https://github.com/BBC-Esq/chatterbox-light"
Repository = "https://github.com/BBC-Esq/chatterbox-light"

[build-system]
requires = ["setuptools>=61.0"]
10 changes: 10 additions & 0 deletions requirements.txt
@@ -0,0 +1,10 @@
s3tokenizer
transformers
diffusers
safetensors
pyloudnorm
numpy
hf_xet
soundfile
# remember to pip install this below after requirements.txt
# git+https://github.com/BBC-Esq/chatterbox-light.git --no-deps
13 changes: 11 additions & 2 deletions src/chatterbox/__init__.py
@@ -1,2 +1,11 @@
from .tts import ChatterboxTTS
from .vc import ChatterboxVC
try:
from importlib.metadata import version
except ImportError:
from importlib_metadata import version # For Python <3.8

__version__ = version("chatterbox-tts")


from .tts import ChatterboxTTS
from .vc import ChatterboxVC
from .mtl_tts import ChatterboxMultilingualTTS, SUPPORTED_LANGUAGES
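Since `pyproject.toml` now sets `requires-python = ">=3.10"`, the `importlib_metadata` fallback in this file is only reachable on interpreters the package no longer supports. A minimal sketch of the same version-lookup pattern with explicit error handling (the helper name `safe_version` and its default value are illustrative, not part of this diff):

```python
from importlib.metadata import PackageNotFoundError, version

def safe_version(dist_name: str, default: str = "0.0.0") -> str:
    # Look up the installed distribution's version; fall back to a
    # placeholder instead of raising when the package isn't installed.
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return default

# A distribution name that is certainly not installed hits the fallback.
print(safe_version("no-such-distribution-xyz"))  # → 0.0.0
```

On 3.10+, `importlib.metadata` is always available, so the bare `version("chatterbox-tts")` call in the diff works without the try/except import dance.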
Empty file.
10 changes: 10 additions & 0 deletions src/chatterbox/models/s3gen/configs.py
@@ -0,0 +1,10 @@
from ..utils import AttrDict

CFM_PARAMS = AttrDict({
"sigma_min": 1e-06,
"solver": "euler",
"t_scheduler": "cosine",
"training_cfg_rate": 0.2,
"inference_cfg_rate": 0.7,
"reg_loss_type": "l1"
})
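`AttrDict` is imported from `..utils` and isn't shown in this diff; the sketch below assumes the usual dict-subclass-with-attribute-access implementation, which is all `CFM_PARAMS` needs to replace the former omegaconf config object:

```python
class AttrDict(dict):
    # Sketch of a dict allowing both d["key"] and d.key access
    # (assumption: the real class in chatterbox's utils behaves like this).
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc

    def __setattr__(self, name, value):
        self[name] = value

CFM_PARAMS = AttrDict({
    "sigma_min": 1e-06,
    "solver": "euler",
    "t_scheduler": "cosine",
    "training_cfg_rate": 0.2,
    "inference_cfg_rate": 0.7,
    "reg_loss_type": "l1",
})

print(CFM_PARAMS.solver)        # → euler
print(CFM_PARAMS["sigma_min"])  # → 1e-06
```

Because it is a plain `dict` subclass, the config stays pickleable and serializable without pulling in omegaconf.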
1 change: 1 addition & 0 deletions src/chatterbox/models/s3gen/const.py
@@ -1 +1,2 @@
S3GEN_SR = 24000
S3GEN_SIL = 4299
69 changes: 22 additions & 47 deletions src/chatterbox/models/s3gen/decoder.py
@@ -1,16 +1,3 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Zhihao Du)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
import torch.nn.functional as F
@@ -20,15 +7,13 @@
from .matcha.decoder import SinusoidalPosEmb, Block1D, ResnetBlock1D, Downsample1D, \
TimestepEmbedding, Upsample1D
from .matcha.transformer import BasicTransformerBlock
from .utils.intmeanflow import get_intmeanflow_time_mixer


def mask_to_bias(mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
assert mask.dtype == torch.bool
assert dtype in [torch.float32, torch.bfloat16, torch.float16]
mask = mask.to(dtype)
# attention mask bias
# NOTE(Mddct): torch.finfo jit issues
# chunk_masks = (1.0 - chunk_masks) * torch.finfo(dtype).min
mask = (1.0 - mask) * -1.0e+10
return mask
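The `mask_to_bias` helper above turns a boolean attention mask into an additive bias: valid positions become 0.0 and masked positions a large negative number, so adding the bias to attention logits before softmax effectively zeroes out masked positions. A pure-Python sketch of the same arithmetic (list-based here to stay dependency-free; the real function operates on torch tensors):

```python
def mask_to_bias(mask_row):
    # keep=True  -> (1 - 1.0) * -1e10 = 0.0    (position attends normally)
    # keep=False -> (1 - 0.0) * -1e10 = -1e10  (position is suppressed)
    return [(1.0 - float(keep)) * -1.0e10 for keep in mask_row]

bias = mask_to_bias([True, True, False])
print(bias)  # → [-0.0, -0.0, -10000000000.0]
```

The hardcoded `-1.0e+10` sidesteps the `torch.finfo` JIT issue the original code commented on, at the cost of not being exactly `finfo(dtype).min`.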

@@ -95,8 +80,6 @@ def forward(self, x: torch.Tensor):
x = F.pad(x, self.causal_padding)
x = super(CausalConv1d, self).forward(x)
return x


class ConditionalDecoder(nn.Module):
def __init__(
self,
@@ -110,13 +93,11 @@ def __init__(
num_mid_blocks=12,
num_heads=8,
act_fn="gelu",
meanflow=False,
):
"""
This decoder requires an input with the same shape as the target. So, if your text content
is shorter or longer than the outputs, please re-sample it before feeding it to the decoder.
"""
super().__init__()
channels = tuple(channels)
self.meanflow = meanflow
self.in_channels = in_channels
self.out_channels = out_channels
self.causal = causal
@@ -127,15 +108,15 @@ def __init__(
time_embed_dim=time_embed_dim,
act_fn="silu",
)

self.down_blocks = nn.ModuleList([])
self.mid_blocks = nn.ModuleList([])
self.up_blocks = nn.ModuleList([])

# NOTE jrm: `static_chunk_size` is missing?
self.static_chunk_size = 0

output_channel = in_channels
for i in range(len(channels)): # pylint: disable=consider-using-enumerate
for i in range(len(channels)):
input_channel = output_channel
output_channel = channels[i]
is_last = i == len(channels) - 1
@@ -215,6 +196,14 @@ def __init__(
self.final_block = CausalBlock1D(channels[-1], channels[-1]) if self.causal else Block1D(channels[-1], channels[-1])
self.final_proj = nn.Conv1d(channels[-1], self.out_channels, 1)
self.initialize_weights()
self.time_embed_mixer = None
if self.meanflow:
self.time_embed_mixer = get_intmeanflow_time_mixer(time_embed_dim)


@property
def dtype(self):
return self.final_proj.weight.dtype

def initialize_weights(self):
for m in self.modules():
@@ -230,27 +219,16 @@ def initialize_weights(self):
if m.bias is not None:
nn.init.constant_(m.bias, 0)

def forward(self, x, mask, mu, t, spks=None, cond=None):
"""Forward pass of the UNet1DConditional model.

Args:
x (torch.Tensor): shape (batch_size, in_channels, time)
mask (_type_): shape (batch_size, 1, time)
t (_type_): shape (batch_size)
spks (_type_, optional): shape: (batch_size, condition_channels). Defaults to None.
cond (_type_, optional): placeholder for future use. Defaults to None.

Raises:
ValueError: _description_
ValueError: _description_

Returns:
_type_: _description_
"""

def forward(self, x, mask, mu, t, spks=None, cond=None, r=None):
t = self.time_embeddings(t).to(t.dtype)
t = self.time_mlp(t)

if self.meanflow:
r = self.time_embeddings(r).to(t.dtype)
r = self.time_mlp(r)
concat_embed = torch.cat([t, r], dim=1)
t = self.time_embed_mixer(concat_embed)

x = pack([x, mu], "b * t")[0]

if spks is not None:
@@ -265,7 +243,6 @@ def forward(self, x, mask, mu, t, spks=None, cond=None):
mask_down = masks[-1]
x = resnet(x, mask_down, t)
x = rearrange(x, "b c t -> b t c").contiguous()
# attn_mask = torch.matmul(mask_down.transpose(1, 2).contiguous(), mask_down)
attn_mask = add_optional_chunk_mask(x, mask_down.bool(), False, False, 0, self.static_chunk_size, -1)
attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
for transformer_block in transformer_blocks:
@@ -275,7 +252,7 @@ def forward(self, x, mask, mu, t, spks=None, cond=None):
timestep=t,
)
x = rearrange(x, "b t c -> b c t").contiguous()
hiddens.append(x) # Save hidden states for skip connections
hiddens.append(x)
x = downsample(x * mask_down)
masks.append(mask_down[:, :, ::2])
masks = masks[:-1]
@@ -284,7 +261,6 @@ def forward(self, x, mask, mu, t, spks=None, cond=None):
for resnet, transformer_blocks in self.mid_blocks:
x = resnet(x, mask_mid, t)
x = rearrange(x, "b c t -> b t c").contiguous()
# attn_mask = torch.matmul(mask_mid.transpose(1, 2).contiguous(), mask_mid)
attn_mask = add_optional_chunk_mask(x, mask_mid.bool(), False, False, 0, self.static_chunk_size, -1)
attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
for transformer_block in transformer_blocks:
@@ -301,7 +277,6 @@ def forward(self, x, mask, mu, t, spks=None, cond=None):
x = pack([x[:, :, :skip.shape[-1]], skip], "b * t")[0]
x = resnet(x, mask_up, t)
x = rearrange(x, "b c t -> b t c").contiguous()
# attn_mask = torch.matmul(mask_up.transpose(1, 2).contiguous(), mask_up)
attn_mask = add_optional_chunk_mask(x, mask_up.bool(), False, False, 0, self.static_chunk_size, -1)
attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
for transformer_block in transformer_blocks:
@@ -314,4 +289,4 @@ def forward(self, x, mask, mu, t, spks=None, cond=None):
x = upsample(x * mask_up)
x = self.final_block(x, mask_up)
output = self.final_proj(x * mask_up)
return output * mask
return output * mask
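In the new `meanflow` branch of `forward`, two timestep embeddings (`t` and the added `r`) are concatenated and mapped back to a single embedding by `time_embed_mixer`. `get_intmeanflow_time_mixer` itself isn't shown in this diff; the sketch below assumes it is a simple learned projection from `2 * time_embed_dim` down to `time_embed_dim` (numpy stands in for the torch module, and the scaling choice is illustrative):

```python
import numpy as np

def get_intmeanflow_time_mixer(time_embed_dim, seed=0):
    # Hypothetical mixer: one linear map (2*dim -> dim) applied to the
    # concatenated [t, r] embeddings, mirroring time_embed_mixer's role.
    rng = np.random.default_rng(seed)
    weight = rng.standard_normal((2 * time_embed_dim, time_embed_dim))
    weight /= np.sqrt(2 * time_embed_dim)  # keep output variance near 1

    def mixer(concat_embed):  # (batch, 2*dim) -> (batch, dim)
        return concat_embed @ weight

    return mixer

dim = 8
t_embed = np.ones((2, dim))   # stands in for self.time_mlp(t)
r_embed = np.zeros((2, dim))  # stands in for self.time_mlp(r)
mixer = get_intmeanflow_time_mixer(dim)
mixed = mixer(np.concatenate([t_embed, r_embed], axis=1))
print(mixed.shape)  # → (2, 8)
```

Whatever the real mixer's architecture, the shape contract is the part the `forward` code depends on: `torch.cat([t, r], dim=1)` doubles the embedding width, and the mixer restores it before `t` is reused downstream.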
13 changes: 0 additions & 13 deletions src/chatterbox/models/s3gen/f0_predictor.py
@@ -1,16 +1,3 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Kai Hu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import weight_norm