50 commits
839539c
Update voice_encoder.py
BBC-Esq Jun 7, 2025
3ab1a93
Update tts.py
BBC-Esq Jun 7, 2025
cf207d7
Update s3tokenizer.py
BBC-Esq Jun 7, 2025
c475c4a
Update vc.py
BBC-Esq Jun 7, 2025
003daa8
Update melspec.py
BBC-Esq Jun 7, 2025
1e37b35
Update s3gen.py
BBC-Esq Jun 7, 2025
65b718d
Update flow.py
BBC-Esq Jun 7, 2025
530e9d0
Update flow_matching.py
BBC-Esq Jun 7, 2025
6a48028
Add files via upload
BBC-Esq Jun 7, 2025
7c637d3
Update decoder.py
BBC-Esq Jun 7, 2025
7c0f81f
Update mel.py
BBC-Esq Jun 7, 2025
a1127fa
Update s3tokenizer.py
BBC-Esq Jun 7, 2025
6f8ec3d
Update mel.py
BBC-Esq Jun 7, 2025
f31505c
Update melspec.py
BBC-Esq Jun 7, 2025
ce3328d
Merge pull request #1 from BBC-Esq/remove_librosa
BBC-Esq Jun 7, 2025
9ea12b2
Merge pull request #2 from BBC-Esq/remove_omegaconf
BBC-Esq Jun 7, 2025
9579d3d
Merge pull request #3 from BBC-Esq/remove_conformer
BBC-Esq Jun 7, 2025
5a0cd8e
Update tts.py
BBC-Esq Jun 7, 2025
af26feb
Update vc.py
BBC-Esq Jun 7, 2025
8dda84d
Update pyproject.toml
BBC-Esq Jun 7, 2025
740477a
Update pyproject.toml
BBC-Esq Jun 11, 2025
f3a7591
Update pyproject.toml
BBC-Esq Jun 11, 2025
8b5ff4f
Update README.md
BBC-Esq Jun 11, 2025
331f1c7
Add files via upload
BBC-Esq Dec 25, 2025
937d58b
Add files via upload
BBC-Esq Dec 25, 2025
f06c013
Delete src/chatterbox/models/s3gen/matcha/text_encoder.py
BBC-Esq Dec 25, 2025
420d370
Add files via upload
BBC-Esq Dec 25, 2025
dd05c5c
Add files via upload
BBC-Esq Dec 25, 2025
4d675c0
Add files via upload
BBC-Esq Dec 25, 2025
eaab947
Add files via upload
BBC-Esq Dec 25, 2025
be8d54f
Add files via upload
BBC-Esq Dec 25, 2025
d359d0f
Add files via upload
BBC-Esq Dec 25, 2025
ed48bfc
Add files via upload
BBC-Esq Dec 25, 2025
6789be7
Add files via upload
BBC-Esq Dec 25, 2025
b636efc
Add files via upload
BBC-Esq Dec 25, 2025
7bc80a6
Add files via upload
BBC-Esq Dec 25, 2025
27fe97f
Bump version to 0.1.6 and update dependencies
BBC-Esq Dec 25, 2025
777c377
Ensure .wav files are ignored in all directories
BBC-Esq Dec 25, 2025
aeb070b
Enhance code readability with comments and formatting
BBC-Esq Dec 25, 2025
acc231a
Refactor audio handling and punctuation normalization
BBC-Esq Dec 25, 2025
f41d8ec
Add files via upload
BBC-Esq Dec 25, 2025
66511f7
Add files via upload
BBC-Esq Dec 25, 2025
d47f98f
Add files via upload
BBC-Esq Dec 25, 2025
ad14cb2
Add files via upload
BBC-Esq Dec 25, 2025
c330f52
Add files via upload
BBC-Esq Dec 25, 2025
6bd6832
Revise README with soundfile addition and usage details
BBC-Esq Dec 25, 2025
0e6a227
Add files via upload
BBC-Esq Dec 25, 2025
930c9b4
Clean up pyproject.toml to reflect actual dependencies
BBC-Esq Mar 25, 2026
ee66bb8
Pin transformers to >=4.46.0,<5.0.0 (tested range)
BBC-Esq Mar 25, 2026
4c454a5
Widen transformers version range (tested with 4.46.3 through 5.3.0)
BBC-Esq Mar 25, 2026
112 changes: 23 additions & 89 deletions README.md
@@ -1,103 +1,37 @@
## This is a light version of chatterbox that removes:
* ```librosa``` and all its required dependencies, using ```torchaudio``` and ```scipy``` instead
* ```perth``` because who the heck needs watermarking anyways
* ```omegaconf```, replaced with standard Python
* others I forget, but everything works

<img width="1200" alt="cb-big2" src="https://github.com/user-attachments/assets/bd8c5f03-e91d-4ee5-b680-57355da204d1" />

# Chatterbox TTS

[![Alt Text](https://img.shields.io/badge/listen-demo_samples-blue)](https://resemble-ai.github.io/chatterbox_demopage/)
[![Alt Text](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/ResembleAI/Chatterbox)
[![Alt Text](https://static-public.podonos.com/badges/insight-on-pdns-sm-dark.svg)](https://podonos.com/resembleai/chatterbox)
[![Discord](https://img.shields.io/discord/1377773249798344776?label=join%20discord&logo=discord&style=flat)](https://discord.gg/rJq9cRJBJ6)

_Made with ♥️ by <a href="https://resemble.ai" target="_blank"><img width="100" alt="resemble-logo-horizontal" src="https://github.com/user-attachments/assets/35cf756b-3506-4943-9c72-c05ddfa4e525" /></a>

We're excited to introduce Chatterbox, [Resemble AI's](https://resemble.ai) first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.

Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support **emotion exaggeration control**, a powerful feature that makes your voices stand out. Try it now on our [Hugging Face Gradio app.](https://huggingface.co/spaces/ResembleAI/Chatterbox)

If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service (<a href="https://resemble.ai">link</a>). It delivers reliable performance with ultra-low latency of sub 200ms—ideal for production use in agents, applications, or interactive media.

# Key Details
- SoTA zeroshot TTS
- 0.5B Llama backbone
- Unique exaggeration/intensity control
- Ultra-stable with alignment-informed inference
- Trained on 0.5M hours of cleaned data
- Watermarked outputs
- Easy voice conversion script
- [Outperforms ElevenLabs](https://podonos.com/resembleai/chatterbox)

# Tips
- **General Use (TTS and Voice Agents):**
- The default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most prompts.
- If the reference speaker has a fast speaking style, lowering `cfg_weight` to around `0.3` can improve pacing.

- **Expressive or Dramatic Speech:**
- Try lower `cfg_weight` values (e.g. `~0.3`) and increase `exaggeration` to around `0.7` or higher.
- Higher `exaggeration` tends to speed up speech; reducing `cfg_weight` helps compensate with slower, more deliberate pacing.

However, I did add ```soundfile``` because I like it.

# Installation
>Go through these steps in order.
```
python -m venv .
```
```
.\Scripts\activate
```
```
python.exe -m pip install --upgrade pip
```
```
pip install uv
```

Next, make sure appropriate versions of torch, torchaudio, and CUDA are installed.

```
uv pip install -r requirements.txt
```
```
pip install git+https://github.com/BBC-Esq/chatterbox-light.git --no-deps
```

# Usage
```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("test-1.wav", wav, model.sr)

# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)
```
See `example_tts.py` and `example_vc.py` for more examples.

# Disclaimer
Don't use this model to do bad things. Prompts are sourced from freely available data on the internet.
Enjoy!

Removed from the original README:

# Installation
```
pip install chatterbox-tts
```

# Acknowledgements
- [Cosyvoice](https://github.com/FunAudioLLM/CosyVoice)
- [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)
- [HiFT-GAN](https://github.com/yl4579/HiFTNet)
- [Llama 3](https://github.com/meta-llama/llama3)
- [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)

# Built-in PerTh Watermarking for Responsible AI

Every audio file generated by Chatterbox includes [Resemble AI's Perth (Perceptual Threshold) Watermarker](https://github.com/resemble-ai/perth) - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

## Watermark extraction

You can look for the watermark using the following script.

```python
import perth
import librosa

AUDIO_PATH = "YOUR_FILE.wav"

# Load the watermarked audio
watermarked_audio, sr = librosa.load(AUDIO_PATH, sr=None)

# Initialize watermarker (same as used for embedding)
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")
# Output: 0.0 (no watermark) or 1.0 (watermarked)
```

# Official Discord

👋 Join us on [Discord](https://discord.gg/rJq9cRJBJ6) and let's build something awesome together!
31 changes: 16 additions & 15 deletions pyproject.toml
@@ -1,31 +1,32 @@
[project]
name = "chatterbox-tts"
version = "0.1.2"
version = "0.1.6"
description = "Chatterbox: Open Source TTS and Voice Conversion by Resemble AI"
readme = "README.md"
requires-python = ">=3.8"
requires-python = ">=3.10"
license = {file = "LICENSE"}
authors = [
{name = "resemble-ai", email = "engineering@resemble.ai"}
]
dependencies = [
"numpy~=1.26.0",
"resampy==0.4.3",
"librosa==0.11.0",
"torch==2.6.0",
"torchaudio==2.6.0",
"transformers==4.46.3",
"diffusers==0.29.0",
"resemble-perth==1.0.1",
"omegaconf==2.3.0",
"conformer==0.3.2",
"safetensors==0.5.3"
"numpy",
"torch",
"torchaudio",
"transformers>=4.46.0",
"diffusers",
"safetensors",
"s3tokenizer",
"einops",
"scipy",
"pyloudnorm",
"soundfile",
"tokenizers",
"tqdm",
]

[project.urls]
Homepage = "https://github.com/resemble-ai/chatterbox"
Repository = "https://github.com/resemble-ai/chatterbox"
Homepage = "https://github.com/BBC-Esq/chatterbox-light"
Repository = "https://github.com/BBC-Esq/chatterbox-light"

[build-system]
requires = ["setuptools>=61.0"]
10 changes: 10 additions & 0 deletions requirements.txt
@@ -0,0 +1,10 @@
s3tokenizer
transformers
diffusers
safetensors
pyloudnorm
numpy
hf_xet
soundfile
# remember to pip install this below after requirements.txt
# git+https://github.com/BBC-Esq/chatterbox-light.git --no-deps
13 changes: 11 additions & 2 deletions src/chatterbox/__init__.py
@@ -1,2 +1,11 @@
from .tts import ChatterboxTTS
from .vc import ChatterboxVC
try:
from importlib.metadata import version
except ImportError:
from importlib_metadata import version # For Python <3.8

__version__ = version("chatterbox-tts")


from .tts import ChatterboxTTS
from .vc import ChatterboxVC
from .mtl_tts import ChatterboxMultilingualTTS, SUPPORTED_LANGUAGES
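Since `pyproject.toml` now sets `requires-python = ">=3.10"`, the `importlib_metadata` fallback in this file is only reachable on interpreters the package no longer supports. A minimal sketch of the same version-lookup pattern with explicit error handling (the helper name `safe_version` and its default value are illustrative, not part of this diff):

```python
from importlib.metadata import PackageNotFoundError, version

def safe_version(dist_name: str, default: str = "0.0.0") -> str:
    # Look up the installed distribution's version; fall back to a
    # placeholder instead of raising when the package isn't installed.
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return default

# A distribution name that is certainly not installed hits the fallback.
print(safe_version("no-such-distribution-xyz"))  # → 0.0.0
```

On 3.10+, `importlib.metadata` is always available, so the bare `version("chatterbox-tts")` call in the diff works without the try/except import dance.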
Empty file.
10 changes: 10 additions & 0 deletions src/chatterbox/models/s3gen/configs.py
@@ -0,0 +1,10 @@
from ..utils import AttrDict

CFM_PARAMS = AttrDict({
"sigma_min": 1e-06,
"solver": "euler",
"t_scheduler": "cosine",
"training_cfg_rate": 0.2,
"inference_cfg_rate": 0.7,
"reg_loss_type": "l1"
})
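`AttrDict` is imported from `..utils` and isn't shown in this diff; the sketch below assumes the usual dict-subclass-with-attribute-access implementation, which is all `CFM_PARAMS` needs to replace the former omegaconf config object:

```python
class AttrDict(dict):
    # Sketch of a dict allowing both d["key"] and d.key access
    # (assumption: the real class in chatterbox's utils behaves like this).
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc

    def __setattr__(self, name, value):
        self[name] = value

CFM_PARAMS = AttrDict({
    "sigma_min": 1e-06,
    "solver": "euler",
    "t_scheduler": "cosine",
    "training_cfg_rate": 0.2,
    "inference_cfg_rate": 0.7,
    "reg_loss_type": "l1",
})

print(CFM_PARAMS.solver)        # → euler
print(CFM_PARAMS["sigma_min"])  # → 1e-06
```

Because it is a plain `dict` subclass, the config stays pickleable and serializable without pulling in omegaconf.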
1 change: 1 addition & 0 deletions src/chatterbox/models/s3gen/const.py
@@ -1 +1,2 @@
S3GEN_SR = 24000
S3GEN_SIL = 4299
69 changes: 22 additions & 47 deletions src/chatterbox/models/s3gen/decoder.py
@@ -1,16 +1,3 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Zhihao Du)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
import torch.nn.functional as F
@@ -20,15 +7,13 @@
from .matcha.decoder import SinusoidalPosEmb, Block1D, ResnetBlock1D, Downsample1D, \
TimestepEmbedding, Upsample1D
from .matcha.transformer import BasicTransformerBlock
from .utils.intmeanflow import get_intmeanflow_time_mixer


def mask_to_bias(mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
assert mask.dtype == torch.bool
assert dtype in [torch.float32, torch.bfloat16, torch.float16]
mask = mask.to(dtype)
# attention mask bias
# NOTE(Mddct): torch.finfo jit issues
# chunk_masks = (1.0 - chunk_masks) * torch.finfo(dtype).min
mask = (1.0 - mask) * -1.0e+10
return mask
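The `mask_to_bias` helper above turns a boolean attention mask into an additive bias: valid positions become 0.0 and masked positions a large negative number, so adding the bias to attention logits before softmax effectively zeroes out masked positions. A pure-Python sketch of the same arithmetic (list-based here to stay dependency-free; the real function operates on torch tensors):

```python
def mask_to_bias(mask_row):
    # keep=True  -> (1 - 1.0) * -1e10 = 0.0    (position attends normally)
    # keep=False -> (1 - 0.0) * -1e10 = -1e10  (position is suppressed)
    return [(1.0 - float(keep)) * -1.0e10 for keep in mask_row]

bias = mask_to_bias([True, True, False])
print(bias)  # → [-0.0, -0.0, -10000000000.0]
```

The hardcoded `-1.0e+10` sidesteps the `torch.finfo` JIT issue the original code commented on, at the cost of not being exactly `finfo(dtype).min`.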

@@ -95,8 +80,6 @@ def forward(self, x: torch.Tensor):
x = F.pad(x, self.causal_padding)
x = super(CausalConv1d, self).forward(x)
return x


class ConditionalDecoder(nn.Module):
def __init__(
self,
@@ -110,13 +93,11 @@ def __init__(
num_mid_blocks=12,
num_heads=8,
act_fn="gelu",
meanflow=False,
):
"""
This decoder requires an input with the same shape as the target. So, if your text content
is shorter or longer than the outputs, please re-sample it before feeding it to the decoder.
"""
super().__init__()
channels = tuple(channels)
self.meanflow = meanflow
self.in_channels = in_channels
self.out_channels = out_channels
self.causal = causal
@@ -127,15 +108,15 @@ def __init__(
time_embed_dim=time_embed_dim,
act_fn="silu",
)

self.down_blocks = nn.ModuleList([])
self.mid_blocks = nn.ModuleList([])
self.up_blocks = nn.ModuleList([])

# NOTE jrm: `static_chunk_size` is missing?
self.static_chunk_size = 0

output_channel = in_channels
for i in range(len(channels)): # pylint: disable=consider-using-enumerate
for i in range(len(channels)):
input_channel = output_channel
output_channel = channels[i]
is_last = i == len(channels) - 1
@@ -215,6 +196,14 @@ def __init__(
self.final_block = CausalBlock1D(channels[-1], channels[-1]) if self.causal else Block1D(channels[-1], channels[-1])
self.final_proj = nn.Conv1d(channels[-1], self.out_channels, 1)
self.initialize_weights()
self.time_embed_mixer = None
if self.meanflow:
self.time_embed_mixer = get_intmeanflow_time_mixer(time_embed_dim)


@property
def dtype(self):
return self.final_proj.weight.dtype

def initialize_weights(self):
for m in self.modules():
@@ -230,27 +219,16 @@ def initialize_weights(self):
if m.bias is not None:
nn.init.constant_(m.bias, 0)

def forward(self, x, mask, mu, t, spks=None, cond=None):
"""Forward pass of the UNet1DConditional model.

Args:
x (torch.Tensor): shape (batch_size, in_channels, time)
mask (_type_): shape (batch_size, 1, time)
t (_type_): shape (batch_size)
spks (_type_, optional): shape: (batch_size, condition_channels). Defaults to None.
cond (_type_, optional): placeholder for future use. Defaults to None.

Raises:
ValueError: _description_
ValueError: _description_

Returns:
_type_: _description_
"""

def forward(self, x, mask, mu, t, spks=None, cond=None, r=None):
t = self.time_embeddings(t).to(t.dtype)
t = self.time_mlp(t)

if self.meanflow:
r = self.time_embeddings(r).to(t.dtype)
r = self.time_mlp(r)
concat_embed = torch.cat([t, r], dim=1)
t = self.time_embed_mixer(concat_embed)

x = pack([x, mu], "b * t")[0]

if spks is not None:
@@ -265,7 +243,6 @@ def forward(self, x, mask, mu, t, spks=None, cond=None):
mask_down = masks[-1]
x = resnet(x, mask_down, t)
x = rearrange(x, "b c t -> b t c").contiguous()
# attn_mask = torch.matmul(mask_down.transpose(1, 2).contiguous(), mask_down)
attn_mask = add_optional_chunk_mask(x, mask_down.bool(), False, False, 0, self.static_chunk_size, -1)
attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
for transformer_block in transformer_blocks:
@@ -275,7 +252,7 @@ def forward(self, x, mask, mu, t, spks=None, cond=None):
timestep=t,
)
x = rearrange(x, "b t c -> b c t").contiguous()
hiddens.append(x) # Save hidden states for skip connections
hiddens.append(x)
x = downsample(x * mask_down)
masks.append(mask_down[:, :, ::2])
masks = masks[:-1]
@@ -284,7 +261,6 @@ def forward(self, x, mask, mu, t, spks=None, cond=None):
for resnet, transformer_blocks in self.mid_blocks:
x = resnet(x, mask_mid, t)
x = rearrange(x, "b c t -> b t c").contiguous()
# attn_mask = torch.matmul(mask_mid.transpose(1, 2).contiguous(), mask_mid)
attn_mask = add_optional_chunk_mask(x, mask_mid.bool(), False, False, 0, self.static_chunk_size, -1)
attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
for transformer_block in transformer_blocks:
@@ -301,7 +277,6 @@ def forward(self, x, mask, mu, t, spks=None, cond=None):
x = pack([x[:, :, :skip.shape[-1]], skip], "b * t")[0]
x = resnet(x, mask_up, t)
x = rearrange(x, "b c t -> b t c").contiguous()
# attn_mask = torch.matmul(mask_up.transpose(1, 2).contiguous(), mask_up)
attn_mask = add_optional_chunk_mask(x, mask_up.bool(), False, False, 0, self.static_chunk_size, -1)
attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
for transformer_block in transformer_blocks:
@@ -314,4 +289,4 @@ def forward(self, x, mask, mu, t, spks=None, cond=None):
x = upsample(x * mask_up)
x = self.final_block(x, mask_up)
output = self.final_proj(x * mask_up)
return output * mask
return output * mask
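In the new `meanflow` branch of `forward`, two timestep embeddings (`t` and the added `r`) are concatenated and mapped back to a single embedding by `time_embed_mixer`. `get_intmeanflow_time_mixer` itself isn't shown in this diff; the sketch below assumes it is a simple learned projection from `2 * time_embed_dim` down to `time_embed_dim` (numpy stands in for the torch module, and the scaling choice is illustrative):

```python
import numpy as np

def get_intmeanflow_time_mixer(time_embed_dim, seed=0):
    # Hypothetical mixer: one linear map (2*dim -> dim) applied to the
    # concatenated [t, r] embeddings, mirroring time_embed_mixer's role.
    rng = np.random.default_rng(seed)
    weight = rng.standard_normal((2 * time_embed_dim, time_embed_dim))
    weight /= np.sqrt(2 * time_embed_dim)  # keep output variance near 1

    def mixer(concat_embed):  # (batch, 2*dim) -> (batch, dim)
        return concat_embed @ weight

    return mixer

dim = 8
t_embed = np.ones((2, dim))   # stands in for self.time_mlp(t)
r_embed = np.zeros((2, dim))  # stands in for self.time_mlp(r)
mixer = get_intmeanflow_time_mixer(dim)
mixed = mixer(np.concatenate([t_embed, r_embed], axis=1))
print(mixed.shape)  # → (2, 8)
```

Whatever the real mixer's architecture, the shape contract is the part the `forward` code depends on: `torch.cat([t, r], dim=1)` doubles the embedding width, and the mixer restores it before `t` is reused downstream.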
13 changes: 0 additions & 13 deletions src/chatterbox/models/s3gen/f0_predictor.py
@@ -1,16 +1,3 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Kai Hu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import weight_norm