RunAndRead-Audiobook-Pipeline

License: MIT · Python 3.9+ · MLX · TTS

Related Projects: [RunAndRead-iOS] | [RunAndRead-Android]


Overview

RunAndRead-Audiobook is an open-source project aimed at generating high-quality text-to-speech (TTS) audiobooks using open-source models like Zyphra/Zonos.

The ultimate goal is to make Run & Read, the audiobook player app, sound more natural by using high-quality voices. Today the app relies on the standard voices embedded in Apple and Android devices, which are still far from perfect. Starting with Android v1.5 (6) and iOS v1.6 (18), Run & Read supports MP3 audiobooks generated with the RANDR pipeline in this repository. See instructions here.

Apps

App Store: Run & Read for Apple Devices
Google Play: Run & Read for Android

QR codes

---

Create Audiobooks with AI (RANDR format)

Generate high-quality audiobooks at home using open-source AI models! We’ve built a pipeline using MLX-AUDIO to create audiobooks in the RANDR format, optimized for playback in the Run & Read app.

Dedicated document with step-by-step instructions

Features

  • Pipeline for generating audiobooks compatible with the Run & Read app.
  • Convert EPUB to JSON for text extraction.
  • Generate audio using Zonos TTS or Kokoro-TTS (via MLX-AUDIO).
  • Clone voices from an MP3 sample.
  • Play audio clips sequentially while displaying text in the terminal.
  • Merge audio clips into one file.
  • Zyphra and Deepgram API support for cloud-based TTS.
  • Wrap produced audio and JSON files into a ZIP readable by the Run & Read app.
  • Transfer audio files to a mobile phone and play them in the Run & Read app.

Planned

  • Estimate local vs. cloud generation cost.
  • On-device TTS for Android/iOS.

Audio Samples

Here are some audiobook samples generated using RunAndRead-Audiobook with Zonos TTS voice cloning:

[Sample 1 - Alice in Wonderland]

You can find examples under the audio/pg11/ folder, and generate your own samples using the steps outlined in the Usage section below.


Dependencies & Technologies

  • Python 3.9+
  • Zyphra/Zonos (open-source TTS engine)
  • ffmpeg (audio conversion)
  • EbookLib (EPUB parsing)
  • PyAudio / playsound (for playback)
  • yt-dlp (to download MP3 files from YouTube for voice cloning)

Installation

1) Install Python Dependencies

pip install -r requirements.txt

2) Set Up Zyphra/Zonos

Follow the official installation instructions from Zyphra/Zonos. Using a uv virtual environment is recommended for running RunAndRead scripts. After installing the Zonos project, run the sample.py script:

uv run sample.py

This will download the "Zyphra/Zonos-v0.1-transformer" base model from Hugging Face and store it in your environment.

3) Set Up ffmpeg

Install ffmpeg if it is not already on your system (it is used for audio conversion). On macOS it can be installed with Homebrew: brew install ffmpeg.

4) Download a Voice Sample from YouTube

To train a Zonos voice clone, you'll need an MP3 sample of the speaker. A 10-20 minute video with a single speaker (e.g., a tutorial or audiobook) is recommended. You can download an MP3 track from YouTube using yt-dlp:

yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=MkLBNUMc26Y" -o "assets/exampleaudio.mp3"

Zonos uses this exampleaudio.mp3 file to build the cloned voice (a speaker embedding) before actual synthesis.


Usage

Step 1: Convert EPUB to JSON

First, run this script with 0 as the third parameter:

python epub_to_json.py epub/pg11.epub library/pg11.json 0

Check the terminal output to find how many lines should be skipped, then rerun the script with the number of the first line to keep:

python epub_to_json.py epub/pg11.epub library/pg11.json 10

This ensures that the book starts from the correct position, e.g.:

10: CHAPTER I. Down the Rabbit-Hole
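
Under the hood, epub_to_json.py extracts the raw text from the EPUB with EbookLib. A minimal sketch of that extraction step, assuming BeautifulSoup for HTML stripping (the real script also applies its own cleanup and writes the JSON book file):

import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

# Read the EPUB and collect the plain-text lines of every document item.
book = epub.read_epub("epub/pg11.epub")
lines = []
for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    soup = BeautifulSoup(item.get_content(), "html.parser")
    text = soup.get_text(separator="\n")
    lines.extend(line.strip() for line in text.splitlines() if line.strip())

# Print numbered lines so you can pick the first line to keep (the third CLI argument).
for i, line in enumerate(lines):
    print(f"{i}: {line}")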

Note: Without an NVIDIA GPU, converting an entire book to audio takes a long time. A 30-second audio clip takes roughly 3 minutes to generate on a MacBook Pro (M1). A full book can take dozens of hours; for example, Alice’s Adventures in Wonderland runs about 3 hours, which translates to roughly 18 hours of processing on an M1 MacBook Pro. However, the make_abook script can be interrupted at any time and will resume from where it stopped.

Step 2: Generate TTS Audio Files

uv run python make_abook.py library/pg21279.json assets/kurt_v.mp3
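
Internally, each text chunk is synthesized with the Zonos Python API using the cloned voice. A rough sketch of one such call, following the upstream Zonos sample (the device, file names, and chunking here are illustrative; make_abook.py also handles resume and output naming):

import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the base model (downloaded from Hugging Face on first use).
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cpu")  # device is illustrative

# Build a speaker embedding from the voice sample (the cloned voice).
wav, sr = torchaudio.load("assets/kurt_v.mp3")
speaker = model.make_speaker_embedding(wav, sr)

# Synthesize one chunk of text with the cloned voice and save it as a clip.
cond = make_cond_dict(text="Alice was beginning to get very tired.", speaker=speaker, language="en-us")
codes = model.generate(model.prepare_conditioning(cond))
audio = model.autoencoder.decode(codes).cpu()
torchaudio.save("audio/pg11/clip_0000.wav", audio[0], model.autoencoder.sampling_rate)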

Step 3: Play Audiobook in CLI

python play_audio.py audio/pg11 mp3

Step 4: Merge a set of audio clips into one audio file

python merge_audio_clips.py library/pg11.json audio/pg11 mp3
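
merge_audio_clips.py concatenates the per-chunk clips into one file and, per the project structure notes, writes a timestamped JSON file alongside it. As an illustration only (the actual script may use a different library, file names, and JSON layout), a pydub-based sketch looks like this:

import json
import pathlib
from pydub import AudioSegment  # pydub relies on ffmpeg for MP3 decoding/encoding

# Collect the clips in order, skipping any previously merged output.
clips = sorted(p for p in pathlib.Path("audio/pg11").glob("*.mp3")
               if p.name != "merged_output.mp3")

merged = AudioSegment.empty()
timestamps = []
for clip in clips:
    segment = AudioSegment.from_mp3(clip)
    timestamps.append({"file": clip.name, "start_ms": len(merged)})  # len() is in milliseconds
    merged += segment

merged.export("audio/pg11/merged_output.mp3", format="mp3")
with open("audio/pg11/timestamps.json", "w") as f:  # hypothetical file name
    json.dump(timestamps, f, indent=2)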

Step 5: Prepare audio clip for YouTube/LinkedIn

# YouTube
ffmpeg -loop 1 -i assets/ic_launcher.png -i audio/pg11/merged_output.mp3 -c:v libx264 -c:a aac -b:a 192k -shortest output.mp4 
# LinkedIn
ffmpeg -loop 1 -i appGoogle.png -i merged_output.mp3 -vf "scale=1080:1080,format=yuv420p" -c:v libx264 -tune stillimage -c:a aac -b:a 192k -shortest output.mp4

# X
ffmpeg -loop 1 -i appGoogle.png -i merged_output.mp3 -vf "scale=1080:1080,format=yuv420p" -c:v libx264 -tune stillimage -c:a aac -b:a 192k -pix_fmt yuv420p -shortest output.mp4

Step 6: Set up and run the Zyphra / Deepgram / OpenAI cloud TTS scripts

# Zyphra
export ZYPHRA_API_KEY="your-zyphra-api-key"
python zyphra_api.py library/pg11.json
# Deepgram
export DEEPGRAM_API_KEY="your-deepgram-api-key"
python deepgram_api.py library/pg11.json
# OpenAI MINI TTS
export OPENAI_API_KEY="your-openai-api-key"
python make_abook_open_ai.py library/pg11.json
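
Each of these scripts sends book chunks to the corresponding cloud TTS endpoint. For the OpenAI path, a single request with the official Python SDK looks roughly like this (the model and voice names are assumptions; make_abook_open_ai.py adds chunking, output naming, and resume logic on top):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream one synthesized chunk straight to an MP3 file.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",  # assumed "mini" TTS model name
    voice="alloy",
    input="Alice was beginning to get very tired of sitting by her sister on the bank.",
) as response:
    response.stream_to_file("audio/pg11/clip_0000.mp3")  # hypothetical output path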

Step 7: Set up MLX-AUDIO (local clone)

Install MLX-AUDIO in editable mode from a local clone of the repository (adjust the path to your clone):

pip install -e ~/projects/voice/mlx-audio

Note: The Kokoro-82M TTS model skips names and other out-of-dictionary (OOD) words because it relies on an external grapheme-to-phoneme (g2p) conversion tool, espeak-ng. This happens when espeak-ng is not properly installed or detected on the system. To prevent Kokoro-82M from skipping names and OOD words, install espeak-ng and point it at its data directory:

echo 'export ESPEAK_DATA_PATH=/opt/homebrew/share/espeak-ng-data' >> ~/.zshrc
source ~/.zshrc

# make the audiobook
python make_abook_mlx.py library/pg2680.json
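
If names still get skipped, it usually means espeak-ng is not visible to the g2p layer. A quick sanity check (assuming a Homebrew install) before starting a long run:

import os
import shutil
import subprocess

# Verify the espeak-ng binary is on PATH and the data path is exported.
print("espeak-ng binary:", shutil.which("espeak-ng"))
print("ESPEAK_DATA_PATH:", os.environ.get("ESPEAK_DATA_PATH"))
subprocess.run(["espeak-ng", "--version"], check=True)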

Step 8: Make RANDR Audiobook

python make_randr.py audio/pg20203/

Pipeline Schema

flowchart LR
    A[EPUB] --> B[epub_to_json.py]
    B --> C[JSON book]
    C --> D[make_abook.py / make_abook_mlx.py / make_abook_open_ai.py / zyphra_api.py / deepgram_api.py]
    D --> E[Audio clips]
    E --> F[play_audio.py]
    E --> G[merge_audio_clips.py]
    C --> H[make_randr.py]
    E --> H
    H --> I[RANDR zip]

Project Structure

runandread-audiobook/
β”œβ”€β”€ epub_to_json.py      # Extracts text from an EPUB into JSON
β”œβ”€β”€ make_abook.py        # Converts text into audio files with Zonos TTS
β”œβ”€β”€ make_abook_mlx.py    # Converts text into audio files with the Kokoro-82M model via mlx-audio (optimized for Apple M-series processors)
β”œβ”€β”€ make_randr.py        # Wraps the produced audio and JSON files into a ZIP readable by the Run & Read app
β”œβ”€β”€ play_audio.py        # Plays audio clips sequentially while displaying text
β”œβ”€β”€ merge_audio_clips.py # Merges audio files into one and generates a timestamped JSON file
β”œβ”€β”€ word_tokens_tools.py # Utilities to normalize text before passing it to the TTS engine
β”œβ”€β”€ test_scan_next.py    # Unit tests that verify text normalization works as expected
β”œβ”€β”€ zyphra_api.py        # Converts text into audio files with the Zyphra SDK / REST API
β”œβ”€β”€ deepgram_api.py      # Converts text into audio files with the Deepgram SDK / REST API
β”œβ”€β”€ make_abook_open_ai.py # Converts text into audio files with OpenAI TTS
β”œβ”€β”€ assets/              # MP3 files for voice cloning
β”œβ”€β”€ epub/                # EPUB books from Project Gutenberg
β”œβ”€β”€ audio/               # Output audio files
β”œβ”€β”€ audiobooks/          # Sample RANDR audiobooks
β”‚   β”œβ”€β”€ pg2680.randr     # Meditations by Marcus Aurelius
β”‚   └── pg20203.randr    # Autobiography of Benjamin Franklin
β”œβ”€β”€ library/             # Output JSON book files
β”œβ”€β”€ README.md            # Documentation
β”œβ”€β”€ requirements.txt     # Dependencies
└── LICENSE              # Open-source license

Contributions

Contributions are welcome! Feel free to open an issue or submit a pull request.


References & Kudos

  • Zonos - Open-source TTS model.
  • MLX-AUDIO - A TTS and STS library built on Apple's MLX framework.
  • Kokoro-TTS - An open-weight TTS model with 82 million parameters.
  • Deepgram - Commercial cloud-based TTS.
  • EbookLib - EPUB parsing in Python.
  • yt-dlp - YouTube audio downloader for voice cloning.
  • Project Gutenberg - A library of over 75,000 free eBooks.
  • Python Simplified, MariyaSha - Kudos to Mariya for her beautiful voice, which I cloned from one of her videos.

Contact

  • Sergey N - Connect and follow me on LinkedIn.

License

This project is open-source and available under the MIT License.
