Releases: mudler/LocalAI
v3.11.0
🎉 LocalAI 3.11.0 Release! 🚀
LocalAI 3.11.0 is a massive update for Audio and Multimodal capabilities.
We are introducing Realtime Audio Conversations, a dedicated Music Generation UI, and a major expansion of ASR (Speech-to-Text) and TTS backends. Whether you want to talk to your AI, clone voices, transcribe with speaker identification, or generate songs, this release has you covered.
Check out the highlights below!
📌 TL;DR
| Feature | Summary |
|---|---|
| Realtime Audio | Native support for audio conversations, enabling fluid voice interactions similar to OpenAI's Realtime API. Documentation |
| Music Generation UI | New UI interface for MusicGen (Ace-Step), allowing you to generate music from text prompts directly in the browser. |
| New ASR Backends | Added WhisperX (with Speaker Diarization), VibeVoice, Qwen-ASR, and Nvidia NeMo. |
| TTS Streaming | Text-to-Speech now supports streaming mode for lower latency responses. (VoxCPM only for now) |
| vLLM Omni | Added support for vLLM Omni, expanding our high-performance inference capabilities. |
| Speaker Diarization | Native support for identifying different speakers in transcriptions via WhisperX. |
| Hardware Expansion | Expanded build support for CUDA 12/13, L4T (Jetson), SBSA, and better Metal (Apple Silicon) integration with MLX backends |
| Breaking Changes | ExLlama (deprecated) and Bark (unmaintained) backends have been removed. |
🚀 New Features & Major Enhancements
🎙️ Realtime Audio Conversations
LocalAI 3.11.0 introduces native support for Realtime Audio Conversations.
- Enables fluid, low-latency voice interaction with agents.
- Logic handled directly within the LocalAI pipeline for seamless audio-in/audio-out workflows.
- Support for STT/TTS and voice-to-voice models (experimental)
- Support for tool calls
🗣️ Talk to your LocalAI: This brings us one step closer to a fully local, voice-native assistant experience compatible with standard client implementations.
Check here for detailed documentation.
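As a quick connectivity check, here is a minimal sketch using `websocat`; the endpoint path and model name are assumptions modeled on OpenAI's Realtime API shape, so consult the documentation above for the exact values.

```bash
# Assumed endpoint path and model name, following OpenAI's Realtime API shape.
# Once connected, send session/response events as JSON lines, e.g. a
# response.create event to start a spoken reply.
websocat "ws://localhost:8080/v1/realtime?model=my-realtime-model"
```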
🎵 Music Generation UI & Ace-Step
We have added a dedicated interface for music generation!
- New Backend: Support for Ace-Step (MusicGen) via the `ace-step` backend.
- Web UI Integration: Generate musical clips directly from the LocalAI Web UI.
- Simple text-to-music workflow (e.g., "Lo-fi hip hop beat for studying").
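For API access (as opposed to the Web UI), here is a hedged sketch against LocalAI's sound-generation endpoint; the endpoint path, field names, and model name are assumptions, so check the docs or the Web UI's network calls for the exact shape.

```bash
# Hypothetical request: endpoint path and field names are assumptions.
curl http://localhost:8080/sound-generation \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "ace-step",
    "text": "Lo-fi hip hop beat for studying",
    "duration": 20
  }' \
  --output clip.wav
```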
🎧 Massive ASR (Speech-to-Text) Expansion
This release significantly broadens our transcription capabilities with four new backends (usage example below):
- WhisperX: Provides fast transcription with Speaker Diarization (identifying who is speaking).
- VibeVoice: Now also supports ASR alongside TTS.
- Qwen-ASR: Support for Qwen's powerful speech recognition models.
- Nvidia NeMo: Initial support for NeMo ASR.
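All of these plug into the OpenAI-compatible transcription endpoint. A minimal sketch, assuming you have installed a model named `whisperx` from the gallery (the model and file names are placeholders):

```bash
# Multipart upload to the OpenAI-compatible transcription endpoint.
# "whisperx" stands in for whatever ASR model you installed.
curl http://localhost:8080/v1/audio/transcriptions \
  -F file="@meeting.wav" \
  -F model="whisperx"
```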
🗣️ TTS Streaming & New Voices
Text-to-Speech gets a speed boost and new options (example below):
- Streaming Support: TTS endpoints now support streaming, reducing the "time-to-first-audio" significantly.
- VoxCPM: Added support for the VoxCPM backend.
- Qwen-TTS: Added support for Qwen-TTS models.
- Piper Voices: Added most remaining Piper voices from Hugging Face to the gallery.
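A minimal sketch against the OpenAI-compatible speech endpoint, assuming a gallery model named `voxcpm` (the model name is a placeholder, and how streaming is toggled may differ; see the docs):

```bash
# OpenAI-style TTS request; "voxcpm" is a placeholder model name.
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voxcpm",
    "input": "Streaming cuts time-to-first-audio significantly."
  }' \
  --output speech.wav
```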
🛠️ Hardware & Backend Updates
- vLLM Omni: A new backend integration for vLLM Omni models.
- Extended Platform Support: Major work on MLX to improve compatibility across CUDA 12, CUDA 13, L4T (Nvidia Jetson), SBSA, and macOS Metal.
- GGUF Cleanup: Dropped redundant VRAM estimation logic for GGUF loading, relying on more accurate internal measurements.
⚠️ Breaking Changes
To keep the project lean and maintainable, we have removed some older backends:
- ExLlama: Removed (deprecated in favor of newer loaders like ExLlamaV2 or llama.cpp).
- Bark: Removed (the upstream project is unmaintained; we recommend using the new TTS alternatives).
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Breaking Changes 🛠
Bug fixes 🐛
- fix(ui): correctly display selected image model by @dedyf5 in #8208
- fix(ui): take account of reasoning in token count calculation by @mudler in #8324
- fix: drop gguf VRAM estimation (now redundant) by @mudler in #8325
- fix(api): Add missing field in initial OpenAI streaming response by @acon96 in #8341
- fix(realtime): Include noAction function in prompt template and handle tool_choice by @richiejp in #8372
- fix: filter GGUF and GGML files from model list by @Yaroslav98214 in #8397
- fix(qwen-asr): Remove contagious slop (DEFAULT_GOAL) from Makefile by @richiejp in #8431
Exciting New Features 🎉
- feat(vllm-omni): add new backend by @mudler in #8188
- feat(vibevoice): add ASR support by @mudler in #8222
- feat: add VoxCPM tts backend by @mudler in #8109
- feat(realtime): Add audio conversations by @richiejp in #6245
- feat(qwen-asr): add support to qwen-asr by @mudler in #8281
- feat(tts): add support for streaming mode by @mudler in #8291
- feat(api): Add transcribe response format request parameter & adjust STT backends by @nanoandrew4 in #8318
- feat(whisperx): add whisperx backend for transcription with speaker diarization by @eureka928 in #8299
- feat(mlx): Add support for CUDA12, CUDA13, L4T, SBSA and CPU by @mudler in #8380
- feat(musicgen): add ace-step and UI interface by @mudler in #8396
- fix(api)!: Stop model prior to deletion by @nanoandrew4 in #8422
- feat(nemo): add Nemo (only asr for now) backend by @mudler in #8436
🧠 Models
- chore(model gallery): add qwen3-tts to model gallery by @mudler in #8187
- chore(model gallery): Add most of not yet present Piper voices from Hugging Face by @rampa3 in #8202
- chore: drop bark which is unmaintained by @mudler in #8207
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8220
- chore(model gallery): Add entry for Mistral Small 3.1 with mmproj by @rampa3 in https://git...
v3.10.1
This is a small patch release with bug fixes and minor polish. We've also added support for Qwen-TTS, which was released just yesterday.
- Fix reasoning detection on reasoning and instruct models
- Support reasoning blocks with openresponses
- API fixes to correctly run LTX-2
- Support Qwen3-TTS!
What's Changed
Bug fixes 🐛
- fix(reasoning): support models with reasoning without starting thinking tag by @mudler in #8132
- fix(tracing): Create trace buffer on first request to enable tracing at runtime by @richiejp in #8148
- fix(videogen): drop incomplete endpoint, add GGUF support for LTX-2 by @mudler in #8160
Exciting New Features 🎉
- feat(openresponses): Support reasoning blocks by @mudler in #8133
- feat: detect thinking support from backend automatically if not explicitly set by @mudler in #8167
- feat(qwen-tts): add Qwen-tts backend by @mudler in #8163
🧠 Models
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8128
- chore(model gallery): add flux 2 and flux 2 klein by @mudler in #8141
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #8153
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8157
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8170
👒 Dependencies
- chore(deps): bump github.com/mudler/cogito from 0.7.2 to 0.8.1 by @dependabot[bot] in #8124
Other Changes
- feat(swagger): update swagger by @localai-bot in #8098
- chore: ⬆️ Update ggml-org/llama.cpp to `287a33017b32600bfc0e81feeb0ad6e81e0dd484` by @localai-bot in #8100
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `2efd19978dd4164e387bf226025c9666b6ef35e2` by @localai-bot in #8099
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8120
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `a48b4a3ade9972faf0adcad47e51c6fc03f0e46d` by @localai-bot in #8121
- chore: ⬆️ Update ggml-org/llama.cpp to `959ecf7f234dc0bc0cd6829b25cb0ee1481aa78a` by @localai-bot in #8122
- chore(deps): Bump llama.cpp to '1c7cf94b22a9dc6b1d32422f72a627787a4783a3' by @mudler in #8136
- chore: drop noisy logs by @mudler in #8142
- chore: ⬆️ Update ggml-org/llama.cpp to `ad8d85bd94cc86e89d23407bdebf98f2e6510c61` by @localai-bot in #8145
- chore: ⬆️ Update ggml-org/whisper.cpp to `7aa8818647303b567c3a21fe4220b2681988e220` by @localai-bot in #8146
- feat(swagger): update swagger by @localai-bot in #8150
- chore(diffusers): add 'av' to requirements.txt by @mudler in #8155
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `329571131d62d64a4f49e1acbef49ae02544fdcd` by @localai-bot in #8152
- chore: ⬆️ Update ggml-org/llama.cpp to `c301172f660a1fe0b42023da990bf7385d69adb4` by @localai-bot in #8151
- chore: ⬆️ Update ggml-org/llama.cpp to `a5eaa1d6a3732bc0f460b02b61c95680bba5a012` by @localai-bot in #8165
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `5e4579c11d0678f9765463582d024e58270faa9c` by @localai-bot in #8166
Full Changelog: v3.10.0...v3.10.1
v3.10.0
🎉 LocalAI 3.10.0 Release! 🚀
LocalAI 3.10.0 is big on agent capabilities, multi-modal support, and cross-platform reliability.
We've added native Anthropic API support, launched a new Video Generation UI, introduced Open Responses API compatibility, and enhanced performance with a unified GPU backend system.
For a full tour, see below!
📌 TL;DR
| Feature | Summary |
|---|---|
| Anthropic API Support | Fully compatible /v1/messages endpoint for seamless drop-in replacement of Claude. |
| Open Responses API | Native support for stateful agents with tool calling, streaming, background mode, and multi-turn conversations, passing all official acceptance tests. |
| Video & Image Generation Suite | New video gen UI + LTX-2 support for text-to-video and image-to-video. |
| Unified GPU Backends | GPU libraries (CUDA, ROCm, Vulkan) packaged inside backend containers — works out of the box on Nvidia, AMD, and ARM64 (Experimental). |
| Tool Streaming & XML Parsing | Full support for streaming tool calls and XML-formatted tool outputs. |
| System-Aware Backend Gallery | Only see backends your system can run (e.g., hide MLX on Linux). |
| Crash Fixes | Prevents crashes on AVX-only CPUs (Intel Sandy/Ivy Bridge) and fixes VRAM reporting on AMD GPUs. |
| Request Tracing | Debug agents & fine-tuning with memory-based request/response logging. |
| Moonshine Backend | Ultra-fast transcription engine for low-end devices. |
| Pocket-TTS | Lightweight, high-fidelity text-to-speech with voice cloning. |
| Vulkan arm64 builds | We now build backends and images for Vulkan on arm64 as well. |
🚀 New Features & Major Enhancements
🤖 Open Responses API: Build Smarter, Autonomous Agents
LocalAI now supports the OpenAI Responses API, enabling powerful agentic workflows locally.
- Stateful conversations via `response_id` — resume and manage long-running agent sessions.
- Background mode: Run agents asynchronously and fetch results later.
- Streaming support for tools, images, and audio.
- Built-in tools: Web search, file search, and computer use (via MCP integrations).
- Multi-turn interaction with dynamic context and tool use.
✅ Ideal for developers building agents that can browse, analyze files, or interact with systems — all on your local machine.
🔧 How to Use:
- Set `response_id` in your request to maintain session state across calls.
- Use `background: true` to run agents asynchronously.
- Retrieve results via `GET /api/v1/responses/{response_id}`.
- Enable streaming with `stream: true` to receive partial responses and tool calls in real time.

📌 Tip: Use `response_id` to build agent orchestration systems that persist context and avoid redundant computation.
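Putting those pieces together, a minimal sketch (the model name and the returned response id are placeholders):

```bash
# Create a background response.
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "input": "Summarize the open issues in this repository",
    "background": true
  }'

# Later, fetch the result with the id returned above.
curl http://localhost:8080/api/v1/responses/resp_abc123
```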
Our support passes all the official acceptance tests.
🧠 Anthropic Messages API: Clone Claude Locally
LocalAI now fully supports the Anthropic messages API.
- Use `https://api.localai.host/v1/messages` as a drop-in replacement for Claude.
- Full tool/function calling support, just like OpenAI.
- Streaming and non-streaming responses.
- Compatible with `anthropic-sdk-go`, LangChain, and other tooling.
🔥 Perfect for teams migrating from Anthropic to local inference with full feature parity.
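A minimal sketch of an Anthropic-style request pointed at a local instance (the host and model name are placeholders; the `anthropic-version` header follows Anthropic's spec):

```bash
# Anthropic Messages request shape, served by LocalAI.
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "my-model",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Hello from an Anthropic client!"}
    ]
  }'
```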
🎥 Video Generation: From Text to Video in the Web UI
- New dedicated video generation page with intuitive controls.
- LTX-2 is supported
- Supports text-to-video and image-to-video workflows.
- Built on top of `diffusers` with full compatibility.

📌 How to Use:
- Go to `/video` in the web UI.
- Enter a prompt (e.g., "A cat walking on a moonlit rooftop").
- Optionally upload an image for image-to-video generation.
- Adjust parameters like `fps`, `num_frames`, and `guidance_scale`.
⚙️ Unified GPU Backends: Acceleration Works Out of the Box
A major architectural upgrade: GPU libraries (CUDA, ROCm, Vulkan) are now packaged inside backend containers.
- Single image: You no longer need to pull a GPU-specific image. Any image works whether or not you have a GPU.
- No more manual GPU driver setup — just run the image and get acceleration.
- Works on Nvidia (CUDA), AMD (ROCm), and ARM64 (Vulkan).
- Vulkan arm64 builds enabled
- Reduced image complexity, faster builds, and consistent performance.
🚀 This means latest/master images now support GPU acceleration on all platforms — no extra config!
Note: this is experimental, please help us by filing an issue if something doesn't work!
🧩 Tool Streaming & Advanced Parsing
Enhance your agent workflows with richer tool interaction.
- Streaming tool calls: Receive partial tool arguments in real time (e.g., `input_json_delta`).
- XML-style tool call parsing: Models that return tools in XML format (`<function>...</function>`) are now properly parsed alongside text.
- Works across all backends (llama.cpp, vLLM, diffusers, etc.).
💡 Enables more natural, real-time interaction with agents that use structured tool outputs.
🌐 System-Aware Backend Gallery: Only Compatible Backends Show
The backend gallery now shows only backends your system can run.
- Auto-detects system capabilities (CPU, GPU, MLX, etc.).
- Hides unsupported backends (e.g., MLX on Linux, CUDA on AMD).
- Shows detected capabilities in the hero section.
🎤 New TTS Backends: Pocket-TTS
Add expressive voice generation to your apps with Pocket-TTS.
- Real-time text-to-speech with voice cloning support (requires HF login).
- Lightweight, fast, and open-source.
- Available in the model gallery.
🗣️ Perfect for voice agents, narrators, or interactive assistants.
❗ Note: Voice cloning requires HF authentication and a registered voice model.
🔍 Request Tracing: Debug Your Agents
Trace requests and responses in memory — great for fine-tuning and agent debugging.
- Enable via runtime setting or API.
- Logs are stored in memory and dropped once the maximum size is reached.
- Fetch logs via `GET /api/v1/trace`.
- Export to JSON for analysis.
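For example, to dump the current trace buffer to a file for offline analysis:

```bash
# Fetch the in-memory request/response log and save it as JSON.
curl http://localhost:8080/api/v1/trace -o trace.json
```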
🪄 New 'Reasoning' Field: Extract Thinking Steps
LocalAI now automatically detects and extracts thinking tags from model output.
- Supports both SSE and non-SSE modes.
- Displays reasoning steps in the chat UI (under "Thinking" tab).
- Fixes issue where thinking content appeared as part of final answer.
🚀 Moonshine Backend: Faster Transcription for Low-End Devices
Add Moonshine, an ONNX-based transcription engine, for fast, lightweight speech-to-text.
- Optimized for low-end devices (Raspberry Pi, older laptops).
- One of the fastest transcription engines available.
- Supports live transcription.
🛠️ Fixes & Stability Improvements
🔧 Prevent BMI2 Crashes on AVX-Only CPUs
Fixed crashes on older Intel CPUs (Ivy Bridge, Sandy Bridge) that lack BMI2 instructions.
- Now safely falls back to `llama-cpp-fallback` (SSE2 only).
- No more `EOF` errors during model warmup.
✅ Ensures LocalAI runs smoothly on older hardware.
📊 Fix Swapped VRAM Usage on AMD GPUs
Correctly parses rocm-smi output: used and total VRAM are no longer swapped.
- Fixes misreported memory usage on dual-Radeon setups.
- Handles `HIP_VISIBLE_DEVICES` properly (e.g., when using only the discrete GPU).
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
v3.9.0
Xmas-release 🎅 LocalAI 3.9.0! 🚀
LocalAI 3.9.0 is focused on stability, resource efficiency, and smarter agent workflows. We've addressed critical issues with model loading, improved system resource management, and introduced a new Agent Jobs panel for scheduling and managing background agentic tasks. Whether you're running models locally or orchestrating complex agent workflows, this release makes it faster, more reliable, and easier to manage.
📌 TL;DR
| Feature | Summary |
|---|---|
| Agent Jobs Panel | Schedule and run background tasks with cron or via API — perfect for automated workflows. |
| Smart Memory Reclaimer | Automatically frees up GPU/VRAM by evicting least recently used models when memory is low. |
| LRU Model Eviction | Models are automatically unloaded from memory based on usage patterns to prevent crashes. |
| MLX & CUDA 13 Support | New model backends and enhanced GPU compatibility for modern hardware. |
| UI Polish & Fixes | Cleaned-up navigation, fixed layout overflow, and various improvements. |
| Vibevoice | Added support for the vibevoice backend! |
🚀 New Features
🤖 Agent Jobs Panel: Schedule & Automate Tasks
LocalAI 3.9.0 introduces a new Agent Jobs panel, allowing you to create, run, and schedule agentic tasks in the background, started programmatically via the API or from the web interface.
- Run agent prompts on a schedule using cron syntax, or via API.
- Agents are defined via the model settings, supporting MCP.
- Trigger jobs via API for integration into CI/CD or external tools.
- Optionally send results to a webhook for post-processing.
- Templates and prompts can be dynamically populated with variables.
✅ Use cases: Daily reports, CI integration, automated data processing, scheduled model evaluations.
🧠 Smart Memory Reclaimer: Auto-Optimize GPU Resources
We’ve introduced a new Memory Reclaimer that monitors system memory usage and automatically frees up GPU/VRAM when needed.
- Tracks memory consumption across all backends.
- When usage exceeds a configured threshold, it evicts the least recently used (LRU) models.
- Prevents out-of-memory crashes and keeps your system stable during high load.
This is a step toward adaptive resource management; future versions will expand it with more advanced policies and finer-grained control.
🔁 LRU Model Eviction: Intelligent Model Management
Building on the new reclaimer, LocalAI now supports LRU (Least Recently Used) eviction for loaded models.
- Set a maximum number of models to keep in memory (e.g., limit to 3).
- When a new model is loaded and the limit is reached, the oldest unused model is automatically unloaded.
- Fully compatible with `single_active_backend` mode (now defaults to LRU=1 for backward compatibility).
💡 Ideal for servers with limited VRAM or when running multiple models in parallel.
🖥️ UI & UX Polish
- Fixed navbar ordering and login icon — clearer navigation and better visual flow.
- Prevented tool call overflow in chat view — no more clipped or misaligned content.
- Unified link paths (e.g., `/browse/` instead of `browse`) for consistency.
- Fixed model selection toggle — header updates correctly when switching models.
- Consistent button styling — uniform colors, hover effects, and accessibility.
📦 Backward Compatibility & Architecture
- Dropped x86_64 Mac support: no longer maintained in GitHub Actions; ARM64 (M1/M2/M3/M4) is now the recommended architecture.
- Updated data storage path from `/usr/share` to `/var/lib`: follows Linux conventions for mutable data.
- Added CUDA 13 support: now available in Docker images and L4T builds.
- New VibeVoice TTS backend: real-time text-to-speech with voice cloning support. You can install it from the model gallery!
- StableDiffusion-GGML now supports LoRA: expand your image-generation capabilities.
🛠️ Fixes & Improvements
- Issue: After v3.8.0, the `/readyz` and `/healthz` endpoints required authentication, breaking Docker health checks and monitoring tools.
- Issue: Fixed crashes when importing models from Hugging Face URLs with subfolders (e.g., `huggingface://user/model/GGUF/model.gguf`).
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Breaking Changes 🛠
- chore: switch from /usr/share to /var/lib for data storage by @poretsky in #7361
- chore: drop drawin-x86_64 support by @mudler in #7616
Bug fixes 🐛
- fix: do not require auth for readyz/healthz endpoints by @mudler in #7403
- fix(ui): navbar ordering and login icon by @mudler in #7407
- fix: configure sbsa packages for arm64 by @mudler in #7413
- fix(ui): prevent box overflow in chat view by @mudler in #7430
- fix(ui): Update few links in web UI from 'browse' to '/browse/' by @rampa3 in #7445
- fix(paths): remove trailing slash from requests by @mudler in #7451
- fix(downloader): do not download model files if not necessary by @mudler in #7492
- fix(config): make syncKnownUsecasesFromString idempotent by @mudler in #7493
- fix: make sure to close on errors by @mudler in #7521
- fix(llama.cpp): handle corner cases with tool array content by @mudler in #7528
- fix(7355): Update llama-cpp grpc for v3 interface by @sredman in #7566
- fix(chat-ui): model selection toggle and new chat by @mudler in #7574
- fix: improve ram estimation by @mudler in #7603
- fix(ram): do not read from cgroup by @mudler in #7606
- fix: correctly propagate error during model load by @mudler in #7610
- fix(ci): remove specific version for grpcio packages by @mudler in #7627
- fix(uri): consider subfolders when expanding huggingface URLs by @mintyleaf in #7634
Exciting New Features 🎉
- feat: agent jobs panel by @mudler in #7390
- chore: refactor css, restyle to be slightly minimalistic by @mudler in https://github.com/mudler/LocalAI/p...
v3.8.0
Welcome to LocalAI 3.8.0 !
LocalAI 3.8.0 focuses on smoothing out the user experience and exposing more power to the user without requiring restarts or complex configuration files. This release introduces a new onboarding flow and a universal model loader that handles everything from HF URLs to local files.
We’ve also improved the chat interface, addressed long-standing requests regarding OpenAI API compatibility (specifically SSE streaming standards) and exposed more granular controls for some backends (llama.cpp) and backend management.
📌 TL;DR
| Feature | Summary |
|---|---|
| Universal Model Import | Import directly from Hugging Face, Ollama, OCI, or local paths. Auto-detects backends and handles chat templates. |
| UI & Index Overhaul | New onboarding wizard, auto-model selection on boot, and a cleaner tabular view for model management. |
| MCP Live Streaming | New: Agent actions and tool calls are now streamed live via the Model Context Protocol—see reasoning in real-time. |
| Hot-Reloadable Settings | Modify watchdogs, API keys, P2P settings, and defaults without restarting the container. |
| Chat enhancements | Chat history and parallel conversations are now persisted in local storage. |
| Strict SSE Compliance | Fixed streaming format to exactly match OpenAI specs (resolves issues with LangChain/JS clients). |
| Advanced Config | Fine-tune context_shift, cache_ram, and parallel workers via YAML options. |
| Logprobs & Logitbias | Added token-level probability support for improved agent/eval workflows. |
Feature Breakdown
🚀 Universal Model Import (URL-based)
We have refactored how models are imported. You no longer need to manually write configuration files for common use cases. The new importer accepts URLs from Hugging Face, Ollama, and OCI registries, as well as local file paths, directly from the web interface (CLI examples below).
import.mp4
- Auto-Detection: The system attempts to identify the correct backend (e.g., `llama.cpp` vs `diffusers`) and applies native chat templates (e.g., `llama-3`, `mistral`) automatically by reading the model metadata.
- Customization during Import: You can override defaults immediately, for example, forcing a specific quantization on a GGUF file or selecting `vLLM` over `transformers`.
- Multimodal Support: Vision components (`mmproj`) are detected and configured automatically.
- File Safety: We added a safeguard to prevent the deletion of model files (blobs) if they are shared by multiple model configurations.
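From the CLI, the same URI schemes work with `local-ai run`; the model identifiers below are illustrative placeholders:

```bash
# Hugging Face GGUF file (repository and file name are examples)
local-ai run huggingface://bartowski/some-model-GGUF/some-model-Q4_K_M.gguf

# Ollama registry
local-ai run ollama://gemma:2b

# OCI registry
local-ai run oci://localai/phi-2:latest
```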
🎨 Complete UI Overhaul
The web interface has been redesigned for better usability and clearer management.
index.mp4
- Onboarding Wizard: A guided flow helps first-time users import or install a model in under 30 seconds.
- Auto-Focus & Selection: The input field captures focus automatically, and a default model is loaded on startup so you don't start in a "no model selected" state.
- Tabular Management: Models and backends are now organized in a cleaner list view, making it easier to see what is installed.
manage.mp4
🤖 Agentic Ecosystem & MCP Live Streaming
LocalAI 3.8.0 significantly upgrades support for agentic workflows using the Model Context Protocol (MCP).
- Live Action Streaming: We have added a new endpoint to stream agent results as they happen. Instead of waiting for the final output, you can now watch the agent "think": seeing tool calls, reasoning steps, and intermediate actions streamed live in the UI.
mcp.mp4
Configuring MCP via the interface is now simplified:
mcp_configuration.mp4
🔁 Runtime System Settings
A new Settings > System panel exposes configuration options that previously required environment variables or a restart.
settings.mp4
- Immediate Effect: Toggling Watchdogs, P2P, and Gallery availability applies instantly.
- API Key Management: You can now generate, rotate, and expire API keys via the UI.
- Network: CORS and CSRF settings are now accessible here (note: these specific network settings still require a restart to take effect).
Note: To persist runtime settings on older LocalAI deployments, you need to mount the `/configuration` directory from the container image.
⚙️ Advanced llama.cpp Configuration
For power users running large context windows or high-throughput setups, we've exposed additional underlying llama.cpp options in the YAML config. You can now tune context shifting, RAM limits for the KV cache, and parallel worker slots.
```yaml
options:
  - context_shift:false
  - cache_ram:-1
  - use_jinja:true
  - parallel:2
  - grpc_servers:localhost:50051,localhost:50052
```

📊 Logprobs & Logitbias Support
This release adds full support for logitbias and logprobs. This is critical for advanced agentic logic, Self-RAG, and evaluating model confidence / hallucination rates. It supports the OpenAI specification.
🛠️ Fixes & Improvements
OpenAI Compatibility:
- SSE Streaming: Fixed a critical issue where streaming responses were slightly non-compliant (e.g., sending empty content chunks or missing `finish_reason`). This resolves integration issues with `openai-node`, `LangChain`, and `LlamaIndex`.
- Top_N Behavior: In the reranker, `top_n` can now be omitted or set to `0` to return all results, rather than defaulting to an arbitrary limit.
General Fixes:
- Model Preview: When downloading, you can now see the actual filename and size before committing to the download.
- Tool Handling: Fixed crashes when tool content is missing or malformed.
- TTS: Fixed dropdown selection states for TTS models.
- Browser Storage: Chat history is now persisted in your browser's local storage. You can switch between parallel chats, rename them, and export them to JSON.
- True Cancellation: Clicking "Stop" during a stream now correctly propagates a cancellation context to the backend (works for `llama.cpp`, `vLLM`, `transformers`, and `diffusers`). This immediately stops generation and frees up resources.
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
❤️ Thank You
Over 35,000 stars and growing. LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Bug fixes 🐛
- fix(reranker): respect `top_n` in the request by @mkhludnev in #7025
- fix(chatterbox): pin numpy by @mudler in #7198
- fix(reranker): support omitting top_n by @mkhludnev in #7199
- fix(api): SSE streaming format to comply with specification by @Copilot in #7182
- fix(edit): propagate correctly opts when reloading by @mudler in #7233
- fix(reranker): llama-cpp sort score desc, crop top_n by @mkhludnev in #7211
- fix: handle tool errors by @mudler in https://github.com/mudl...
v3.7.0
Welcome to LocalAI 3.7.0 👋
This release introduces Agentic MCP support with full WebUI integration, a brand-new neutts TTS backend, fuzzy model search, long-form TTS chunking for chatterbox, and a complete WebUI overhaul.
We’ve also fixed critical bugs, improved stability, and enhanced compatibility with OpenAI’s APIs.
📌 TL;DR – What’s New in LocalAI 3.7.0
| Feature | Summary |
|---|---|
| 🤖 Agentic MCP Support (WebUI-enabled) | Build AI agents that use real tools (web search, code exec). Fully-OpenAI compatible and integrated into the WebUI. |
| 🎙️ neutts TTS Backend (Neuphonic-powered) | Generate natural, high-quality speech with low-latency audio — ideal for voice assistants. |
| 🖼️ WebUI enhancements | Faster, cleaner UI with real-time updates and full YAML model control. |
| 💬 Long-Text TTS Chunking (Chatterbox) | Generate natural-sounding long-form audio by intelligently splitting text and preserving context. |
| 🧩 Advanced Agent Controls | Fine-tune agent behavior with new options for retries, reasoning, and re-evaluation. |
| 📸 New Video Creation Endpoint | We now support the OpenAI-compatible /v1/videos endpoint for text-to-video generation. |
| 🐍 Enhanced Whisper compatibility | Whisper.cpp is now supported on various CPU variants (AVX, AVX2, etc.) to prevent illegal instruction crashes. |
| 🔍 Fuzzy Gallery Search | Find models in the gallery even with typos (e.g., gema finds gemma). |
| 📦 Easier Model & Backend Management | Import, edit, and delete models directly via clean YAML in the WebUI. |
| Check out the new realtime voice assistant example (multilingual). | |
| Fixed critical crashes, deadlocks, session events, OpenAI compliance, and JSON schema panics. | |
| 🧠 Qwen 3 VL | Support for Qwen 3 VL with llama.cpp/gguf models |
🔥 What’s New in Detail
🤖 Agentic MCP Support – Build Intelligent, Tool-Using AI Agents
We're proud to announce full Agentic MCP support, a feature for building AI agents that can reason, plan, and execute actions using external tools like web search, code execution, and data retrieval. You can use the standard chat/completions endpoint, powered by an agent in the background.
Full documentation is available here
✅ Now in WebUI: A dedicated toggle appears in the chat interface when a model supports MCP. Just click to enable agent mode.
✨ Key Features:
- New Endpoint: `POST /mcp/v1/chat/completions` (OpenAI-compatible).
- Flexible Tool Configuration:

```yaml
mcp:
  stdio: |
    {
      "mcpServers": {
        "duckduckgo": {
          "command": "docker",
          "args": ["run", "-i", "--rm", "ghcr.io/mudler/mcps/duckduckgo:master"]
        }
      }
    }
```
- Advanced Agent Control via the `agent` config:

```yaml
agent:
  max_attempts: 3
  max_iterations: 5
  enable_reasoning: true
  enable_re_evaluation: true
```

- `max_attempts`: Retry failed tool calls up to N times.
- `max_iterations`: Limit how many times the agent can loop through reasoning.
- `enable_reasoning`: Allow step-by-step thought processes (e.g., chain-of-thought).
- `enable_re_evaluation`: Re-analyze decisions when tool results are ambiguous.
You can find some plug-n-play MCPs here: https://github.com/mudler/MCPs
Under the hood, MCP functionality is powered by https://github.com/mudler/cogito
🖼️ WebUI enhancements
The WebUI has had a major overhaul:
- The chat view now has an MCP toggle for models that have `mcp` settings enabled in the model config file.
- The model editor has been simplified to show/edit the YAML settings of the model.
- More reactive: dropped HTMX in favor of Alpine.js and vanilla JavaScript.
- Various fixes, including model deletion.
🎙️ Introducing neutts TTS Backend – Natural Speech, Low Latency
Say hello to neutts, a new, lightweight TTS backend powered by Neuphonic, delivering high-quality, natural-sounding speech with minimal overhead.
🎛️ Setup Example
```yaml
name: neutts-english
backend: neutts
parameters:
  model: neuphonic/neutts-air
tts:
  audio_path: "./output.wav"
  streaming: true
options:
  # text transcription of the provided audio file
  - ref_text: "So I'm live on radio..."
known_usecases:
  - tts
```
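Once configured, generation goes through LocalAI's existing `/tts` endpoint; a minimal sketch, assuming the model name from the config above:

```bash
# Synthesize speech with the neutts-english model defined above.
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "neutts-english",
    "input": "So, I am live on the radio."
  }' \
  --output output.wav
```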
🐍 Whisper.cpp enhancements
whisper.cpp CPU variants are now available for:
- `avx`
- `avx2`
- `avx512`
- `fallback` (no optimized instructions available)
These variants are optimized for specific instruction sets and reduce crashes on older or non-AVX CPUs.
🔍 Smarter Gallery Search: Fuzzy & Case-Insensitive Matching
Searching for gemma now finds gemma-3, gemma2, etc. — even with typos like gemaa or gema.
🧩 Improved Tool & Schema Handling – No More Crashes
We’ve fixed multiple edge cases that caused crashes or silent failures in tool usage.
✅ Fixes:
- Nullable JSON Schemas: `"type": ["string", "null"]` now works without panics.
- Empty Parameters: Tools with missing or empty `parameters` are now handled gracefully.
- Strict Mode Enforcement: When `strict_mode: true`, the model must pick a tool — no more skipping.
- Multi-Type Arrays: Safe handling of `["string", "null"]` in function definitions.
🔄 Interaction with Grammar Triggers:
`strict_mode` and grammar rules work together — if a tool is required and the function definition is invalid, the server returns a clear JSON error instead of crashing.
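To illustrate the fix, here is a chat/completions request whose tool schema uses a nullable type — the pattern that previously caused panics. The model and tool names are placeholders:

```bash
# Tool definition exercising the fixed nullable-type handling.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "What is the weather in Turin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "units": {"type": ["string", "null"]}
          },
          "required": ["city"]
        }
      }
    }]
  }'
```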
📸 New Video Creation Endpoint: OpenAI-Compatible
LocalAI now supports OpenAI’s /v1/videos endpoint for generating videos from text prompts.
📌 Usage Example:
```bash
curl http://localhost:8080/v1/videos \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-..." \
  -d '{
    "model": "sora",
    "prompt": "A cat walking through a forest at sunset",
    "size": "1024x576"
  }'
```

🧠 Qwen 3 VL in llama.cpp
Support has been added for Qwen 3 VL in llama.cpp, and we have updated llama.cpp to the latest version! As a reminder, Qwen 3 VL and multimodal models are also compatible with our vLLM and MLX backends. Qwen 3 VL models are already available in the model gallery:
- `qwen3-vl-30b-a3b-instruct`
- `qwen3-vl-30b-a3b-thinking`
- `qwen3-vl-4b-instruct`
- `qwen3-vl-32b-instruct`
- `qwen3-vl-4b-thinking`
- `qwen3-vl-2b-thinking`
- `qwen3-vl-2b-instruct`
Note: upgrading the llama.cpp backend is necessary if you already have a LocalAI installation.
🚀 (CI) Gallery Updater Agent: Auto-Detect & Suggest New Models
We’ve added an autonomous CI agent that scans Hugging Face daily for new models and opens PRs to update the gallery.
✨ How It Works:
- Scans HF for new, trending models
- Extracts base model, quantization, and metadata.
- Uses cogito (our agentic framework) to assign the model to the correct family and to obtain the model information.
- Opens a PR with:
  - Suggested `name`, `family`, and `usecases`
  - Link to HF model
  - YAML snippet for import
🔧 Critical Bug Fixes & Stability Improvements
| Issue | Fix | Impact |
|---|---|---|
| 📌 WebUI Crash on Model Load | Fixed `can't evaluate field Name in type string` error | Models now render even without config files |
| 🔁 Deadlock in Model Load/Idle Checks | Guarded against race conditions during model loading | Improved performance under load |
| 📞 Realtime API Compliance | Added `session.created` event; removed redundant `conversation.created` | Works with VoxInput, OpenAI clients, and more |
| 📥 MCP Response Formatting | Output wrapped in `message` field | Matches OpenAI spec — better client compatibility |
| 🛑 JSON Error Responses | Now return clean JSON instead of HTML | Scripts and libraries no longer break on auth failures |
| 🔄 Session Registration | Fixed initial MCP calls failing due to cache issues | Reliable first-time use |
| 🎧 kokoro TTS | Returns full audio, not partial | Better for long-form TTS |
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Acts as a drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | A powerful Local AI agent management platform. Serves as a drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
v3.6.0
What's Changed
Bug fixes 🐛
Exciting New Features 🎉
- feat(kokoro): add support for l4t devices by @mudler in #6322
- feat(chatterbox): support multilingual by @mudler in #6240
🧠 Models
- chore(model gallery): add qwen-image-edit-2509 by @mudler in #6336
- chore(models): add whisper-turbo via whisper.cpp by @mudler in #6340
- chore(model gallery): add ibm-granite_granite-4.0-h-small by @mudler in #6373
- chore(model gallery): add ibm-granite_granite-4.0-h-tiny by @mudler in #6374
- chore(model gallery): add ibm-granite_granite-4.0-h-micro by @mudler in #6375
- chore(model gallery): add ibm-granite_granite-4.0-micro by @mudler in #6376
👒 Dependencies
- chore(deps): bump grpcio from 1.74.0 to 1.75.0 in /backend/python/transformers by @dependabot[bot] in #6332
- chore(deps): bump securego/gosec from 2.22.8 to 2.22.9 by @dependabot[bot] in #6324
- chore(deps): bump llama.cpp to '72b24d96c6888c609d562779a23787304ae4609c' by @mudler in #6349
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/coqui by @dependabot[bot] in #6353
- chore(deps): bump transformers from 4.48.3 to 4.56.2 in /backend/python/coqui by @dependabot[bot] in #6330
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/diffusers by @dependabot[bot] in #6361
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/rerankers by @dependabot[bot] in #6360
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/common/template by @dependabot[bot] in #6358
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/vllm by @dependabot[bot] in #6357
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/bark by @dependabot[bot] in #6359
- chore(deps): bump grpcio from 1.75.0 to 1.75.1 in /backend/python/transformers by @dependabot[bot] in #6362
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/exllama2 by @dependabot[bot] in #6356
Other Changes
- chore: ⬆️ Update ggml-org/llama.cpp to `7f766929ca8e8e01dcceb1c526ee584f7e5e1408` by @localai-bot in #6319
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6318
- chore: ⬆️ Update ggml-org/llama.cpp to `da30ab5f8696cabb2d4620cdc0aa41a298c54fd6` by @localai-bot in #6321
- chore: ⬆️ Update ggml-org/llama.cpp to `1d0125bcf1cbd7195ad0faf826a20bc7cec7d3f4` by @localai-bot in #6335
- chore(cudss): add cudds to l4t images by @mudler in #6338
- chore: ⬆️ Update ggml-org/llama.cpp to `4ae88d07d026e66b41e85afece74e88af54f4e66` by @localai-bot in #6339
- CI: disable build-testing on PRs against arm64 by @mudler in #6341
- chore(deps): bump llama.cpp to '835b2b915c52bcabcd688d025eacff9a07b65f52' by @mudler in #6347
- chore: ⬆️ Update ggml-org/llama.cpp to `4807e8f96a61b2adccebd5e57444c94d18de7264` by @localai-bot in #6350
- chore: ⬆️ Update ggml-org/llama.cpp to `bd0af02fc96c2057726f33c0f0daf7bb8f3e462a` by @localai-bot in #6352
- Revert "chore(deps): bump transformers from 4.48.3 to 4.56.2 in /backend/python/coqui" by @mudler in #6363
- chore: ⬆️ Update ggml-org/whisper.cpp to `32be14f8ebfc0498c2c619182f0d7f4c822d52c4` by @localai-bot in #6354
- chore: ⬆️ Update ggml-org/llama.cpp to `5f7e166cbf7b9ca928c7fad990098ef32358ac75` by @localai-bot in #6355
- chore: ⬆️ Update ggml-org/llama.cpp to `b2ba81dbe07b6dbea9c96b13346c66973dede32c` by @localai-bot in #6366
- chore: ⬆️ Update ggml-org/whisper.cpp to `8c0855fd6bb115e113c0dca6255ea05f774d35f7` by @localai-bot in #6365
- chore: ⬆️ Update ggml-org/whisper.cpp to `7849aff7a2e1f4234aa31b01a1870906d5431959` by @localai-bot in #6367
- chore: ⬆️ Update ggml-org/llama.cpp to `1fe4e38cc20af058ed320bd46cac934991190056` by @localai-bot in #6368
- chore: ⬆️ Update ggml-org/llama.cpp to `d64c8104f090b27b1f99e8da5995ffcfa6b726e2` by @localai-bot in #6371
New Contributors
Full Changelog: v3.5.4...v3.6.0
v3.5.4
What's Changed
Bug fixes 🐛
Other Changes
- chore: ⬆️ Update ggml-org/whisper.cpp to `44fa2f647cf2a6953493b21ab83b50d5f5dbc483` by @localai-bot in #6317
- chore: ⬆️ Update ggml-org/llama.cpp to `f432d8d83e7407073634c5e4fd81a3d23a10827f` by @localai-bot in #6316
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6315
Full Changelog: v3.5.3...v3.5.4
v3.5.3
What's Changed
Bug fixes 🐛
🧠 Models
- chore(model gallery): add mistralai_magistral-small-2509 by @mudler in #6309
- chore(model gallery): add impish_qwen_14b-1m by @mudler in #6310
- chore(model gallery): add aquif-3.5-a4b-think by @mudler in #6311
👒 Dependencies
- chore: ⬆️ Update ggml-org/llama.cpp to `3edd87cd055a45d885fa914d879d36d33ecfc3e1` by @localai-bot in #6308
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6307
Full Changelog: v3.5.2...v3.5.3
v3.5.2
What's Changed
👒 Dependencies
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6305
- chore: ⬆️ Update ggml-org/llama.cpp to `0320ac5264279d74f8ee91bafa6c90e9ab9bbb91` by @localai-bot in #6306
Full Changelog: v3.5.1...v3.5.2
