Releases: mudler/LocalAI
v3.11.0
🎉 LocalAI 3.11.0 Release! 🚀
LocalAI 3.11.0 is a massive update for Audio and Multimodal capabilities.
We are introducing Realtime Audio Conversations, a dedicated Music Generation UI, and a major expansion of ASR (Speech-to-Text) and TTS backends. Whether you want to talk to your AI, clone voices, transcribe with speaker identification, or generate songs, this release has you covered.
Check out the highlights below!
📌 TL;DR
| Feature | Summary |
|---|---|
| Realtime Audio | Native support for audio conversations, enabling fluid voice interactions similar to OpenAI's Realtime API. Documentation |
| Music Generation UI | New UI interface for MusicGen (Ace-Step), allowing you to generate music from text prompts directly in the browser. |
| New ASR Backends | Added WhisperX (with Speaker Diarization), VibeVoice, Qwen-ASR, and Nvidia NeMo. |
| TTS Streaming | Text-to-Speech now supports streaming mode for lower latency responses. (VoxCPM only for now) |
| vLLM Omni | Added support for vLLM Omni, expanding our high-performance inference capabilities. |
| Speaker Diarization | Native support for identifying different speakers in transcriptions via WhisperX. |
| Hardware Expansion | Expanded build support for CUDA 12/13, L4T (Jetson), SBSA, and better Metal (Apple Silicon) integration with MLX backends |
| Breaking Changes | ExLlama (deprecated) and Bark (unmaintained) backends have been removed. |
🚀 New Features & Major Enhancements
🎙️ Realtime Audio Conversations
LocalAI 3.11.0 introduces native support for Realtime Audio Conversations.
- Enables fluid, low-latency voice interaction with agents.
- Logic handled directly within the LocalAI pipeline for seamless audio-in/audio-out workflows.
- Support for STT/TTS and voice-to-voice models (experimental)
- Support for tool calls
🗣️ Talk to your LocalAI: This brings us one step closer to a fully local, voice-native assistant experience compatible with standard client implementations.
Check here for detailed documentation.
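As a quick connectivity check, here is a minimal sketch using `websocat`; the endpoint path and model name are assumptions modeled on OpenAI's Realtime API shape, so consult the documentation above for the exact values.

```bash
# Assumed endpoint path and model name, following OpenAI's Realtime API shape.
# Once connected, send session/response events as JSON lines, e.g. a
# response.create event to start a spoken reply.
websocat "ws://localhost:8080/v1/realtime?model=my-realtime-model"
```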
🎵 Music Generation UI & Ace-Step
We have added a dedicated interface for music generation!
- New Backend: Support for Ace-Step (MusicGen) via the `ace-step` backend.
- Web UI Integration: Generate musical clips directly from the LocalAI Web UI.
- Simple text-to-music workflow (e.g., "Lo-fi hip hop beat for studying").
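For API access (as opposed to the Web UI), here is a hedged sketch against LocalAI's sound-generation endpoint; the endpoint path, field names, and model name are assumptions, so check the docs or the Web UI's network calls for the exact shape.

```bash
# Hypothetical request: endpoint path and field names are assumptions.
curl http://localhost:8080/sound-generation \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "ace-step",
    "text": "Lo-fi hip hop beat for studying",
    "duration": 20
  }' \
  --output clip.wav
```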
🎧 Massive ASR (Speech-to-Text) Expansion
This release significantly broadens our transcription capabilities with four new backends (usage example below):
- WhisperX: Provides fast transcription with Speaker Diarization (identifying who is speaking).
- VibeVoice: Now also supports ASR alongside TTS.
- Qwen-ASR: Support for Qwen's powerful speech recognition models.
- Nvidia NeMo: Initial support for NeMo ASR.
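All of these plug into the OpenAI-compatible transcription endpoint. A minimal sketch, assuming you have installed a model named `whisperx` from the gallery (the model and file names are placeholders):

```bash
# Multipart upload to the OpenAI-compatible transcription endpoint.
# "whisperx" stands in for whatever ASR model you installed.
curl http://localhost:8080/v1/audio/transcriptions \
  -F file="@meeting.wav" \
  -F model="whisperx"
```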
🗣️ TTS Streaming & New Voices
Text-to-Speech gets a speed boost and new options (example below):
- Streaming Support: TTS endpoints now support streaming, reducing the "time-to-first-audio" significantly.
- VoxCPM: Added support for the VoxCPM backend.
- Qwen-TTS: Added support for Qwen-TTS models.
- Piper Voices: Added most remaining Piper voices from Hugging Face to the gallery.
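A minimal sketch against the OpenAI-compatible speech endpoint, assuming a gallery model named `voxcpm` (the model name is a placeholder, and how streaming is toggled may differ; see the docs):

```bash
# OpenAI-style TTS request; "voxcpm" is a placeholder model name.
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voxcpm",
    "input": "Streaming cuts time-to-first-audio significantly."
  }' \
  --output speech.wav
```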
🛠️ Hardware & Backend Updates
- vLLM Omni: A new backend integration for vLLM Omni models.
- Extended Platform Support: Major work on MLX to improve compatibility across CUDA 12, CUDA 13, L4T (Nvidia Jetson), SBSA, and macOS Metal.
- GGUF Cleanup: Dropped redundant VRAM estimation logic for GGUF loading, relying on more accurate internal measurements.
⚠️ Breaking Changes
To keep the project lean and maintainable, we have removed some older backends:
- ExLlama: Removed (deprecated in favor of newer loaders like ExLlamaV2 or llama.cpp).
- Bark: Removed (the upstream project is unmaintained; we recommend using the new TTS alternatives).
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Breaking Changes 🛠
Bug fixes 🐛
- fix(ui): correctly display selected image model by @dedyf5 in #8208
- fix(ui): take account of reasoning in token count calculation by @mudler in #8324
- fix: drop gguf VRAM estimation (now redundant) by @mudler in #8325
- fix(api): Add missing field in initial OpenAI streaming response by @acon96 in #8341
- fix(realtime): Include noAction function in prompt template and handle tool_choice by @richiejp in #8372
- fix: filter GGUF and GGML files from model list by @Yaroslav98214 in #8397
- fix(qwen-asr): Remove contagious slop (DEFAULT_GOAL) from Makefile by @richiejp in #8431
Exciting New Features 🎉
- feat(vllm-omni): add new backend by @mudler in #8188
- feat(vibevoice): add ASR support by @mudler in #8222
- feat: add VoxCPM tts backend by @mudler in #8109
- feat(realtime): Add audio conversations by @richiejp in #6245
- feat(qwen-asr): add support to qwen-asr by @mudler in #8281
- feat(tts): add support for streaming mode by @mudler in #8291
- feat(api): Add transcribe response format request parameter & adjust STT backends by @nanoandrew4 in #8318
- feat(whisperx): add whisperx backend for transcription with speaker diarization by @eureka928 in #8299
- feat(mlx): Add support for CUDA12, CUDA13, L4T, SBSA and CPU by @mudler in #8380
- feat(musicgen): add ace-step and UI interface by @mudler in #8396
- fix(api)!: Stop model prior to deletion by @nanoandrew4 in #8422
- feat(nemo): add Nemo (only asr for now) backend by @mudler in #8436
🧠 Models
- chore(model gallery): add qwen3-tts to model gallery by @mudler in #8187
- chore(model gallery): Add most of not yet present Piper voices from Hugging Face by @rampa3 in #8202
- chore: drop bark which is unmaintained by @mudler in #8207
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8220
- chore(model gallery): Add entry for Mistral Small 3.1 with mmproj by @rampa3 in https://git...
v3.10.1
This is a small patch release with bug fixes and minor polish. We've also added support for Qwen-TTS, which was released just yesterday.
- Fix reasoning detection on reasoning and instruct models
- Support reasoning blocks with openresponses
- API fixes to correctly run LTX-2
- Support Qwen3-TTS!
What's Changed
Bug fixes 🐛
- fix(reasoning): support models with reasoning without starting thinking tag by @mudler in #8132
- fix(tracing): Create trace buffer on first request to enable tracing at runtime by @richiejp in #8148
- fix(videogen): drop incomplete endpoint, add GGUF support for LTX-2 by @mudler in #8160
Exciting New Features 🎉
- feat(openresponses): Support reasoning blocks by @mudler in #8133
- feat: detect thinking support from backend automatically if not explicitly set by @mudler in #8167
- feat(qwen-tts): add Qwen-tts backend by @mudler in #8163
🧠 Models
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8128
- chore(model gallery): add flux 2 and flux 2 klein by @mudler in #8141
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #8153
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8157
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8170
👒 Dependencies
- chore(deps): bump github.com/mudler/cogito from 0.7.2 to 0.8.1 by @dependabot[bot] in #8124
Other Changes
- feat(swagger): update swagger by @localai-bot in #8098
- chore: ⬆️ Update ggml-org/llama.cpp to `287a33017b32600bfc0e81feeb0ad6e81e0dd484` by @localai-bot in #8100
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `2efd19978dd4164e387bf226025c9666b6ef35e2` by @localai-bot in #8099
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8120
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `a48b4a3ade9972faf0adcad47e51c6fc03f0e46d` by @localai-bot in #8121
- chore: ⬆️ Update ggml-org/llama.cpp to `959ecf7f234dc0bc0cd6829b25cb0ee1481aa78a` by @localai-bot in #8122
- chore(deps): Bump llama.cpp to '1c7cf94b22a9dc6b1d32422f72a627787a4783a3' by @mudler in #8136
- chore: drop noisy logs by @mudler in #8142
- chore: ⬆️ Update ggml-org/llama.cpp to `ad8d85bd94cc86e89d23407bdebf98f2e6510c61` by @localai-bot in #8145
- chore: ⬆️ Update ggml-org/whisper.cpp to `7aa8818647303b567c3a21fe4220b2681988e220` by @localai-bot in #8146
- feat(swagger): update swagger by @localai-bot in #8150
- chore(diffusers): add 'av' to requirements.txt by @mudler in #8155
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `329571131d62d64a4f49e1acbef49ae02544fdcd` by @localai-bot in #8152
- chore: ⬆️ Update ggml-org/llama.cpp to `c301172f660a1fe0b42023da990bf7385d69adb4` by @localai-bot in #8151
- chore: ⬆️ Update ggml-org/llama.cpp to `a5eaa1d6a3732bc0f460b02b61c95680bba5a012` by @localai-bot in #8165
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `5e4579c11d0678f9765463582d024e58270faa9c` by @localai-bot in #8166
Full Changelog: v3.10.0...v3.10.1
v3.10.0
🎉 LocalAI 3.10.0 Release! 🚀
LocalAI 3.10.0 is big on agent capabilities, multi-modal support, and cross-platform reliability.
We've added native Anthropic API support, launched a new Video Generation UI, introduced Open Responses API compatibility, and enhanced performance with a unified GPU backend system.
For a full tour, see below!
📌 TL;DR
| Feature | Summary |
|---|---|
| Anthropic API Support | Fully compatible /v1/messages endpoint for seamless drop-in replacement of Claude. |
| Open Responses API | Native support for stateful agents with tool calling, streaming, background mode, and multi-turn conversations, passing all official acceptance tests. |
| Video & Image Generation Suite | New video gen UI + LTX-2 support for text-to-video and image-to-video. |
| Unified GPU Backends | GPU libraries (CUDA, ROCm, Vulkan) packaged inside backend containers — works out of the box on Nvidia, AMD, and ARM64 (Experimental). |
| Tool Streaming & XML Parsing | Full support for streaming tool calls and XML-formatted tool outputs. |
| System-Aware Backend Gallery | Only see backends your system can run (e.g., hide MLX on Linux). |
| Crash Fixes | Prevents crashes on AVX-only CPUs (Intel Sandy/Ivy Bridge) and fixes VRAM reporting on AMD GPUs. |
| Request Tracing | Debug agents & fine-tuning with memory-based request/response logging. |
| Moonshine Backend | Ultra-fast transcription engine for low-end devices. |
| Pocket-TTS | Lightweight, high-fidelity text-to-speech with voice cloning. |
| Vulkan arm64 builds | We now build backends and images for Vulkan on arm64 as well. |
🚀 New Features & Major Enhancements
🤖 Open Responses API: Build Smarter, Autonomous Agents
LocalAI now supports the OpenAI Responses API, enabling powerful agentic workflows locally.
- Stateful conversations via `response_id` — resume and manage long-running agent sessions.
- Background mode: Run agents asynchronously and fetch results later.
- Streaming support for tools, images, and audio.
- Built-in tools: Web search, file search, and computer use (via MCP integrations).
- Multi-turn interaction with dynamic context and tool use.
✅ Ideal for developers building agents that can browse, analyze files, or interact with systems — all on your local machine.
🔧 How to Use:
- Set `response_id` in your request to maintain session state across calls.
- Use `background: true` to run agents asynchronously.
- Retrieve results via `GET /api/v1/responses/{response_id}`.
- Enable streaming with `stream: true` to receive partial responses and tool calls in real time.

📌 Tip: Use `response_id` to build agent orchestration systems that persist context and avoid redundant computation.
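Putting those pieces together, a minimal sketch (the model name and the returned response id are placeholders):

```bash
# Create a background response.
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "input": "Summarize the open issues in this repository",
    "background": true
  }'

# Later, fetch the result with the id returned above.
curl http://localhost:8080/api/v1/responses/resp_abc123
```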
Our support passes all the official acceptance tests.
🧠 Anthropic Messages API: Clone Claude Locally
LocalAI now fully supports the Anthropic messages API.
- Use `https://api.localai.host/v1/messages` as a drop-in replacement for Claude.
- Full tool/function calling support, just like OpenAI.
- Streaming and non-streaming responses.
- Compatible with `anthropic-sdk-go`, LangChain, and other tooling.
🔥 Perfect for teams migrating from Anthropic to local inference with full feature parity.
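A minimal sketch of an Anthropic-style request pointed at a local instance (the host and model name are placeholders; the `anthropic-version` header follows Anthropic's spec):

```bash
# Anthropic Messages request shape, served by LocalAI.
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "my-model",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Hello from an Anthropic client!"}
    ]
  }'
```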
🎥 Video Generation: From Text to Video in the Web UI
- New dedicated video generation page with intuitive controls.
- LTX-2 is supported
- Supports text-to-video and image-to-video workflows.
- Built on top of `diffusers` with full compatibility.

📌 How to Use:
- Go to `/video` in the web UI.
- Enter a prompt (e.g., "A cat walking on a moonlit rooftop").
- Optionally upload an image for image-to-video generation.
- Adjust parameters like `fps`, `num_frames`, and `guidance_scale`.
⚙️ Unified GPU Backends: Acceleration Works Out of the Box
A major architectural upgrade: GPU libraries (CUDA, ROCm, Vulkan) are now packaged inside backend containers.
- Single image: You no longer need to pull a GPU-specific image. Any image works whether or not you have a GPU.
- No more manual GPU driver setup — just run the image and get acceleration.
- Works on Nvidia (CUDA), AMD (ROCm), and ARM64 (Vulkan).
- Vulkan arm64 builds enabled
- Reduced image complexity, faster builds, and consistent performance.
🚀 This means latest/master images now support GPU acceleration on all platforms — no extra config!
Note: this is experimental, please help us by filing an issue if something doesn't work!
🧩 Tool Streaming & Advanced Parsing
Enhance your agent workflows with richer tool interaction.
- Streaming tool calls: Receive partial tool arguments in real time (e.g., `input_json_delta`).
- XML-style tool call parsing: Models that return tools in XML format (`<function>...</function>`) are now properly parsed alongside text.
- Works across all backends (llama.cpp, vLLM, diffusers, etc.).
💡 Enables more natural, real-time interaction with agents that use structured tool outputs.
🌐 System-Aware Backend Gallery: Only Compatible Backends Show
The backend gallery now shows only backends your system can run.
- Auto-detects system capabilities (CPU, GPU, MLX, etc.).
- Hides unsupported backends (e.g., MLX on Linux, CUDA on AMD).
- Shows detected capabilities in the hero section.
🎤 New TTS Backends: Pocket-TTS
Add expressive voice generation to your apps with Pocket-TTS.
- Real-time text-to-speech with voice cloning support (requires HF login).
- Lightweight, fast, and open-source.
- Available in the model gallery.
🗣️ Perfect for voice agents, narrators, or interactive assistants.
❗ Note: Voice cloning requires HF authentication and a registered voice model.
🔍 Request Tracing: Debug Your Agents
Trace requests and responses in memory — great for fine-tuning and agent debugging.
- Enable via runtime setting or API.
- Logs are stored in memory and dropped once the maximum size is reached.
- Fetch logs via `GET /api/v1/trace`.
- Export to JSON for analysis.
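For example, to dump the current trace buffer to a file for offline analysis:

```bash
# Fetch the in-memory request/response log and save it as JSON.
curl http://localhost:8080/api/v1/trace -o trace.json
```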
🪄 New 'Reasoning' Field: Extract Thinking Steps
LocalAI now automatically detects and extracts thinking tags from model output.
- Supports both SSE and non-SSE modes.
- Displays reasoning steps in the chat UI (under "Thinking" tab).
- Fixes issue where thinking content appeared as part of final answer.
🚀 Moonshine Backend: Faster Transcription for Low-End Devices
Add Moonshine, an ONNX-based transcription engine, for fast, lightweight speech-to-text.
- Optimized for low-end devices (Raspberry Pi, older laptops).
- One of the fastest transcription engines available.
- Supports live transcription.
🛠️ Fixes & Stability Improvements
🔧 Prevent BMI2 Crashes on AVX-Only CPUs
Fixed crashes on older Intel CPUs (Ivy Bridge, Sandy Bridge) that lack BMI2 instructions.
- Now safely falls back to `llama-cpp-fallback` (SSE2 only).
- No more `EOF` errors during model warmup.
✅ Ensures LocalAI runs smoothly on older hardware.
📊 Fix Swapped VRAM Usage on AMD GPUs
Correctly parses rocm-smi output: used and total VRAM are no longer swapped.
- Fixes misreported memory usage on dual-Radeon setups.
- Handles `HIP_VISIBLE_DEVICES` properly (e.g., when using only the discrete GPU).
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
v3.9.0
Xmas-release 🎅 LocalAI 3.9.0! 🚀
LocalAI 3.9.0 is focused on stability, resource efficiency, and smarter agent workflows. We've addressed critical issues with model loading, improved system resource management, and introduced a new Agent Jobs panel for scheduling and managing background agentic tasks. Whether you're running models locally or orchestrating complex agent workflows, this release makes it faster, more reliable, and easier to manage.
📌 TL;DR
| Feature | Summary |
|---|---|
| Agent Jobs Panel | Schedule and run background tasks with cron or via API — perfect for automated workflows. |
| Smart Memory Reclaimer | Automatically frees up GPU/VRAM by evicting least recently used models when memory is low. |
| LRU Model Eviction | Models are automatically unloaded from memory based on usage patterns to prevent crashes. |
| MLX & CUDA 13 Support | New model backends and enhanced GPU compatibility for modern hardware. |
| UI Polish & Fixes | Cleaned-up navigation, fixed layout overflow, and various improvements. |
| Vibevoice | Added support for the vibevoice backend! |
🚀 New Features
🤖 Agent Jobs Panel: Schedule & Automate Tasks
LocalAI 3.9.0 introduces a new Agent Jobs panel, allowing you to create, run, and schedule agentic tasks in the background, started programmatically via the API or from the web interface.
- Run agent prompts on a schedule using cron syntax, or via API.
- Agents are defined via the model settings, supporting MCP.
- Trigger jobs via API for integration into CI/CD or external tools.
- Optionally send results to a webhook for post-processing.
- Templates and prompts can be dynamically populated with variables.
✅ Use cases: Daily reports, CI integration, automated data processing, scheduled model evaluations.
🧠 Smart Memory Reclaimer: Auto-Optimize GPU Resources
We’ve introduced a new Memory Reclaimer that monitors system memory usage and automatically frees up GPU/VRAM when needed.
- Tracks memory consumption across all backends.
- When usage exceeds a configured threshold, it evicts the least recently used (LRU) models.
- Prevents out-of-memory crashes and keeps your system stable during high load.
This is a step toward adaptive resource management; future versions will expand it with more advanced policies and finer-grained control.
🔁 LRU Model Eviction: Intelligent Model Management
Building on the new reclaimer, LocalAI now supports LRU (Least Recently Used) eviction for loaded models.
- Set a maximum number of models to keep in memory (e.g., limit to 3).
- When a new model is loaded and the limit is reached, the oldest unused model is automatically unloaded.
- Fully compatible with `single_active_backend` mode (now defaults to LRU=1 for backward compatibility).
💡 Ideal for servers with limited VRAM or when running multiple models in parallel.
🖥️ UI & UX Polish
- Fixed navbar ordering and login icon — clearer navigation and better visual flow.
- Prevented tool call overflow in chat view — no more clipped or misaligned content.
- Unified link paths (e.g., `/browse/` instead of `browse`) for consistency.
- Fixed model selection toggle — header updates correctly when switching models.
- Consistent button styling — uniform colors, hover effects, and accessibility.
📦 Backward Compatibility & Architecture
- Dropped x86_64 Mac support: no longer maintained in GitHub Actions; ARM64 (M1/M2/M3/M4) is now the recommended architecture.
- Updated data storage path from `/usr/share` to `/var/lib`: follows Linux conventions for mutable data.
- Added CUDA 13 support: now available in Docker images and L4T builds.
- New VibeVoice TTS backend: real-time text-to-speech with voice cloning support. You can install it from the model gallery!
- StableDiffusion-GGML now supports LoRA: expand your image-generation capabilities.
🛠️ Fixes & Improvements
- Issue: After v3.8.0, the `/readyz` and `/healthz` endpoints required authentication, breaking Docker health checks and monitoring tools.
- Issue: Fixed crashes when importing models from Hugging Face URLs with subfolders (e.g., `huggingface://user/model/GGUF/model.gguf`).
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Breaking Changes 🛠
- chore: switch from /usr/share to /var/lib for data storage by @poretsky in #7361
- chore: drop drawin-x86_64 support by @mudler in #7616
Bug fixes 🐛
- fix: do not require auth for readyz/healthz endpoints by @mudler in #7403
- fix(ui): navbar ordering and login icon by @mudler in #7407
- fix: configure sbsa packages for arm64 by @mudler in #7413
- fix(ui): prevent box overflow in chat view by @mudler in #7430
- fix(ui): Update few links in web UI from 'browse' to '/browse/' by @rampa3 in #7445
- fix(paths): remove trailing slash from requests by @mudler in #7451
- fix(downloader): do not download model files if not necessary by @mudler in #7492
- fix(config): make syncKnownUsecasesFromString idempotent by @mudler in #7493
- fix: make sure to close on errors by @mudler in #7521
- fix(llama.cpp): handle corner cases with tool array content by @mudler in #7528
- fix(7355): Update llama-cpp grpc for v3 interface by @sredman in #7566
- fix(chat-ui): model selection toggle and new chat by @mudler in #7574
- fix: improve ram estimation by @mudler in #7603
- fix(ram): do not read from cgroup by @mudler in #7606
- fix: correctly propagate error during model load by @mudler in #7610
- fix(ci): remove specific version for grpcio packages by @mudler in #7627
- fix(uri): consider subfolders when expanding huggingface URLs by @mintyleaf in #7634
Exciting New Features 🎉
- feat: agent jobs panel by @mudler in #7390
- chore: refactor css, restyle to be slightly minimalistic by @mudler in https://github.com/mudler/LocalAI/p...
v3.8.0
Welcome to LocalAI 3.8.0 !
LocalAI 3.8.0 focuses on smoothing out the user experience and exposing more power to the user without requiring restarts or complex configuration files. This release introduces a new onboarding flow and a universal model loader that handles everything from HF URLs to local files.
We’ve also improved the chat interface, addressed long-standing requests regarding OpenAI API compatibility (specifically SSE streaming standards) and exposed more granular controls for some backends (llama.cpp) and backend management.
📌 TL;DR
| Feature | Summary |
|---|---|
| Universal Model Import | Import directly from Hugging Face, Ollama, OCI, or local paths. Auto-detects backends and handles chat templates. |
| UI & Index Overhaul | New onboarding wizard, auto-model selection on boot, and a cleaner tabular view for model management. |
| MCP Live Streaming | New: Agent actions and tool calls are now streamed live via the Model Context Protocol—see reasoning in real-time. |
| Hot-Reloadable Settings | Modify watchdogs, API keys, P2P settings, and defaults without restarting the container. |
| Chat enhancements | Chat history and parallel conversations are now persisted in local storage. |
| Strict SSE Compliance | Fixed streaming format to exactly match OpenAI specs (resolves issues with LangChain/JS clients). |
| Advanced Config | Fine-tune context_shift, cache_ram, and parallel workers via YAML options. |
| Logprobs & Logitbias | Added token-level probability support for improved agent/eval workflows. |
Feature Breakdown
🚀 Universal Model Import (URL-based)
We have refactored how models are imported. You no longer need to manually write configuration files for common use cases. The new importer accepts URLs from Hugging Face, Ollama, and OCI registries, as well as local file paths, directly from the web interface (CLI examples below).
import.mp4
- Auto-Detection: The system attempts to identify the correct backend (e.g., `llama.cpp` vs `diffusers`) and applies native chat templates (e.g., `llama-3`, `mistral`) automatically by reading the model metadata.
- Customization during Import: You can override defaults immediately, for example, forcing a specific quantization on a GGUF file or selecting `vLLM` over `transformers`.
- Multimodal Support: Vision components (`mmproj`) are detected and configured automatically.
- File Safety: We added a safeguard to prevent the deletion of model files (blobs) if they are shared by multiple model configurations.
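From the CLI, the same URI schemes work with `local-ai run`; the model identifiers below are illustrative placeholders:

```bash
# Hugging Face GGUF file (repository and file name are examples)
local-ai run huggingface://bartowski/some-model-GGUF/some-model-Q4_K_M.gguf

# Ollama registry
local-ai run ollama://gemma:2b

# OCI registry
local-ai run oci://localai/phi-2:latest
```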
🎨 Complete UI Overhaul
The web interface has been redesigned for better usability and clearer management.
index.mp4
- Onboarding Wizard: A guided flow helps first-time users import or install a model in under 30 seconds.
- Auto-Focus & Selection: The input field captures focus automatically, and a default model is loaded on startup so you don't start in a "no model selected" state.
- Tabular Management: Models and backends are now organized in a cleaner list view, making it easier to see what is installed.
manage.mp4
🤖 Agentic Ecosystem & MCP Live Streaming
LocalAI 3.8.0 significantly upgrades support for agentic workflows using the Model Context Protocol (MCP).
- Live Action Streaming: We have added a new endpoint to stream agent results as they happen. Instead of waiting for the final output, you can now watch the agent "think": seeing tool calls, reasoning steps, and intermediate actions streamed live in the UI.
mcp.mp4
Configuring MCP via the interface is now simplified:
mcp_configuration.mp4
🔁 Runtime System Settings
A new Settings > System panel exposes configuration options that previously required environment variables or a restart.
settings.mp4
- Immediate Effect: Toggling Watchdogs, P2P, and Gallery availability applies instantly.
- API Key Management: You can now generate, rotate, and expire API keys via the UI.
- Network: CORS and CSRF settings are now accessible here (note: these specific network settings still require a restart to take effect).
Note: To persist runtime settings on older LocalAI deployments, you need to mount the `/configuration` directory from the container image.
⚙️ Advanced llama.cpp Configuration
For power users running large context windows or high-throughput setups, we've exposed additional underlying llama.cpp options in the YAML config. You can now tune context shifting, RAM limits for the KV cache, and parallel worker slots.
```yaml
options:
  - context_shift:false
  - cache_ram:-1
  - use_jinja:true
  - parallel:2
  - grpc_servers:localhost:50051,localhost:50052
```

📊 Logprobs & Logitbias Support
This release adds full support for logitbias and logprobs. This is critical for advanced agentic logic, Self-RAG, and evaluating model confidence / hallucination rates. It supports the OpenAI specification.
🛠️ Fixes & Improvements
OpenAI Compatibility:
- SSE Streaming: Fixed a critical issue where streaming responses were slightly non-compliant (e.g., sending empty content chunks or missing `finish_reason`). This resolves integration issues with `openai-node`, `LangChain`, and `LlamaIndex`.
- Top_N Behavior: In the reranker, `top_n` can now be omitted or set to `0` to return all results, rather than defaulting to an arbitrary limit.
General Fixes:
- Model Preview: When downloading, you can now see the actual filename and size before committing to the download.
- Tool Handling: Fixed crashes when tool content is missing or malformed.
- TTS: Fixed dropdown selection states for TTS models.
- Browser Storage: Chat history is now persisted in your browser's local storage. You can switch between parallel chats, rename them, and export them to JSON.
- True Cancellation: Clicking "Stop" during a stream now correctly propagates a cancellation context to the backend (works for `llama.cpp`, `vLLM`, `transformers`, and `diffusers`). This immediately stops generation and frees up resources.
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
❤️ Thank You
Over 35,000 stars and growing. LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Bug fixes 🐛
- fix(reranker): respect `top_n` in the request by @mkhludnev in #7025
- fix(chatterbox): pin numpy by @mudler in #7198
- fix(reranker): support omitting top_n by @mkhludnev in #7199
- fix(api): SSE streaming format to comply with specification by @Copilot in #7182
- fix(edit): propagate correctly opts when reloading by @mudler in #7233
- fix(reranker): llama-cpp sort score desc, crop top_n by @mkhludnev in #7211
- fix: handle tool errors by @mudler in https://github.com/mudl...
v3.7.0
Welcome to LocalAI 3.7.0 👋
This release introduces Agentic MCP support with full WebUI integration, a brand-new neutts TTS backend, fuzzy model search, long-form TTS chunking for chatterbox, and a complete WebUI overhaul.
We’ve also fixed critical bugs, improved stability, and enhanced compatibility with OpenAI’s APIs.
📌 TL;DR – What’s New in LocalAI 3.7.0
| Feature | Summary |
|---|---|
| 🤖 Agentic MCP Support (WebUI-enabled) | Build AI agents that use real tools (web search, code exec). Fully-OpenAI compatible and integrated into the WebUI. |
| 🎙️ neutts TTS Backend (Neuphonic-powered) | Generate natural, high-quality speech with low-latency audio — ideal for voice assistants. |
| 🖼️ WebUI enhancements | Faster, cleaner UI with real-time updates and full YAML model control. |
| 💬 Long-Text TTS Chunking (Chatterbox) | Generate natural-sounding long-form audio by intelligently splitting text and preserving context. |
| 🧩 Advanced Agent Controls | Fine-tune agent behavior with new options for retries, reasoning, and re-evaluation. |
| 📸 New Video Creation Endpoint | We now support the OpenAI-compatible /v1/videos endpoint for text-to-video generation. |
| 🐍 Enhanced Whisper compatibility | Whisper.cpp is now supported on various CPU variants (AVX, AVX2, etc.) to prevent illegal instruction crashes. |
| 🔍 Fuzzy Gallery Search | Find models in the gallery even with typos (e.g., gema finds gemma). |
| 📦 Easier Model & Backend Management | Import, edit, and delete models directly via clean YAML in the WebUI. |
| Check out the new realtime voice assistant example (multilingual). | |
| Fixed critical crashes, deadlocks, session events, OpenAI compliance, and JSON schema panics. | |
| 🧠 Qwen 3 VL | Support for Qwen 3 VL with llama.cpp/gguf models |
🔥 What’s New in Detail
🤖 Agentic MCP Support – Build Intelligent, Tool-Using AI Agents
We're proud to announce full Agentic MCP support, a feature for building AI agents that can reason, plan, and execute actions using external tools like web search, code execution, and data retrieval. You can use the standard chat/completions endpoint, powered by an agent in the background.
Full documentation is available here
✅ Now in WebUI: A dedicated toggle appears in the chat interface when a model supports MCP. Just click to enable agent mode.
✨ Key Features:
- New Endpoint: `POST /mcp/v1/chat/completions` (OpenAI-compatible).
- Flexible Tool Configuration:

```yaml
mcp:
  stdio: |
    {
      "mcpServers": {
        "duckduckgo": {
          "command": "docker",
          "args": ["run", "-i", "--rm", "ghcr.io/mudler/mcps/duckduckgo:master"]
        }
      }
    }
```
- Advanced Agent Control via the `agent` config:

```yaml
agent:
  max_attempts: 3
  max_iterations: 5
  enable_reasoning: true
  enable_re_evaluation: true
```

- `max_attempts`: Retry failed tool calls up to N times.
- `max_iterations`: Limit how many times the agent can loop through reasoning.
- `enable_reasoning`: Allow step-by-step thought processes (e.g., chain-of-thought).
- `enable_re_evaluation`: Re-analyze decisions when tool results are ambiguous.
You can find some plug-n-play MCPs here: https://github.com/mudler/MCPs
Under the hood, MCP functionality is powered by https://github.com/mudler/cogito
🖼️ WebUI enhancements
The WebUI has had a major overhaul:
- The chat view now has an MCP toggle for models that have `mcp` settings enabled in the model config file.
- The model editor has been simplified to show/edit the YAML settings of the model.
- More reactive: dropped HTMX in favor of Alpine.js and vanilla JavaScript.
- Various fixes, including model deletion.
🎙️ Introducing neutts TTS Backend – Natural Speech, Low Latency
Say hello to neutts, a new, lightweight TTS backend powered by Neuphonic, delivering high-quality, natural-sounding speech with minimal overhead.
🎛️ Setup Example
```yaml
name: neutts-english
backend: neutts
parameters:
  model: neuphonic/neutts-air
tts:
  audio_path: "./output.wav"
  streaming: true
options:
  # text transcription of the provided audio file
  - ref_text: "So I'm live on radio..."
known_usecases:
  - tts
```
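Once configured, generation goes through LocalAI's existing `/tts` endpoint; a minimal sketch, assuming the model name from the config above:

```bash
# Synthesize speech with the neutts-english model defined above.
curl http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "neutts-english",
    "input": "So, I am live on the radio."
  }' \
  --output output.wav
```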
🐍 Whisper.cpp enhancements
whisper.cpp CPU variants are now available for:
- `avx`
- `avx2`
- `avx512`
- `fallback` (no optimized instructions available)
These variants are optimized for specific instruction sets and reduce crashes on older or non-AVX CPUs.
🔍 Smarter Gallery Search: Fuzzy & Case-Insensitive Matching
Searching for gemma now finds gemma-3, gemma2, etc. — even with typos like gemaa or gema.
🧩 Improved Tool & Schema Handling – No More Crashes
We’ve fixed multiple edge cases that caused crashes or silent failures in tool usage.
✅ Fixes:
- Nullable JSON Schemas: `"type": ["string", "null"]` now works without panics.
- Empty Parameters: Tools with missing or empty `parameters` are now handled gracefully.
- Strict Mode Enforcement: When `strict_mode: true`, the model must pick a tool — no more skipping.
- Multi-Type Arrays: Safe handling of `["string", "null"]` in function definitions.
🔄 Interaction with Grammar Triggers:
`strict_mode` and grammar rules work together — if a tool is required and the function definition is invalid, the server returns a clear JSON error instead of crashing.
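To illustrate the fix, here is a chat/completions request whose tool schema uses a nullable type — the pattern that previously caused panics. The model and tool names are placeholders:

```bash
# Tool definition exercising the fixed nullable-type handling.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "What is the weather in Turin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "units": {"type": ["string", "null"]}
          },
          "required": ["city"]
        }
      }
    }]
  }'
```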
📸 New Video Creation Endpoint: OpenAI-Compatible
LocalAI now supports OpenAI’s /v1/videos endpoint for generating videos from text prompts.
📌 Usage Example:
```bash
curl http://localhost:8080/v1/videos \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-..." \
  -d '{
    "model": "sora",
    "prompt": "A cat walking through a forest at sunset",
    "size": "1024x576"
  }'
```

🧠 Qwen 3 VL in llama.cpp
Support has been added for Qwen 3 VL in llama.cpp, and we have updated llama.cpp to the latest version! As a reminder, Qwen 3 VL and multimodal models are also compatible with our vLLM and MLX backends. Qwen 3 VL models are already available in the model gallery:
- `qwen3-vl-30b-a3b-instruct`
- `qwen3-vl-30b-a3b-thinking`
- `qwen3-vl-4b-instruct`
- `qwen3-vl-32b-instruct`
- `qwen3-vl-4b-thinking`
- `qwen3-vl-2b-thinking`
- `qwen3-vl-2b-instruct`
Note: upgrading the llama.cpp backend is necessary if you already have a LocalAI installation.
🚀 (CI) Gallery Updater Agent: Auto-Detect & Suggest New Models
We’ve added an autonomous CI agent that scans Hugging Face daily for new models and opens PRs to update the gallery.
✨ How It Works:
- Scans HF for new, trending models
- Extracts base model, quantization, and metadata.
- Uses cogito (our agentic framework) to assign the model to the correct family and to obtain the model information.
- Opens a PR with:
  - Suggested `name`, `family`, and `usecases`
  - Link to HF model
  - YAML snippet for import
🔧 Critical Bug Fixes & Stability Improvements
| Issue | Fix | Impact |
|---|---|---|
| 📌 WebUI Crash on Model Load | Fixed `can't evaluate field Name in type string` error | Models now render even without config files |
| 🔁 Deadlock in Model Load/Idle Checks | Guarded against race conditions during model loading | Improved performance under load |
| 📞 Realtime API Compliance | Added `session.created` event; removed redundant `conversation.created` | Works with VoxInput, OpenAI clients, and more |
| 📥 MCP Response Formatting | Output wrapped in `message` field | Matches OpenAI spec — better client compatibility |
| 🛑 JSON Error Responses | Now return clean JSON instead of HTML | Scripts and libraries no longer break on auth failures |
| 🔄 Session Registration | Fixed initial MCP calls failing due to cache issues | Reliable first-time use |
| 🎧 kokoro TTS | Returns full audio, not partial | Better for long-form TTS |
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Acts as a drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | A powerful Local AI agent management platform. Serves as a drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
v3.6.0
What's Changed
Bug fixes 🐛
Exciting New Features 🎉
- feat(kokoro): add support for l4t devices by @mudler in #6322
- feat(chatterbox): support multilingual by @mudler in #6240
🧠 Models
- chore(model gallery): add qwen-image-edit-2509 by @mudler in #6336
- chore(models): add whisper-turbo via whisper.cpp by @mudler in #6340
- chore(model gallery): add ibm-granite_granite-4.0-h-small by @mudler in #6373
- chore(model gallery): add ibm-granite_granite-4.0-h-tiny by @mudler in #6374
- chore(model gallery): add ibm-granite_granite-4.0-h-micro by @mudler in #6375
- chore(model gallery): add ibm-granite_granite-4.0-micro by @mudler in #6376
👒 Dependencies
- chore(deps): bump grpcio from 1.74.0 to 1.75.0 in /backend/python/transformers by @dependabot[bot] in #6332
- chore(deps): bump securego/gosec from 2.22.8 to 2.22.9 by @dependabot[bot] in #6324
- chore(deps): bump llama.cpp to '72b24d96c6888c609d562779a23787304ae4609c' by @mudler in #6349
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/coqui by @dependabot[bot] in #6353
- chore(deps): bump transformers from 4.48.3 to 4.56.2 in /backend/python/coqui by @dependabot[bot] in #6330
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/diffusers by @dependabot[bot] in #6361
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/rerankers by @dependabot[bot] in #6360
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/common/template by @dependabot[bot] in #6358
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/vllm by @dependabot[bot] in #6357
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/bark by @dependabot[bot] in #6359
- chore(deps): bump grpcio from 1.75.0 to 1.75.1 in /backend/python/transformers by @dependabot[bot] in #6362
- chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/exllama2 by @dependabot[bot] in #6356
Other Changes
- chore: ⬆️ Update ggml-org/llama.cpp to `7f766929ca8e8e01dcceb1c526ee584f7e5e1408` by @localai-bot in #6319
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6318
- chore: ⬆️ Update ggml-org/llama.cpp to `da30ab5f8696cabb2d4620cdc0aa41a298c54fd6` by @localai-bot in #6321
- chore: ⬆️ Update ggml-org/llama.cpp to `1d0125bcf1cbd7195ad0faf826a20bc7cec7d3f4` by @localai-bot in #6335
- chore(cudss): add cudds to l4t images by @mudler in #6338
- chore: ⬆️ Update ggml-org/llama.cpp to `4ae88d07d026e66b41e85afece74e88af54f4e66` by @localai-bot in #6339
- CI: disable build-testing on PRs against arm64 by @mudler in #6341
- chore(deps): bump llama.cpp to '835b2b915c52bcabcd688d025eacff9a07b65f52' by @mudler in #6347
- chore: ⬆️ Update ggml-org/llama.cpp to `4807e8f96a61b2adccebd5e57444c94d18de7264` by @localai-bot in #6350
- chore: ⬆️ Update ggml-org/llama.cpp to `bd0af02fc96c2057726f33c0f0daf7bb8f3e462a` by @localai-bot in #6352
- Revert "chore(deps): bump transformers from 4.48.3 to 4.56.2 in /backend/python/coqui" by @mudler in #6363
- chore: ⬆️ Update ggml-org/whisper.cpp to `32be14f8ebfc0498c2c619182f0d7f4c822d52c4` by @localai-bot in #6354
- chore: ⬆️ Update ggml-org/llama.cpp to `5f7e166cbf7b9ca928c7fad990098ef32358ac75` by @localai-bot in #6355
- chore: ⬆️ Update ggml-org/llama.cpp to `b2ba81dbe07b6dbea9c96b13346c66973dede32c` by @localai-bot in #6366
- chore: ⬆️ Update ggml-org/whisper.cpp to `8c0855fd6bb115e113c0dca6255ea05f774d35f7` by @localai-bot in #6365
- chore: ⬆️ Update ggml-org/whisper.cpp to `7849aff7a2e1f4234aa31b01a1870906d5431959` by @localai-bot in #6367
- chore: ⬆️ Update ggml-org/llama.cpp to `1fe4e38cc20af058ed320bd46cac934991190056` by @localai-bot in #6368
- chore: ⬆️ Update ggml-org/llama.cpp to `d64c8104f090b27b1f99e8da5995ffcfa6b726e2` by @localai-bot in #6371
New Contributors
Full Changelog: v3.5.4...v3.6.0
v3.5.4
What's Changed
Bug fixes 🐛
Other Changes
- chore: ⬆️ Update ggml-org/whisper.cpp to `44fa2f647cf2a6953493b21ab83b50d5f5dbc483` by @localai-bot in #6317
- chore: ⬆️ Update ggml-org/llama.cpp to `f432d8d83e7407073634c5e4fd81a3d23a10827f` by @localai-bot in #6316
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6315
Full Changelog: v3.5.3...v3.5.4
v3.5.3
What's Changed
Bug fixes 🐛
🧠 Models
- chore(model gallery): add mistralai_magistral-small-2509 by @mudler in #6309
- chore(model gallery): add impish_qwen_14b-1m by @mudler in #6310
- chore(model gallery): add aquif-3.5-a4b-think by @mudler in #6311
👒 Dependencies
- chore: ⬆️ Update ggml-org/llama.cpp to `3edd87cd055a45d885fa914d879d36d33ecfc3e1` by @localai-bot in #6308
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6307
Full Changelog: v3.5.2...v3.5.3
v3.5.2
What's Changed
👒 Dependencies
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6305
- chore: ⬆️ Update ggml-org/llama.cpp to `0320ac5264279d74f8ee91bafa6c90e9ab9bbb91` by @localai-bot in #6306
Full Changelog: v3.5.1...v3.5.2
