
Releases: mudler/LocalAI

v3.11.0

07 Feb 21:31
944874d


🎉 LocalAI 3.11.0 Release! 🚀




LocalAI 3.11.0 is a major update to audio and multimodal capabilities.

We are introducing Realtime Audio Conversations, a dedicated Music Generation UI, and a massive expansion of ASR (Speech-to-Text) and TTS backends. Whether you want to talk to your AI, clone voices, transcribe with speaker identification, or generate songs, this release has you covered.

Check out the highlights below!


📌 TL;DR

  • Realtime Audio: Native support for audio conversations, enabling fluid voice interactions similar to OpenAI's Realtime API (see the documentation).
  • Music Generation UI: A new UI for MusicGen (Ace-Step), letting you generate music from text prompts directly in the browser.
  • New ASR Backends: Added WhisperX (with Speaker Diarization), VibeVoice, Qwen-ASR, and Nvidia NeMo.
  • TTS Streaming: Text-to-Speech now supports streaming mode for lower-latency responses (VoxCPM only for now).
  • vLLM Omni: Added support for vLLM Omni, expanding our high-performance inference capabilities.
  • Speaker Diarization: Native support for identifying different speakers in transcriptions via WhisperX.
  • Hardware Expansion: Expanded build support for CUDA 12/13, L4T (Jetson), and SBSA, plus better Metal (Apple Silicon) integration with MLX backends.
  • Breaking Changes: The ExLlama (deprecated) and Bark (unmaintained) backends have been removed.

🚀 New Features & Major Enhancements

🎙️ Realtime Audio Conversations

LocalAI 3.11.0 introduces native support for Realtime Audio Conversations.

  • Enables fluid, low-latency voice interaction with agents.
  • Logic handled directly within the LocalAI pipeline for seamless audio-in/audio-out workflows.
  • Support for STT/TTS and voice-to-voice models (experimental).
  • Support for tool calls.

🗣️ Talk to your LocalAI: This brings us one step closer to a fully local, voice-native assistant experience compatible with standard client implementations.

See the documentation for details.


🎵 Music Generation UI & Ace-Step

We have added a dedicated interface for music generation!

  • New Backend: Support for Ace-Step (MusicGen) via the ace-step backend.
  • Web UI Integration: Generate musical clips directly from the LocalAI Web UI.
  • Simple text-to-music workflow (e.g., "Lo-fi hip hop beat for studying").
(Screenshot: generating a clip with ace-step-turbo in the LocalAI Web UI)

🎧 Massive ASR (Speech-to-Text) Expansion

This release significantly broadens our transcription capabilities with four new backends (see the example after this list):

  1. WhisperX: Provides fast transcription with Speaker Diarization (identifying who is speaking).
  2. VibeVoice: Now supports ASR in addition to TTS.
  3. Qwen-ASR: Support for Qwen's powerful speech recognition models.
  4. Nvidia NeMo: Initial support for NeMo ASR.
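These plug into the existing OpenAI-compatible transcription endpoint. A minimal sketch (the model name whisperx-large is hypothetical; use whatever name you installed the model under):

curl http://localhost:8080/v1/audio/transcriptions \
  -F model="whisperx-large" \
  -F file="@meeting.wav"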

🗣️ TTS Streaming & New Voices

Text-to-Speech gets a speed boost and new options:

  • Streaming Support: TTS endpoints now support streaming, significantly reducing time-to-first-audio (see the sketch after this list).
  • VoxCPM: Added support for the VoxCPM backend.
  • Qwen-TTS: Added support for Qwen-TTS models.
  • Piper Voices: Added most remaining Piper voices from Hugging Face to the gallery.
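A minimal sketch of a request against the OpenAI-compatible speech endpoint (the model name voxcpm is hypothetical, and how streaming is delivered depends on the backend):

curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voxcpm",
    "input": "Hello from LocalAI!"
  }' \
  --output speech.wav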

🛠️ Hardware & Backend Updates

  • vLLM Omni: A new backend integration for vLLM Omni models.
  • Extended Platform Support: Major work on MLX to improve compatibility across CUDA 12, CUDA 13, L4T (Nvidia Jetson), SBSA, and macOS Metal.
  • GGUF Cleanup: Dropped redundant VRAM estimation logic for GGUF loading, relying on more accurate internal measurements.

⚠️ Breaking Changes

To keep the project lean and maintainable, we have removed some older backends:

  • ExLlama: Removed (deprecated in favor of newer loaders like ExLlamaV2 or llama.cpp).
  • Bark: Removed (the upstream project is unmaintained; we recommend using the new TTS alternatives).

🚀 The Complete Local Stack for Privacy-First AI

LocalAI Logo

LocalAI

The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required.

Link: https://github.com/mudler/LocalAI

LocalAGI Logo

LocalAGI

Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI.

Link: https://github.com/mudler/LocalAGI

LocalRecall Logo

LocalRecall

RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI.

Link: https://github.com/mudler/LocalRecall


❤️ Thank You

LocalAI is a true FOSS movement — built by contributors, powered by community.

If you believe in privacy-first AI:

  • Star the repo
  • 💬 Contribute code, docs, or feedback
  • 📣 Share with others

Your support keeps this stack alive.


✅ Full Changelog

📋 Click to expand full changelog

What's Changed

Breaking Changes 🛠

  • chore(exllama): drop backend now almost deprecated by @mudler in #8186

Bug fixes 🐛

  • fix(ui): correctly display selected image model by @dedyf5 in #8208
  • fix(ui): take account of reasoning in token count calculation by @mudler in #8324
  • fix: drop gguf VRAM estimation (now redundant) by @mudler in #8325
  • fix(api): Add missing field in initial OpenAI streaming response by @acon96 in #8341
  • fix(realtime): Include noAction function in prompt template and handle tool_choice by @richiejp in #8372
  • fix: filter GGUF and GGML files from model list by @Yaroslav98214 in #8397
  • fix(qwen-asr): Remove contagious slop (DEFAULT_GOAL) from Makefile by @richiejp in #8431

Exciting New Features 🎉

  • feat(vllm-omni): add new backend by @mudler in #8188
  • feat(vibevoice): add ASR support by @mudler in #8222
  • feat: add VoxCPM tts backend by @mudler in #8109
  • feat(realtime): Add audio conversations by @richiejp in #6245
  • feat(qwen-asr): add support to qwen-asr by @mudler in #8281
  • feat(tts): add support for streaming mode by @mudler in #8291
  • feat(api): Add transcribe response format request parameter & adjust STT backends by @nanoandrew4 in #8318
  • feat(whisperx): add whisperx backend for transcription with speaker diarization by @eureka928 in #8299
  • feat(mlx): Add support for CUDA12, CUDA13, L4T, SBSA and CPU by @mudler in #8380
  • feat(musicgen): add ace-step and UI interface by @mudler in #8396
  • fix(api)!: Stop model prior to deletion by @nanoandrew4 in #8422
  • feat(nemo): add Nemo (only asr for now) backend by @mudler in #8436

🧠 Models

  • chore(model gallery): add qwen3-tts to model gallery by @mudler in #8187
  • chore(model gallery): Add most of not yet present Piper voices from Hugging Face by @rampa3 in #8202
  • chore: drop bark which is unmaintained by @mudler in #8207
  • chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8220
  • chore(model gallery): Add entry for Mistral Small 3.1 with mmproj by @rampa3 in https://git...

v3.10.1

23 Jan 14:21
923ebbb


This is a small patch release providing bug fixes and minor polish. It also adds support for Qwen-TTS, which was released just yesterday.

  • Fix reasoning detection on reasoning and instruct models
  • Support reasoning blocks with openresponses
  • API fixes to correctly run LTX-2
  • Support Qwen3-TTS!

What's Changed

Bug fixes 🐛

  • fix(reasoning): support models with reasoning without starting thinking tag by @mudler in #8132
  • fix(tracing): Create trace buffer on first request to enable tracing at runtime by @richiejp in #8148
  • fix(videogen): drop incomplete endpoint, add GGUF support for LTX-2 by @mudler in #8160

Exciting New Features 🎉

  • feat(openresponses): Support reasoning blocks by @mudler in #8133
  • feat: detect thinking support from backend automatically if not explicitly set by @mudler in #8167
  • feat(qwen-tts): add Qwen-tts backend by @mudler in #8163

🧠 Models

  • chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8128
  • chore(model gallery): add flux 2 and flux 2 klein by @mudler in #8141
  • chore(model-gallery): ⬆️ update checksum by @localai-bot in #8153
  • chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8157
  • chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #8170

👒 Dependencies

  • chore(deps): bump github.com/mudler/cogito from 0.7.2 to 0.8.1 by @dependabot[bot] in #8124

Other Changes

  • feat(swagger): update swagger by @localai-bot in #8098
  • chore: ⬆️ Update ggml-org/llama.cpp to 287a33017b32600bfc0e81feeb0ad6e81e0dd484 by @localai-bot in #8100
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 2efd19978dd4164e387bf226025c9666b6ef35e2 by @localai-bot in #8099
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #8120
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to a48b4a3ade9972faf0adcad47e51c6fc03f0e46d by @localai-bot in #8121
  • chore: ⬆️ Update ggml-org/llama.cpp to 959ecf7f234dc0bc0cd6829b25cb0ee1481aa78a by @localai-bot in #8122
  • chore(deps): Bump llama.cpp to '1c7cf94b22a9dc6b1d32422f72a627787a4783a3' by @mudler in #8136
  • chore: drop noisy logs by @mudler in #8142
  • chore: ⬆️ Update ggml-org/llama.cpp to ad8d85bd94cc86e89d23407bdebf98f2e6510c61 by @localai-bot in #8145
  • chore: ⬆️ Update ggml-org/whisper.cpp to 7aa8818647303b567c3a21fe4220b2681988e220 by @localai-bot in #8146
  • feat(swagger): update swagger by @localai-bot in #8150
  • chore(diffusers): add 'av' to requirements.txt by @mudler in #8155
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 329571131d62d64a4f49e1acbef49ae02544fdcd by @localai-bot in #8152
  • chore: ⬆️ Update ggml-org/llama.cpp to c301172f660a1fe0b42023da990bf7385d69adb4 by @localai-bot in #8151
  • chore: ⬆️ Update ggml-org/llama.cpp to a5eaa1d6a3732bc0f460b02b61c95680bba5a012 by @localai-bot in #8165
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 5e4579c11d0678f9765463582d024e58270faa9c by @localai-bot in #8166

Full Changelog: v3.10.0...v3.10.1

v3.10.0

18 Jan 21:00
5f403b1


🎉 LocalAI 3.10.0 Release! 🚀




LocalAI 3.10.0 is big on agent capabilities, multi-modal support, and cross-platform reliability.

We've added native Anthropic API support, launched a new Video Generation UI, introduced Open Responses API compatibility, and enhanced performance with a unified GPU backend system.

For a full tour, see below!


📌 TL;DR

  • Anthropic API Support: Fully compatible /v1/messages endpoint for seamless drop-in replacement of Claude.
  • Open Responses API: Native support for stateful agents with tool calling, streaming, background mode, and multi-turn conversations, passing all official acceptance tests.
  • Video & Image Generation Suite: New video generation UI plus LTX-2 support for text-to-video and image-to-video.
  • Unified GPU Backends: GPU libraries (CUDA, ROCm, Vulkan) packaged inside backend containers — works out of the box on Nvidia, AMD, and ARM64 (experimental).
  • Tool Streaming & XML Parsing: Full support for streaming tool calls and XML-formatted tool outputs.
  • System-Aware Backend Gallery: Only see backends your system can run (e.g., MLX is hidden on Linux).
  • Crash Fixes: Prevents crashes on AVX-only CPUs (Intel Sandy/Ivy Bridge) and fixes VRAM reporting on AMD GPUs.
  • Request Tracing: Debug agents and fine-tuning with memory-based request/response logging.
  • Moonshine Backend: Ultra-fast transcription engine for low-end devices.
  • Pocket-TTS: Lightweight, high-fidelity text-to-speech with voice cloning.
  • Vulkan arm64 Builds: We now build backends and images for Vulkan on arm64 as well.

🚀 New Features & Major Enhancements

🤖 Open Responses API: Build Smarter, Autonomous Agents

LocalAI now supports the OpenAI Responses API, enabling powerful agentic workflows locally.

  • Stateful conversations via response_id — resume and manage long-running agent sessions.
  • Background mode: Run agents asynchronously and fetch results later.
  • Streaming support for tools, images, and audio.
  • Built-in tools: Web search, file search, and computer use (via MCP integrations).
  • Multi-turn interaction with dynamic context and tool use.

✅ Ideal for developers building agents that can browse, analyze files, or interact with systems — all on your local machine.

🔧 How to Use:

  • Set response_id in your request to maintain session state across calls.
  • Use background: true to run agents asynchronously.
  • Retrieve results via GET /api/v1/responses/{response_id}.
  • Enable streaming with stream: true to receive partial responses and tool calls in real time.

📌 Tip: Use response_id to build agent orchestration systems that persist context and avoid redundant computation.
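A minimal sketch of the background flow (the model name and response_id are placeholders; the creation route is assumed to mirror OpenAI's POST /v1/responses, while retrieval uses the GET endpoint listed above):

# Start a background run:
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-agent-model",
    "input": "Summarize the open issues in my tracker",
    "background": true
  }'

# Later, fetch the result using the returned response_id:
curl http://localhost:8080/api/v1/responses/resp_abc123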

Our support passes all the official acceptance tests:

(Screenshot: Open Responses API acceptance tests passing)

🧠 Anthropic Messages API: Clone Claude Locally

LocalAI now fully supports the Anthropic messages API.

  • Use https://api.localai.host/v1/messages as a drop-in replacement for Claude.
  • Full tool/function calling support, just like OpenAI.
  • Streaming and non-streaming responses.
  • Compatible with anthropic-sdk-go, LangChain, and other tooling.

🔥 Perfect for teams migrating from Anthropic to local inference with full feature parity.
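A minimal sketch against a local instance (host, port, and model name are placeholders; the body follows Anthropic's Messages schema):

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-local-model",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello from LocalAI!"}]
  }'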


🎥 Video Generation: From Text to Video in the Web UI

  • New dedicated video generation page with intuitive controls.
  • LTX-2 is supported.
  • Supports text-to-video and image-to-video workflows.
  • Built on top of diffusers with full compatibility.

📌 How to Use:

  • Go to /video in the web UI.
  • Enter a prompt (e.g., "A cat walking on a moonlit rooftop").
  • Optionally upload an image for image-to-video generation.
  • Adjust parameters like fps, num_frames, and guidance_scale.

⚙️ Unified GPU Backends: Acceleration Works Out of the Box

A major architectural upgrade: GPU libraries (CUDA, ROCm, Vulkan) are now packaged inside backend containers.

  • Single image: You no longer need to pull a GPU-specific image; any image works whether or not you have a GPU.
  • No more manual GPU driver setup — just run the image and get acceleration.
  • Works on Nvidia (CUDA), AMD (ROCm), and ARM64 (Vulkan).
  • Vulkan arm64 builds are enabled.
  • Reduced image complexity, faster builds, and consistent performance.

🚀 This means latest/master images now support GPU acceleration on all platforms — no extra config!

Note: this is experimental, please help us by filing an issue if something doesn't work!
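A minimal sketch of running a unified image (image tag and device flags are assumptions; adjust them for your setup):

# Nvidia (CUDA):
docker run -p 8080:8080 --gpus all localai/localai:latest

# AMD (ROCm), using the standard ROCm device passthrough:
docker run -p 8080:8080 --device /dev/kfd --device /dev/dri localai/localai:latest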


🧩 Tool Streaming & Advanced Parsing

Enhance your agent workflows with richer tool interaction.

  • Streaming tool calls: Receive partial tool arguments in real time (e.g., input_json_delta).
  • XML-style tool call parsing: Models that return tools in XML format (<function>...</function>) are now properly parsed alongside text.
  • Works across all backends (llama.cpp, vLLM, diffusers, etc.).

💡 Enables more natural, real-time interaction with agents that use structured tool outputs.


🌐 System-Aware Backend Gallery: Only Compatible Backends Show

The backend gallery now shows only backends your system can run.

  • Auto-detects system capabilities (CPU, GPU, MLX, etc.).
  • Hides unsupported backends (e.g., MLX on Linux, CUDA on AMD).
  • Shows detected capabilities in the hero section.

🎤 New TTS Backends: Pocket-TTS

Add expressive voice generation to your apps with Pocket-TTS.

  • Real-time text-to-speech with voice cloning support (requires HF login).
  • Lightweight, fast, and open-source.
  • Available in the model gallery.

🗣️ Perfect for voice agents, narrators, or interactive assistants.
Note: Voice cloning requires HF authentication and a registered voice model.


🔍 Request Tracing: Debug Your Agents

Trace requests and responses in memory — great for fine-tuning and agent debugging.

  • Enable via runtime setting or API.
  • Logs are stored in memory and dropped once the maximum size is reached.
  • Fetch logs via GET /api/v1/trace (see the sketch after this list).
  • Export to JSON for analysis.
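A minimal sketch of pulling the trace log for offline analysis (host and port assumed):

curl http://localhost:8080/api/v1/trace --output traces.json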

🪄 New 'Reasoning' Field: Extract Thinking Steps

LocalAI now automatically detects and extracts thinking tags from model output.

  • Supports both SSE and non-SSE modes.
  • Displays reasoning steps in the chat UI (under "Thinking" tab).
  • Fixes an issue where thinking content appeared as part of the final answer.

🚀 Moonshine Backend: Faster Transcription for Low-End Devices

Moonshine is a new ONNX-based transcription engine for fast, lightweight speech-to-text.

  • Optimized for low-end devices (Raspberry Pi, older laptops).
  • One of the fastest transcription engines available.
  • Supports live transcription.

🛠️ Fixes & Stability Improvements

🔧 Prevent BMI2 Crashes on AVX-Only CPUs

Fixed crashes on older Intel CPUs (Ivy Bridge, Sandy Bridge) that lack BMI2 instructions.

  • Now safely falls back to llama-cpp-fallback (SSE2 only).
  • No more EOF errors during model warmup.

✅ Ensures LocalAI runs smoothly on older hardware.


📊 Fix Swapped VRAM Usage on AMD GPUs

Correctly parses rocm-smi output: used and total VRAM values are no longer swapped.

  • Fixes misreported memory usage on dual-Radeon setups.
  • Handles HIP_VISIBLE_DEVICES properly (e.g., when using only discrete GPU).

🚀 The Complete Local Stack for Privacy-First AI

LocalAI Logo

LocalAI

The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required.

Link: https://github.com/mudler/LocalAI

LocalAGI Logo

LocalAGI

Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI.

Link: https://github.com/mudler/LocalAGI

LocalRecall Logo

LocalRecall

RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI.

Link: https://github.com/mudler/LocalRecall


v3.9.0

24 Dec 14:31
aadec0b


Xmas-release 🎅 LocalAI 3.9.0! 🚀




LocalAI 3.9.0 is focused on stability, resource efficiency, and smarter agent workflows. We've addressed critical issues with model loading, improved system resource management, and introduced a new Agent Jobs panel for scheduling and managing background agentic tasks. Whether you're running models locally or orchestrating complex agent workflows, this release makes it faster, more reliable, and easier to manage.

📌 TL;DR

  • Agent Jobs Panel: Schedule and run background tasks with cron or via API — perfect for automated workflows.
  • Smart Memory Reclaimer: Automatically frees GPU/VRAM by evicting least recently used models when memory is low.
  • LRU Model Eviction: Models are automatically unloaded from memory based on usage patterns to prevent crashes.
  • MLX & CUDA 13 Support: New model backends and enhanced GPU compatibility for modern hardware.
  • UI Polish & Fixes: Cleaned-up navigation, fixed layout overflow, and various improvements.
  • VibeVoice: Added support for the VibeVoice backend!

🚀 New Features

🤖 Agent Jobs Panel: Schedule & Automate Tasks

LocalAI 3.9.0 introduces a new Agent Jobs panel in the web UI and API, allowing you to create, run, and schedule agentic tasks in the background, started programmatically via the API or from the web interface.

  • Run agent prompts on a schedule using cron syntax, or via API.
  • Agents are defined via the model settings, supporting MCP.
  • Trigger jobs via API for integration into CI/CD or external tools.
  • Optionally send results to a webhook for post-processing.
  • Templates and prompts can be dynamically populated with variables.

✅ Use cases: Daily reports, CI integration, automated data processing, scheduled model evaluations.

(Screenshot: the Agent Jobs panel)

🧠 Smart Memory Reclaimer: Auto-Optimize GPU Resources

We’ve introduced a new Memory Reclaimer that monitors system memory usage and automatically frees up GPU/VRAM when needed.

(Screenshot: memory usage shown in the LocalAI API view)
  • Tracks memory consumption across all backends.
  • When usage exceeds a configured threshold, it evicts the least recently used (LRU) models.
  • Prevents out-of-memory crashes and keeps your system stable during high load.

This is a step toward adaptive resource management; future versions will expand it with more advanced policies and finer control.


🔁 LRU Model Eviction: Intelligent Model Management

Building on the new reclaimer, LocalAI now supports LRU (Least Recently Used) eviction for loaded models.

(Screenshot: LRU settings in the Settings panel)
  • Set a maximum number of models to keep in memory (e.g., limit to 3).
  • When a new model is loaded and the limit is reached, the oldest unused model is automatically unloaded.
  • Fully compatible with single_active_backend mode (now defaults to LRU=1 for backward compatibility).

💡 Ideal for servers with limited VRAM or when running multiple models in parallel.


🖥️ UI & UX Polish

  • Fixed navbar ordering and login icon — clearer navigation and better visual flow.
  • Prevented tool call overflow in chat view — no more clipped or misaligned content.
  • Uniformed link paths (e.g., /browse/ instead of browse) for consistency.
  • Fixed model selection toggle — header updates correctly when switching models.
  • Consistent button styling — uniform colors, hover effects, and accessibility.

📦 Backward Compatibility & Architecture

  • Dropped x86_64 Mac support: no longer maintained in GitHub Actions; ARM64 (M1/M2/M3/M4) is now the recommended architecture.
  • Updated data storage path from /usr/share to /var/lib: follows Linux conventions for mutable data.
  • Added CUDA 13 support: now available in Docker images and L4T builds.
  • New VibeVoice TTS backend: real-time text-to-speech with voice cloning support. You can install it from the model gallery!
  • StableDiffusion-GGML now supports LoRA: expand your image-generation capabilities.

🛠️ Fixes & Improvements

  • Fixed: after v3.8.0, the /readyz and /healthz endpoints required authentication, breaking Docker health checks and monitoring tools.
  • Fixed: crashes when importing models from Hugging Face URLs with subfolders (e.g., huggingface://user/model/GGUF/model.gguf).

🚀 The Complete Local Stack for Privacy-First AI

LocalAI Logo

LocalAI

The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required.

Link: https://github.com/mudler/LocalAI

LocalAGI Logo

LocalAGI

Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI.

Link: https://github.com/mudler/LocalAGI

LocalRecall Logo

LocalRecall

RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI.

Link: https://github.com/mudler/LocalRecall


❤️ Thank You

LocalAI is a true FOSS movement — built by contributors, powered by community.

If you believe in privacy-first AI:

  • Star the repo
  • 💬 Contribute code, docs, or feedback
  • 📣 Share with others

Your support keeps this stack alive.


✅ Full Changelog

📋 Click to expand full changelog

What's Changed

Breaking Changes 🛠

  • chore: switch from /usr/share to /var/lib for data storage by @poretsky in #7361
  • chore: drop drawin-x86_64 support by @mudler in #7616

Bug fixes 🐛

  • fix: do not require auth for readyz/healthz endpoints by @mudler in #7403
  • fix(ui): navbar ordering and login icon by @mudler in #7407
  • fix: configure sbsa packages for arm64 by @mudler in #7413
  • fix(ui): prevent box overflow in chat view by @mudler in #7430
  • fix(ui): Update few links in web UI from 'browse' to '/browse/' by @rampa3 in #7445
  • fix(paths): remove trailing slash from requests by @mudler in #7451
  • fix(downloader): do not download model files if not necessary by @mudler in #7492
  • fix(config): make syncKnownUsecasesFromString idempotent by @mudler in #7493
  • fix: make sure to close on errors by @mudler in #7521
  • fix(llama.cpp): handle corner cases with tool array content by @mudler in #7528
  • fix(7355): Update llama-cpp grpc for v3 interface by @sredman in #7566
  • fix(chat-ui): model selection toggle and new chat by @mudler in #7574
  • fix: improve ram estimation by @mudler in #7603
  • fix(ram): do not read from cgroup by @mudler in #7606
  • fix: correctly propagate error during model load by @mudler in #7610
  • fix(ci): remove specific version for grpcio packages by @mudler in #7627
  • fix(uri): consider subfolders when expanding huggingface URLs by @mintyleaf in #7634


v3.8.0

26 Nov 20:22
c0d1d02





Welcome to LocalAI 3.8.0!

LocalAI 3.8.0 focuses on smoothing out the user experience and exposing more power to the user without requiring restarts or complex configuration files. This release introduces a new onboarding flow and a universal model loader that handles everything from HF URLs to local files.

We’ve also improved the chat interface, addressed long-standing requests regarding OpenAI API compatibility (specifically SSE streaming standards) and exposed more granular controls for some backends (llama.cpp) and backend management.

📌 TL;DR

  • Universal Model Import: Import directly from Hugging Face, Ollama, OCI, or local paths. Auto-detects backends and handles chat templates.
  • UI & Index Overhaul: New onboarding wizard, auto model selection on boot, and a cleaner tabular view for model management.
  • MCP Live Streaming: Agent actions and tool calls are now streamed live via the Model Context Protocol, so you can watch reasoning in real time.
  • Hot-Reloadable Settings: Modify watchdogs, API keys, P2P settings, and defaults without restarting the container.
  • Chat Enhancements: Chat history and parallel conversations are now persisted in local storage.
  • Strict SSE Compliance: Fixed the streaming format to exactly match OpenAI specs (resolves issues with LangChain/JS clients).
  • Advanced Config: Fine-tune context_shift, cache_ram, and parallel workers via YAML options.
  • Logprobs & Logitbias: Added token-level probability support for improved agent/eval workflows.

Feature Breakdown

🚀 Universal Model Import (URL-based)

We have refactored how models are imported. You no longer need to manually write configuration files for common use cases: the new importer accepts URLs from Hugging Face, Ollama, and OCI registries, as well as local file paths, directly from the web interface.

(Video: import.mp4)
  • Auto-Detection: The system attempts to identify the correct backend (e.g., llama.cpp vs diffusers) and applies native chat templates (e.g., llama-3, mistral) automatically by reading the model metadata.
  • Customization during Import: You can override defaults immediately, for example, forcing a specific quantization on a GGUF file or selecting vLLM over transformers.
  • Multimodal Support: Vision components (mmproj) are detected and configured automatically.
  • File Safety: We added a safeguard to prevent the deletion of model files (blobs) if they are shared by multiple model configurations.

🎨 Complete UI Overhaul

The web interface has been redesigned for better usability and clearer management.

(Video: index.mp4)
  • Onboarding Wizard: A guided flow helps first-time users import or install a model in under 30 seconds.
  • Auto-Focus & Selection: The input field captures focus automatically, and a default model is loaded on startup so you don't start in a "no model selected" state.
  • Tabular Management: Models and backends are now organized in a cleaner list view, making it easier to see what is installed.
(Video: manage.mp4)

🤖 Agentic Ecosystem & MCP Live Streaming

LocalAI 3.8.0 significantly upgrades support for agentic workflows using the Model Context Protocol (MCP).

  • Live Action Streaming: We have added a new endpoint to stream agent results as they happen. Instead of waiting for the final output, you can now watch the agent "think": seeing tool calls, reasoning steps, and intermediate actions streamed live in the UI.
(Video: mcp.mp4)

Configuring MCP via the interface is now simplified:

(Video: mcp_configuration.mp4)

🔁 Runtime System Settings

A new Settings > System panel exposes configuration options that previously required environment variables or a restart.

(Video: settings.mp4)
  • Immediate Effect: Toggling Watchdogs, P2P, and Gallery availability applies instantly.
  • API Key Management: You can now generate, rotate, and expire API keys via the UI.
  • Network: CORS and CSRF settings are now accessible here (note: these specific network settings still require a restart to take effect).

Note: to persist runtime settings on older LocalAI deployments, you need to mount the /configuration directory from the container image.


⚙️ Advanced llama.cpp Configuration

For power users running large context windows or high-throughput setups, we've exposed additional underlying llama.cpp options in the YAML config. You can now tune context shifting, RAM limits for the KV cache, and parallel worker slots.

options:
- context_shift:false
- cache_ram:-1
- use_jinja:true
- parallel:2
- grpc_servers:localhost:50051,localhost:50052

📊 Logprobs & Logitbias Support

This release adds full support for logitbias and logprobs. This is critical for advanced agentic logic, Self-RAG, and evaluating model confidence / hallucination rates. It supports the OpenAI specification.
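A minimal sketch following the OpenAI spec (the model name and token ID are placeholders):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Pick a color"}],
    "logprobs": true,
    "top_logprobs": 3,
    "logit_bias": {"15043": -100}
  }'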


🛠️ Fixes & Improvements

OpenAI Compatibility:

  • SSE Streaming: Fixed a critical issue where streaming responses were slightly non-compliant (e.g., sending empty content chunks or missing finish_reason). This resolves integration issues with openai-node, LangChain, and LlamaIndex.
  • Top_N Behavior: In the reranker, top_n can now be omitted or set to 0 to return all results, rather than defaulting to an arbitrary limit.

General Fixes:

  • Model Preview: When downloading, you can now see the actual filename and size before committing to the download.
  • Tool Handling: Fixed crashes when tool content is missing or malformed.
  • TTS: Fixed dropdown selection states for TTS models.
  • Browser Storage: Chat history is now persisted in your browser's local storage. You can switch between parallel chats, rename them, and export them to JSON.
  • True Cancellation: Clicking "Stop" during a stream now correctly propagates a cancellation context to the backend (works for llama.cpp, vLLM, transformers, and diffusers). This immediately stops generation and frees up resources.

🚀 The Complete Local Stack for Privacy-First AI

LocalAI Logo

LocalAI

The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required.

Link: https://github.com/mudler/LocalAI

LocalAGI Logo

LocalAGI

Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI.

Link: https://github.com/mudler/LocalAGI

LocalRecall Logo

LocalRecall

RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI.

Link: https://github.com/mudler/LocalRecall


❤️ Thank You

Over 35,000 stars and growing. LocalAI is a true FOSS movement — built by contributors, powered by community.

If you believe in privacy-first AI:

  • Star the repo
  • 💬 Contribute code, docs, or feedback
  • 📣 Share with others

Your support keeps this stack alive.


v3.7.0

31 Oct 21:34
9ecfdc5





Welcome to LocalAI 3.7.0 👋

This release introduces Agentic MCP support with full WebUI integration, a brand-new neutts TTS backend, fuzzy model search, long-form TTS chunking for chatterbox, and a complete WebUI overhaul.

We’ve also fixed critical bugs, improved stability, and enhanced compatibility with OpenAI’s APIs.


📌 TL;DR – What’s New in LocalAI 3.7.0

  • 🤖 Agentic MCP Support (WebUI-enabled): Build AI agents that use real tools (web search, code execution). Fully OpenAI-compatible and integrated into the WebUI.
  • 🎙️ neutts TTS Backend (Neuphonic-powered): Generate natural, high-quality speech with low-latency audio — ideal for voice assistants.
  • 🖼️ WebUI Enhancements: Faster, cleaner UI with real-time updates and full YAML model control.
  • 💬 Long-Text TTS Chunking (Chatterbox): Generate natural-sounding long-form audio by intelligently splitting text and preserving context.
  • 🧩 Advanced Agent Controls: Fine-tune agent behavior with new options for retries, reasoning, and re-evaluation.
  • 📸 New Video Creation Endpoint: We now support the OpenAI-compatible /v1/videos endpoint for text-to-video generation.
  • 🐍 Enhanced Whisper Compatibility: whisper.cpp now ships builds for various CPU variants (AVX, AVX2, etc.) to prevent illegal-instruction crashes.
  • 🔍 Fuzzy Gallery Search: Find models in the gallery even with typos (e.g., gema finds gemma).
  • 📦 Easier Model & Backend Management: Import, edit, and delete models directly via clean YAML in the WebUI.
  • ▶️ Realtime Example: Check out the new realtime voice assistant example (multilingual).
  • ⚠️ Security, Stability & API Compliance: Fixed critical crashes, deadlocks, session events, OpenAI compliance, and JSON-schema panics.
  • 🧠 Qwen 3 VL: Support for Qwen 3 VL with llama.cpp/GGUF models.

🔥 What’s New in Detail

🤖 Agentic MCP Support – Build Intelligent, Tool-Using AI Agents

We're proud to announce full Agentic MCP support: a feature for building AI agents that can reason, plan, and execute actions using external tools like web search, code execution, and data retrieval. You can use the standard chat/completions endpoint, but powered by an agent in the background.

Full documentation is available here

Now in WebUI: A dedicated toggle appears in the chat interface when a model supports MCP. Just click to enable agent mode.

✨ Key Features:

  • New Endpoint: POST /mcp/v1/chat/completions (OpenAI-compatible).
  • Flexible Tool Configuration:
    mcp:
      stdio: |
        {
          "mcpServers": {
            "duckduckgo": {
              "command": "docker",
              "args": ["run", "-i", "--rm", "ghcr.io/mudler/mcps/duckduckgo:master"]
            }
          }
        }
  • Advanced Agent Control via agent config:
    agent:
      max_attempts: 3
      max_iterations: 5
      enable_reasoning: true
      enable_re_evaluation: true
    • max_attempts: Retry failed tool calls up to N times.
    • max_iterations: Limit how many times the agent can loop through reasoning.
    • enable_reasoning: Allow step-by-step thought processes (e.g., chain-of-thought).
    • enable_re_evaluation: Re-analyze decisions when tool results are ambiguous.

You can find some plug-n-play MCPs here: https://github.com/mudler/MCPs
Under the hood, MCP functionality is powered by https://github.com/mudler/cogito
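A minimal sketch of calling the agentic endpoint (the model name is hypothetical and must reference a model whose config includes an mcp block like the one above):

curl http://localhost:8080/mcp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-agent-model",
    "messages": [{"role": "user", "content": "Find the latest LocalAI release and summarize it"}]
  }'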

🖼️ WebUI enhancements

WebUI had a major overhaul:

  • The chat view now has an MCP toggle for models that have mcp settings enabled in the model config file.
  • The model editor mask has been simplified to show/edit the YAML settings of the model.
  • More reactive: dropped HTMX in favor of Alpine.js and vanilla JavaScript.
  • Various fixes, including deletion of models.

🎙️ Introducing neutts TTS Backend – Natural Speech, Low Latency

Say hello to neutts, a new lightweight TTS backend powered by Neuphonic, delivering high-quality, natural-sounding speech with minimal overhead.

🎛️ Setup Example

name: neutts-english
backend: neutts
parameters:
  model: neuphonic/neutts-air
tts:
  audio_path: "./output.wav"
  streaming: true
options:
  # text transcription of the provided audio file
  - ref_text: "So I'm live on radio..."
known_usecases:
  - tts

🐍 Whisper.cpp enhancements

whisper.cpp CPU variants are now available for:

  • avx
  • avx2
  • avx512
  • fallback (no optimized instructions available)

These variants are optimized for specific instruction sets and reduce crashes on older or non-AVX CPUs.

🔍 Smarter Gallery Search: Fuzzy & Case-Insensitive Matching

Searching for gemma now finds gemma-3, gemma2, etc. — even with typos like gemaa or gema.

🧩 Improved Tool & Schema Handling – No More Crashes

We’ve fixed multiple edge cases that caused crashes or silent failures in tool usage.

✅ Fixes:

  • Nullable JSON Schemas: "type": ["string", "null"] now works without panics.
  • Empty Parameters: Tools with missing or empty parameters now handled gracefully.
  • Strict Mode Enforcement: When strict_mode: true, the model must pick a tool — no more skipping.
  • Multi-Type Arrays: Safe handling of ["string", "null"] in function definitions.

🔄 Interaction with Grammar Triggers: strict_mode and grammar rules work together — if a tool is required and the function definition is invalid, the server returns a clear JSON error instead of crashing.

📸 New Video Creation Endpoint: OpenAI-Compatible

LocalAI now supports OpenAI’s /v1/videos endpoint for generating videos from text prompts.

📌 Usage Example:

curl http://localhost:8080/v1/videos \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-..." \
  -d '{
    "model": "sora",
    "prompt": "A cat walking through a forest at sunset",
    "size": "1024x576"
  }'

🧠 Qwen 3 VL in llama.cpp

Support has been added for Qwen 3 VL in llama.cpp, and we have updated llama.cpp to the latest version. As a reminder, Qwen 3 VL and multimodal models are also compatible with our vLLM and MLX backends. Qwen 3 VL models are already available in the model gallery:

  • qwen3-vl-30b-a3b-instruct
  • qwen3-vl-30b-a3b-thinking
  • qwen3-vl-4b-instruct
  • qwen3-vl-32b-instruct
  • qwen3-vl-4b-thinking
  • qwen3-vl-2b-thinking
  • qwen3-vl-2b-instruct

Note: upgrading the llama.cpp backend is necessary if you already have a LocalAI installation.

🚀 (CI) Gallery Updater Agent: Auto-Detect & Suggest New Models

We’ve added an autonomous CI agent that scans Hugging Face daily for new models and opens PRs to update the gallery.

✨ How It Works:

  1. Scans HF for new, trending models.
  2. Extracts base model, quantization, and metadata.
  3. Uses cogito (our agentic framework) to assign the model to the correct family and to obtain the model information.
  4. Opens a PR with:
    • Suggested name, family, and usecases
    • Link to HF model
    • YAML snippet for import

🔧 Critical Bug Fixes & Stability Improvements

  • 📌 WebUI crash on model load: fixed the "can't evaluate field Name in type string" error; models now render even without config files.
  • 🔁 Deadlock in model load/idle checks: guarded against race conditions during model loading; improved performance under load.
  • 📞 Realtime API compliance: added the session.created event and removed the redundant conversation.created; works with VoxInput, OpenAI clients, and more.
  • 📥 MCP response formatting: output is now wrapped in a message field, matching the OpenAI spec for better client compatibility.
  • 🛑 JSON error responses: errors now return clean JSON instead of HTML, so scripts and libraries no longer break on auth failures.
  • 🔄 Session registration: fixed initial MCP calls failing due to cache issues for reliable first-time use.
  • 🎧 kokoro TTS: now returns the full audio instead of partial output, better for long-form TTS.

🚀 The Complete Local Stack for Privacy-First AI

LocalAI Logo

LocalAI

The free, Open Source OpenAI alternative. Acts as a drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required.

Link: https://github.com/mudler/LocalAI

LocalAGI Logo

LocalAGI

A powerful Local AI agent management platform. Serves as a drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI.

Link: https://github.com/mudler/LocalAGI

LocalRecall Logo

LocalRecall

RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI.

Link: https://github.com/mudler/LocalRecall

v3.6.0

03 Oct 13:08
8fb9568


What's Changed

Bug fixes 🐛

  • fix: reranking models limited to 512 tokens in llama.cpp backend by @jongames in #6344

Exciting New Features 🎉

  • feat(kokoro): add support for l4t devices by @mudler in #6322
  • feat(chatterbox): support multilingual by @mudler in #6240

🧠 Models

  • chore(model gallery): add qwen-image-edit-2509 by @mudler in #6336
  • chore(models): add whisper-turbo via whisper.cpp by @mudler in #6340
  • chore(model gallery): add ibm-granite_granite-4.0-h-small by @mudler in #6373
  • chore(model gallery): add ibm-granite_granite-4.0-h-tiny by @mudler in #6374
  • chore(model gallery): add ibm-granite_granite-4.0-h-micro by @mudler in #6375
  • chore(model gallery): add ibm-granite_granite-4.0-micro by @mudler in #6376

👒 Dependencies

  • chore(deps): bump grpcio from 1.74.0 to 1.75.0 in /backend/python/transformers by @dependabot[bot] in #6332
  • chore(deps): bump securego/gosec from 2.22.8 to 2.22.9 by @dependabot[bot] in #6324
  • chore(deps): bump llama.cpp to '72b24d96c6888c609d562779a23787304ae4609c' by @mudler in #6349
  • chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/coqui by @dependabot[bot] in #6353
  • chore(deps): bump transformers from 4.48.3 to 4.56.2 in /backend/python/coqui by @dependabot[bot] in #6330
  • chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/diffusers by @dependabot[bot] in #6361
  • chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/rerankers by @dependabot[bot] in #6360
  • chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/common/template by @dependabot[bot] in #6358
  • chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/vllm by @dependabot[bot] in #6357
  • chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/bark by @dependabot[bot] in #6359
  • chore(deps): bump grpcio from 1.75.0 to 1.75.1 in /backend/python/transformers by @dependabot[bot] in #6362
  • chore(deps): bump grpcio from 1.74.0 to 1.75.1 in /backend/python/exllama2 by @dependabot[bot] in #6356

Other Changes

  • chore: ⬆️ Update ggml-org/llama.cpp to 7f766929ca8e8e01dcceb1c526ee584f7e5e1408 by @localai-bot in #6319
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6318
  • chore: ⬆️ Update ggml-org/llama.cpp to da30ab5f8696cabb2d4620cdc0aa41a298c54fd6 by @localai-bot in #6321
  • chore: ⬆️ Update ggml-org/llama.cpp to 1d0125bcf1cbd7195ad0faf826a20bc7cec7d3f4 by @localai-bot in #6335
  • chore(cudss): add cudds to l4t images by @mudler in #6338
  • chore: ⬆️ Update ggml-org/llama.cpp to 4ae88d07d026e66b41e85afece74e88af54f4e66 by @localai-bot in #6339
  • CI: disable build-testing on PRs against arm64 by @mudler in #6341
  • chore(deps): bump llama.cpp to '835b2b915c52bcabcd688d025eacff9a07b65f52' by @mudler in #6347
  • chore: ⬆️ Update ggml-org/llama.cpp to 4807e8f96a61b2adccebd5e57444c94d18de7264 by @localai-bot in #6350
  • chore: ⬆️ Update ggml-org/llama.cpp to bd0af02fc96c2057726f33c0f0daf7bb8f3e462a by @localai-bot in #6352
  • Revert "chore(deps): bump transformers from 4.48.3 to 4.56.2 in /backend/python/coqui" by @mudler in #6363
  • chore: ⬆️ Update ggml-org/whisper.cpp to 32be14f8ebfc0498c2c619182f0d7f4c822d52c4 by @localai-bot in #6354
  • chore: ⬆️ Update ggml-org/llama.cpp to 5f7e166cbf7b9ca928c7fad990098ef32358ac75 by @localai-bot in #6355
  • chore: ⬆️ Update ggml-org/llama.cpp to b2ba81dbe07b6dbea9c96b13346c66973dede32c by @localai-bot in #6366
  • chore: ⬆️ Update ggml-org/whisper.cpp to 8c0855fd6bb115e113c0dca6255ea05f774d35f7 by @localai-bot in #6365
  • chore: ⬆️ Update ggml-org/whisper.cpp to 7849aff7a2e1f4234aa31b01a1870906d5431959 by @localai-bot in #6367
  • chore: ⬆️ Update ggml-org/llama.cpp to 1fe4e38cc20af058ed320bd46cac934991190056 by @localai-bot in #6368
  • chore: ⬆️ Update ggml-org/llama.cpp to d64c8104f090b27b1f99e8da5995ffcfa6b726e2 by @localai-bot in #6371

Full Changelog: v3.5.4...v3.6.0

v3.5.4

20 Sep 07:49
f7f26b8


What's Changed

Bug fixes 🐛

  • fix(python): make option check uniform across backends by @mudler in #6314

Other Changes

  • chore: ⬆️ Update ggml-org/whisper.cpp to 44fa2f647cf2a6953493b21ab83b50d5f5dbc483 by @localai-bot in #6317
  • chore: ⬆️ Update ggml-org/llama.cpp to f432d8d83e7407073634c5e4fd81a3d23a10827f by @localai-bot in #6316
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6315

Full Changelog: v3.5.3...v3.5.4

v3.5.3

19 Sep 17:10
c27da0a


What's Changed

🧠 Models

  • chore(model gallery): add mistralai_magistral-small-2509 by @mudler in #6309
  • chore(model gallery): add impish_qwen_14b-1m by @mudler in #6310
  • chore(model gallery): add aquif-3.5-a4b-think by @mudler in #6311

👒 Dependencies

  • chore: ⬆️ Update ggml-org/llama.cpp to 3edd87cd055a45d885fa914d879d36d33ecfc3e1 by @localai-bot in #6308

Full Changelog: v3.5.2...v3.5.3

v3.5.2

18 Sep 07:37
902e47f


What's Changed

👒 Dependencies

  • Revert "feat(nvidia-gpu): bump images to cuda 12.8" by @mudler in #6303

Other Changes

  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #6305
  • chore: ⬆️ Update ggml-org/llama.cpp to 0320ac5264279d74f8ee91bafa6c90e9ab9bbb91 by @localai-bot in #6306

Full Changelog: v3.5.1...v3.5.2