
feat: add Ollama as runtime backend for Metal agent#258

Merged
Defilan merged 3 commits into main from feat/ollama-backend on Apr 1, 2026
Conversation


@Defilan Defilan commented Apr 1, 2026

Summary

Adds Ollama as a third runtime option for the Metal agent (--runtime ollama), alongside the existing llama-server and oMLX backends.

Ollama is the most widely adopted local LLM runtime (200K+ stars) and recently switched to MLX for Apple Silicon inference. Most Mac users already have it installed, making this the lowest-friction path to fast local inference with LLMKube.

Key advantages over the other backends:

  • No model format changes needed (Ollama uses GGUF internally)
  • No binary path management (Ollama manages itself)
  • Model downloads handled by Ollama (/api/pull)
  • Most users already have Ollama installed

How it works:

  • OllamaExecutor manages models through Ollama's REST API
  • Model pull: POST /api/pull with 5-min timeout for large downloads
  • Pre-load: POST /api/generate with empty prompt (blocks until loaded)
  • Readiness: GET /api/ps verifies model is in memory
  • Unload: POST /api/generate with keep_alive: 0
  • Includes model name mapping for all 16 catalog models (e.g., llama-3.2-3b -> llama3.2:3b)
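The lifecycle above can be sketched in Go. The PR doesn't show OllamaExecutor's actual fields or method signatures, so the names below are illustrative; the endpoints and JSON bodies follow the description (and Ollama's documented REST API), and only two of the 16 catalog mappings are shown:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// mapModel translates an LLMKube catalog name to an Ollama tag.
// Only two of the 16 mappings are shown here.
func mapModel(catalog string) string {
	m := map[string]string{
		"llama-3.2-3b": "llama3.2:3b",
		"llama-3.2-1b": "llama3.2:1b",
	}
	return m[catalog]
}

// ollamaExecutor is a hypothetical sketch of the executor in this PR.
type ollamaExecutor struct {
	baseURL string
	client  *http.Client
}

// post sends a JSON body to an Ollama endpoint and checks the status.
func (e *ollamaExecutor) post(path, body string) error {
	resp, err := e.client.Post(e.baseURL+path, "application/json",
		bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s: unexpected status %d", path, resp.StatusCode)
	}
	return nil
}

// Pull downloads the model; the PR uses a 5-minute timeout for large pulls.
func (e *ollamaExecutor) Pull(tag string) error {
	return e.post("/api/pull", fmt.Sprintf(`{"name":%q,"stream":false}`, tag))
}

// Preload blocks until the model is resident by generating with an empty prompt.
func (e *ollamaExecutor) Preload(tag string) error {
	return e.post("/api/generate", fmt.Sprintf(`{"model":%q,"prompt":""}`, tag))
}

// Unload evicts the model immediately via keep_alive: 0.
func (e *ollamaExecutor) Unload(tag string) error {
	return e.post("/api/generate", fmt.Sprintf(`{"model":%q,"keep_alive":0}`, tag))
}

func main() {
	// Ollama's default listen address; the timeout mirrors the PR's pull timeout.
	_ = &ollamaExecutor{
		baseURL: "http://127.0.0.1:11434",
		client:  &http.Client{Timeout: 5 * time.Minute},
	}
	fmt.Println(mapModel("llama-3.2-3b")) // llama3.2:3b
}
```

Keeping the HTTP plumbing in one `post` helper keeps the per-endpoint methods down to the JSON body each call needs.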

Usage:

# Ensure Ollama is running
ollama serve

# Start Metal agent with Ollama runtime
llmkube-metal-agent --runtime ollama

# Deploy as usual
llmkube deploy llama-3.2-3b --gpu --accelerator metal

No CRD changes. Default runtime remains llama-server. Fully backward compatible.

Test plan

  • make test passes
  • Start agent with --runtime ollama, deploy model, verify inference
  • Verify model pull works for a model not yet downloaded
  • Delete InferenceService, verify model unloads (/api/ps empty)
  • Default --runtime llama-server unchanged
  • --runtime omlx still works
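The "/api/ps empty" check in the test plan can be sketched as a small Go helper that decodes the response and looks for the model by tag. The struct follows the shape of Ollama's documented GET /api/ps reply (only the fields needed here); the helper name is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// psResponse mirrors the part of Ollama's GET /api/ps reply used here.
type psResponse struct {
	Models []struct {
		Name string `json:"name"`
	} `json:"models"`
}

// isLoaded reports whether tag appears in a /api/ps response body.
func isLoaded(body []byte, tag string) (bool, error) {
	var ps psResponse
	if err := json.Unmarshal(body, &ps); err != nil {
		return false, err
	}
	for _, m := range ps.Models {
		if m.Name == tag {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Sample payloads standing in for live agent responses.
	loaded := []byte(`{"models":[{"name":"llama3.2:3b"}]}`)
	empty := []byte(`{"models":[]}`)

	ok, _ := isLoaded(loaded, "llama3.2:3b")
	gone, _ := isLoaded(empty, "llama3.2:3b")
	fmt.Println(ok, gone) // true false
}
```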

Runtime comparison

|                | llama-server            | oMLX             | Ollama                  |
|----------------|-------------------------|------------------|-------------------------|
| Model format   | GGUF                    | MLX              | GGUF                    |
| Model download | Manual / init container | Manual           | Automatic (`/api/pull`) |
| Install base   | llama.cpp users         | Small            | 200K+ users             |
| CRD changes    | None                    | MLX format added | None                    |
| Process model  | One per model           | One daemon       | One daemon              |

Builds on #257 (ProcessExecutor interface). Closes #248 (MLX backend goal achieved through both oMLX and Ollama paths).

Defilan added 3 commits April 1, 2026 08:51
Add Ollama as a third runtime option for the Metal agent alongside
llama-server and oMLX. Ollama is the most widely adopted local LLM
runtime (200K+ GitHub stars) and recently switched to MLX backend
for Apple Silicon, providing significant speedups.

The OllamaExecutor manages models through Ollama's REST API:
- Model pull via POST /api/pull (handles downloads internally)
- Pre-load via POST /api/generate with empty prompt
- Readiness check via GET /api/ps
- Unload via POST /api/generate with keep_alive: 0
- Health via GET / ("Ollama is running")

Includes model name mapping from LLMKube catalog names to Ollama
tags (e.g., llama-3.2-3b -> llama3.2:3b) for all 16 catalog models.

No CRD changes needed since Ollama uses GGUF format natively.

Usage:
  llmkube-metal-agent --runtime ollama
  llmkube deploy llama-3.2-3b --gpu --accelerator metal

Signed-off-by: Christopher Maher <chris@mahercode.io>
Signed-off-by: Christopher Maher <chris@mahercode.io>
Documents the --runtime ollama flag, prerequisites, usage, model
name mapping, and differences from llama-server and oMLX backends.

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan Defilan merged commit 6148b89 into main Apr 1, 2026
16 checks passed
@Defilan Defilan deleted the feat/ollama-backend branch April 1, 2026 16:45
@github-actions github-actions bot mentioned this pull request Apr 1, 2026

Development

Successfully merging this pull request may close these issues.

feat: MLX backend support for Metal agent
