feat: add Ollama as runtime backend for Metal agent #258
Merged
Add Ollama as a third runtime option for the Metal agent alongside
llama-server and oMLX. Ollama is the most widely adopted local LLM
runtime (200K+ GitHub stars) and recently switched to an MLX backend
for Apple Silicon, providing significant speedups.
The OllamaExecutor manages models through Ollama's REST API:
- Model pull via POST /api/pull (handles downloads internally)
- Pre-load via POST /api/generate with empty prompt
- Readiness check via GET /api/ps
- Unload via POST /api/generate with keep_alive: 0
- Health via GET / ("Ollama is running")
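The request payloads behind those lifecycle steps are plain JSON. A minimal sketch in Go (the payload shapes follow Ollama's documented REST API; the helper names are illustrative, not actual LLMKube code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pullBody builds the POST /api/pull payload; Ollama handles the
// model download internally once it receives this request.
func pullBody(tag string) []byte {
	b, _ := json.Marshal(map[string]any{"model": tag})
	return b
}

// preloadBody builds a POST /api/generate payload with an empty
// prompt, which blocks until the model is loaded into memory.
func preloadBody(tag string) []byte {
	b, _ := json.Marshal(map[string]any{"model": tag, "prompt": ""})
	return b
}

// unloadBody is the same call with keep_alive: 0, telling Ollama
// to evict the model immediately.
func unloadBody(tag string) []byte {
	b, _ := json.Marshal(map[string]any{"model": tag, "prompt": "", "keep_alive": 0})
	return b
}

func main() {
	fmt.Println(string(pullBody("llama3.2:3b")))    // download
	fmt.Println(string(preloadBody("llama3.2:3b"))) // pre-load
	fmt.Println(string(unloadBody("llama3.2:3b")))  // unload
}
```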
Includes model name mapping from LLMKube catalog names to Ollama
tags (e.g., llama-3.2-3b -> llama3.2:3b) for all 16 catalog models.
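That mapping amounts to a simple lookup table. A sketch in Go showing only the one pairing given above (the fallback behavior and everything beyond that entry are assumptions for illustration, not the actual 16-model table):

```go
package main

import "fmt"

// ollamaTags maps LLMKube catalog names to Ollama model tags.
// Only the pairing cited in this PR is shown here.
var ollamaTags = map[string]string{
	"llama-3.2-3b": "llama3.2:3b",
}

// ollamaTag resolves a catalog name to an Ollama tag, falling back
// to the name itself for unknown models (an assumption of this
// sketch, so unmapped names can still be passed through).
func ollamaTag(catalog string) string {
	if tag, ok := ollamaTags[catalog]; ok {
		return tag
	}
	return catalog
}

func main() {
	fmt.Println(ollamaTag("llama-3.2-3b")) // llama3.2:3b
}
```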
No CRD changes needed since Ollama uses GGUF format natively.
Usage:
```
llmkube-metal-agent --runtime ollama
llmkube deploy llama-3.2-3b --gpu --accelerator metal
```
Signed-off-by: Christopher Maher <chris@mahercode.io>
Documents the `--runtime ollama` flag, prerequisites, usage, model name mapping, and differences from the llama-server and oMLX backends.
Signed-off-by: Christopher Maher <chris@mahercode.io>
Summary
Adds Ollama as a third runtime option for the Metal agent (`--runtime ollama`) alongside the existing llama-server and omlx backends.

Ollama is the most widely adopted local LLM runtime (200K+ stars) and recently switched to MLX for Apple Silicon inference. Most Mac users already have it installed, making this the lowest-friction path to fast local inference with LLMKube.
Key advantages over the other backends:
- Built-in model management and downloads (`/api/pull`)

How it works:
- `OllamaExecutor` manages models through Ollama's REST API
- Pull: `POST /api/pull` with a 5-min timeout for large downloads
- Pre-load: `POST /api/generate` with an empty prompt (blocks until loaded)
- Readiness: `GET /api/ps` verifies the model is in memory
- Unload: `POST /api/generate` with `keep_alive: 0`
- Maps catalog names to Ollama tags (e.g., `llama-3.2-3b` -> `llama3.2:3b`)

Usage: `llmkube-metal-agent --runtime ollama`, then `llmkube deploy llama-3.2-3b --gpu --accelerator metal`.
No CRD changes. Default runtime remains `llama-server`. Fully backward compatible.

Test plan
- `make test` passes
- `--runtime ollama`: deploy a model, verify inference
- Verify unload (`/api/ps` empty)
- `--runtime llama-server` unchanged
- `--runtime omlx` still works

Runtime comparison
(Ollama is the backend with built-in model management via `/api/pull`.)

Builds on #257 (ProcessExecutor interface). Closes #248 (MLX backend goal achieved through both oMLX and Ollama paths).
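The unload verification in the test plan above can be sketched as a check on the `GET /api/ps` response: the endpoint lists the models currently loaded in memory, so an empty list means the model was evicted. A small Go illustration (response shape per Ollama's API; the helper is a sketch, not agent code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// psResponse mirrors the relevant part of GET /api/ps, which
// reports the models currently loaded in memory.
type psResponse struct {
	Models []struct {
		Name string `json:"name"`
	} `json:"models"`
}

// isUnloaded reports whether a /api/ps response body shows no
// loaded models.
func isUnloaded(body []byte) (bool, error) {
	var ps psResponse
	if err := json.Unmarshal(body, &ps); err != nil {
		return false, err
	}
	return len(ps.Models) == 0, nil
}

func main() {
	empty, _ := isUnloaded([]byte(`{"models":[]}`))
	loaded, _ := isUnloaded([]byte(`{"models":[{"name":"llama3.2:3b"}]}`))
	fmt.Println(empty, loaded) // true false
}
```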