This directory contains the macOS launchd configuration for the LLMKube Metal Agent, which enables Metal GPU acceleration for local Kubernetes LLM deployments.
- macOS with Apple Silicon (M1/M2/M3/M4) or Intel Mac with Metal 2+ support
- Access to a Kubernetes cluster — either a remote cluster (recommended) or local minikube
- llama.cpp with Metal support:

  ```bash
  brew install llama.cpp
  ```

- LLMKube operator installed in your cluster:

  ```bash
  kubectl apply -f https://github.com/defilantech/llmkube/releases/latest/download/install.yaml
  ```
- `--host-ip` flag (required when using a remote cluster): the Metal Agent must be started with `--host-ip <your-mac-ip>` so that Kubernetes endpoints point to the Mac's reachable IP address instead of `localhost`
```bash
# Build and install Metal agent
make install-metal-agent
```

This will:

- Build the Metal agent binary
- Install it to `/usr/local/bin/llmkube-metal-agent`
- Install the launchd service
- Start the service automatically
```bash
# Build the agent
make build-metal-agent

# Copy to /usr/local/bin
sudo cp bin/llmkube-metal-agent /usr/local/bin/

# Install launchd plist
mkdir -p ~/Library/LaunchAgents
cp deployment/macos/com.llmkube.metal-agent.plist ~/Library/LaunchAgents/

# Load the service
launchctl load ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
```

Once installed, the Metal agent runs automatically in the background and watches for InferenceService resources in your Kubernetes cluster.
```bash
# Deploy from catalog
llmkube deploy llama-3.1-8b --accelerator metal

# Or deploy a custom model
llmkube deploy my-model --accelerator metal \
  --source https://huggingface.co/.../model.gguf
```

```bash
# Check if the agent is running
launchctl list | grep llmkube

# View agent logs
tail -f /tmp/llmkube-metal-agent.log

# Check running processes
ps aux | grep llmkube-metal-agent

# Health check (liveness)
curl http://localhost:9090/healthz

# Readiness check (at least one process healthy, or no processes yet)
curl http://localhost:9090/readyz
```

```bash
# Check Metal support
system_profiler SPDisplaysDataType | grep Metal

# Monitor GPU usage while inference is running
sudo powermetrics --samplers gpu_power -i 1000
```

The launchd plist can be customized by editing `com.llmkube.metal-agent.plist`:
```xml
<key>ProgramArguments</key>
<array>
  <string>/usr/local/bin/llmkube-metal-agent</string>
  <string>--namespace</string>
  <string>default</string>                     <!-- Kubernetes namespace to watch -->
  <string>--model-store</string>
  <string>/tmp/llmkube-models</string>         <!-- Where to store downloaded models -->
  <string>--llama-server</string>
  <string>/usr/local/bin/llama-server</string> <!-- Path to llama-server binary -->
  <string>--port</string>
  <string>9090</string>                        <!-- Agent metrics port -->
</array>
```

When your Kubernetes cluster runs on a different machine (Linux server, cloud, etc.), the Metal Agent needs to register the Mac's reachable IP address so that pods in the cluster can route traffic to llama-server:
```bash
# Find your Mac's IP on the local network
ipconfig getifaddr en0

# Start the agent with --host-ip
llmkube-metal-agent --host-ip 192.168.1.50

# Or with a Tailscale / WireGuard address
llmkube-metal-agent --host-ip 100.64.0.10
```

Without `--host-ip`, the agent registers `localhost` as the endpoint, which only works when Kubernetes is on the same machine (e.g. minikube).

To set this in the launchd plist, add these lines to the `ProgramArguments` array:

```xml
<string>--host-ip</string>
<string>192.168.1.50</string> <!-- Your Mac's reachable IP -->
```

After editing, reload the service:
```bash
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
```

The Metal Agent estimates model memory requirements (weights + KV cache + overhead) before starting llama-server. If the model won't fit in the memory budget, the agent refuses to start it and sets the InferenceService status to `InsufficientMemory`.
By default, the budget is auto-detected based on total system RAM:
| Total RAM | Default Fraction | Budget |
|---|---|---|
| 16 GB | 67% | ~10.7 GB |
| 36 GB | 67% | ~24.1 GB |
| 48 GB | 75% | 36 GB |
| 64 GB | 75% | 48 GB |
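The rule behind the table can be pictured as a small function. This is a sketch of the behavior implied by the table, not the agent's actual code; in particular, the 48 GB threshold for switching from 67% to 75% is an assumption inferred from the rows above:

```go
package main

import "fmt"

// defaultBudgetGB sketches the auto-detection rule from the table above:
// 67% of total RAM below 48 GB, 75% at 48 GB and above.
func defaultBudgetGB(totalGB float64) float64 {
	fraction := 0.67
	if totalGB >= 48 {
		fraction = 0.75
	}
	return totalGB * fraction
}

func main() {
	for _, gb := range []float64{16, 36, 48, 64} {
		fmt.Printf("%2.0f GB RAM -> %.1f GB budget\n", gb, defaultBudgetGB(gb))
	}
}
```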
To override:

```bash
# Use 50% of memory (conservative, leaves room for other apps)
llmkube-metal-agent --memory-fraction 0.5

# Use 90% of memory (dedicated inference machine)
llmkube-metal-agent --memory-fraction 0.9
```

To set this in the launchd plist:

```xml
<string>--memory-fraction</string>
<string>0.75</string> <!-- 75% of system memory -->
```

The Metal Agent exposes an HTTP server on 127.0.0.1:9090 (configurable via `--port`) with health check and Prometheus metrics endpoints. The server binds to localhost only; to expose it for remote Prometheus scraping, use a reverse proxy or SSH tunnel.
| Endpoint | Purpose | Success | Failure |
|---|---|---|---|
| `GET /healthz` | Liveness probe: agent process is alive | Always 200 | — |
| `GET /readyz` | Readiness probe: at least one process healthy (or no processes) | 200 | 503 (all unhealthy) |
| `GET /metrics` | Prometheus metrics | 200 | — |
| Metric | Type | Description |
|---|---|---|
| `llmkube_metal_agent_managed_processes` | Gauge | Number of llama-server processes currently managed |
| `llmkube_metal_agent_process_healthy` | Gauge | Whether a process is healthy (1) or not (0). Labels: `name`, `namespace` |
| `llmkube_metal_agent_process_restarts_total` | Counter | Total process restarts triggered by health monitoring. Labels: `name`, `namespace` |
| `llmkube_metal_agent_health_check_duration_seconds` | Histogram | Duration of health check probes. Labels: `name`, `namespace` |
| `llmkube_metal_agent_memory_budget_bytes` | Gauge | Total memory budget for model serving |
| `llmkube_metal_agent_memory_estimated_bytes` | Gauge | Estimated memory per process. Labels: `name`, `namespace` |

Standard Go runtime and process metrics (`go_*`, `process_*`) are also available.
The agent polls each managed llama-server process every 30 seconds via its `/health` endpoint. On failure:

- The process is marked unhealthy (`Healthy=false`, the `process_healthy` gauge is set to 0)
- The agent re-fetches the InferenceService from Kubernetes
- `ensureProcess()` is called to restart the llama-server
- The `process_restarts_total` counter is incremented
When a previously unhealthy process recovers, it is marked healthy again automatically.
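The probe itself is a plain HTTP GET. Here is a minimal sketch of that check (not the agent's actual code; a local test server stands in for a llama-server `/health` endpoint, and the 5-second timeout is an assumption):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// checkHealth mirrors the probe described above: GET the process's
// /health endpoint and treat anything other than HTTP 200 as unhealthy.
func checkHealth(url string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Stand-in for a managed llama-server process.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/health" {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusNotFound)
	}))
	defer srv.Close()

	fmt.Println("healthy:", checkHealth(srv.URL+"/health")) // healthy: true
}
```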
The health server binds to 127.0.0.1 by default. If Prometheus runs on the same Mac, scrape directly:

```yaml
scrape_configs:
  - job_name: 'llmkube-metal-agent'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'metal-agent'
```

For remote Prometheus, use an SSH tunnel: `ssh -L 9090:localhost:9090 <your-mac>`.
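The `process_healthy` gauge lends itself to a simple alerting rule. A sketch (the group and alert names here are illustrative, not part of LLMKube):

```yaml
groups:
  - name: llmkube-metal-agent
    rules:
      - alert: MetalAgentProcessUnhealthy
        expr: llmkube_metal_agent_process_healthy == 0
        for: 2m
        annotations:
          summary: "llama-server {{ $labels.namespace }}/{{ $labels.name }} is unhealthy"
```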
Quick verification:

```bash
# Check all endpoints
curl http://localhost:9090/healthz   # → "ok"
curl http://localhost:9090/readyz    # → "ready" or "not ready"
curl http://localhost:9090/metrics   # → Prometheus text format

# Check a specific metric
curl -s http://localhost:9090/metrics | grep llmkube_metal_agent_managed_processes
```

```bash
# Check logs
cat /tmp/llmkube-metal-agent.log

# Verify llama-server is installed
which llama-server

# Verify Metal support
llmkube-metal-agent --version
```

```bash
# Verify GPU info
system_profiler SPDisplaysDataType

# Check for Metal support
system_profiler SPDisplaysDataType | grep "Metal"
```

The Metal Agent performs a pre-flight memory check before starting each model. If the estimated memory exceeds the budget, the InferenceService status will show `InsufficientMemory`:

```bash
# Check the scheduling status
kubectl get inferenceservices -o wide

# View the detailed message
kubectl get isvc <name> -o jsonpath='{.status.schedulingMessage}'
```

To resolve:
- Use a smaller quantization (e.g. Q4_K_M instead of Q8_0) to reduce model weight size
- Reduce context size in the InferenceService spec to lower KV cache requirements
- Increase the memory fraction with `--memory-fraction 0.9` if this is a dedicated inference machine
- Close other applications to free unified memory
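To see why context size matters, note that the KV cache part of the estimate grows linearly with context length. A back-of-the-envelope sketch using the common grouped-query-attention formula (the model dimensions below are assumed Llama-8B-like values for illustration; the agent's actual estimator may differ):

```go
package main

import "fmt"

// kvCacheBytes approximates KV cache size:
// 2 (K and V) * layers * contextLen * kvHeads * headDim * bytesPerElem.
func kvCacheBytes(layers, contextLen, kvHeads, headDim, bytesPerElem int) int {
	return 2 * layers * contextLen * kvHeads * headDim * bytesPerElem
}

func main() {
	// Assumed dims: 32 layers, 8 KV heads, head dim 128, f16 cache (2 bytes).
	for _, ctx := range []int{4096, 8192, 32768} {
		b := kvCacheBytes(32, ctx, 8, 128, 2)
		fmt.Printf("ctx %5d -> %.2f GiB KV cache\n", ctx, float64(b)/(1<<30))
	}
}
```

With these assumed dimensions, halving the context from 8192 to 4096 frees roughly half a GiB of the budget.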
```bash
# Verify kubectl can reach your cluster
kubectl get nodes

# Check which context is active
kubectl config current-context

# Check the kubeconfig path
echo $KUBECONFIG

# If using minikube locally
minikube status
```

```bash
# Verify --host-ip was set correctly.
# The IP must be reachable from the K8s nodes:
ping <your-mac-ip>   # run from a K8s node

# Check that the endpoint was registered with the right IP
kubectl get endpoints -l llmkube.dev/accelerator=metal

# Verify the firewall isn't blocking the llama-server port (default 8080+).
# macOS may prompt to allow incoming connections on first run.

# If using Tailscale / WireGuard, verify the tunnel is up
tailscale status   # or: wg show
```

```bash
# Using the Makefile
make uninstall-metal-agent

# Or manually
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
sudo rm /usr/local/bin/llmkube-metal-agent
rm ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
```

- Metal Agent runs as a native macOS process (not in Kubernetes)
- Watches for InferenceService resources in Kubernetes
- Downloads models from HuggingFace when needed
- Validates that the model fits in the system's memory budget
- Spawns llama-server processes with Metal acceleration
- Registers service endpoints back to Kubernetes
- Monitors process health every 30s and auto-restarts on failure
- Exposes health checks and Prometheus metrics on port 9090
- Pods access the Metal-accelerated inference via Service endpoints
K8s runs on a Linux server or cloud; the Mac dedicates all resources to inference:

```
┌──────────────────────────────┐            ┌──────────────────────────────┐
│     Linux Server / Cloud     │            │       macOS (Your Mac)       │
│                              │            │                              │
│  ┌────────────────────────┐  │   LAN/     │  ┌────────────────────────┐  │
│  │  Kubernetes            │  │   VPN/     │  │  Metal Agent           │  │
│  │  LLMKube Operator      │  │   TLS      │  │  --host-ip <mac-ip>    │  │
│  │  InferenceService CRD  │◄─┼────────────┼─►│  Watches K8s API       │  │
│  │  Service → Mac IP      │  │            │  │  Spawns llama-server   │  │
│  └────────────────────────┘  │            │  └────────────────────────┘  │
│                              │            │             ↓                │
│                              │            │  ┌────────────────────────┐  │
│                              │            │  │  llama-server (Metal)  │  │
│                              │            │  │  Direct GPU access ✅  │  │
│                              │            │  │  All unified memory    │  │
│                              │            │  └────────────────────────┘  │
└──────────────────────────────┘            └──────────────────────────────┘
```
Everything on one machine — simpler, but minikube consumes resources:

```
┌─────────────────────────────────────────────────┐
│                 macOS (Your Mac)                │
│                                                 │
│  ┌──────────────────────────────────────────┐   │
│  │  Minikube (Kubernetes in VM)             │   │
│  │  - Creates InferenceService CRD          │   │
│  │  - Service points to host                │   │
│  └──────────────────────────────────────────┘   │
│                      ↓                          │
│  ┌──────────────────────────────────────────┐   │
│  │  Metal Agent (Native Process)            │   │
│  │  - Watches K8s for InferenceService      │   │
│  │  - Spawns llama-server with Metal        │   │
│  └──────────────────────────────────────────┘   │
│                      ↓                          │
│  ┌──────────────────────────────────────────┐   │
│  │  llama-server (Metal Accelerated)        │   │
│  │  - Runs on localhost:8080+               │   │
│  │  - Direct Metal GPU access ✅            │   │
│  └──────────────────────────────────────────┘   │
└─────────────────────────────────────────────────┘
```
The Metal Agent supports an alternative runtime using oMLX, an MLX-based inference server for Apple Silicon. oMLX provides roughly 40% faster generation than llama-server's Metal backend on the same hardware.

Install oMLX via Homebrew:

```bash
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx
```

Download an MLX-format model (models from the mlx-community HuggingFace org):

```bash
pip install huggingface-hub
huggingface-cli download mlx-community/Llama-3.2-3B-Instruct-4bit \
  --local-dir ~/.omlx/models/Llama-3.2-3B-Instruct-4bit
```

Start the Metal Agent with the oMLX runtime:

```bash
llmkube-metal-agent --runtime omlx --model-store ~/.omlx/models
```

Deploy an MLX model:
```bash
kubectl apply -f - <<EOF
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-3b-mlx
spec:
  source: /path/to/models/Llama-3.2-3B-Instruct-4bit
  format: mlx
  hardware:
    accelerator: metal
    gpu:
      enabled: true
      count: 1
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-3b-mlx
spec:
  modelRef: llama-3b-mlx
  replicas: 1
EOF
```

The agent will start the oMLX daemon, load the model, and register the endpoint.
| | llama-server | oMLX |
|---|---|---|
| Model format | GGUF | MLX (safetensors) |
| Process model | One per model | One daemon for all models |
| Memory management | Pre-flight estimation | LRU eviction |
| Metrics | Prometheus native | JSON only |
```bash
# --omlx-port: oMLX server port (default: 8000)
# --omlx-bin:  path to the omlx binary (auto-detected from Homebrew if not set)
llmkube-metal-agent \
  --runtime omlx \
  --model-store ~/.omlx/models \
  --omlx-port 8000 \
  --omlx-bin /path/to/omlx
```

The Metal Agent also supports Ollama as a runtime backend. Since Ollama 0.19 uses MLX natively on Apple Silicon, this gives you fast inference with the tool most Mac users already have installed.

Install Ollama if you haven't already:

```bash
brew install ollama
```

Start Ollama (if it is not already running as a menu bar app):

```bash
ollama serve
```

Start the Metal Agent with the Ollama runtime:

```bash
llmkube-metal-agent --runtime ollama
```

Deploy a model. The agent pulls the model through Ollama automatically:

```bash
llmkube deploy llama-3.2-3b --gpu --accelerator metal
```

The agent maps LLMKube catalog names to Ollama model tags (e.g., `llama-3.2-3b` becomes `llama3.2:3b`). If the model isn't already downloaded, Ollama pulls it from the Ollama registry.
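The name mapping can be pictured as turning the trailing size segment into the tag and dropping the remaining hyphens. This is a sketch of the observable behavior for the one example given above; the real agent may use an explicit lookup table instead:

```go
package main

import (
	"fmt"
	"strings"
)

// ollamaTag sketches the catalog-name-to-Ollama-tag mapping: the segment
// after the last hyphen becomes the tag, remaining hyphens are dropped.
func ollamaTag(catalog string) string {
	i := strings.LastIndex(catalog, "-")
	if i < 0 {
		return catalog + ":latest"
	}
	name := strings.ReplaceAll(catalog[:i], "-", "")
	return name + ":" + catalog[i+1:]
}

func main() {
	fmt.Println(ollamaTag("llama-3.2-3b")) // llama3.2:3b
}
```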
| | llama-server | oMLX | Ollama |
|---|---|---|---|
| Model format | GGUF | MLX | GGUF (via Ollama registry) |
| Model download | Manual / init container | Manual | Automatic (`/api/pull`) |
| Install base | llama.cpp users | Small | Most Mac users |
| CRD changes needed | None | MLX format | None |
```bash
llmkube-metal-agent \
  --runtime ollama \
  --ollama-port 11434   # Ollama server port (default: 11434)
```

Expected performance on an M4 Max (32 GPU cores):
- Llama 3.2 3B: 80-120 tok/s (llama-server), ~115 tok/s (oMLX/Ollama MLX)
- Llama 3.1 8B: 40-60 tok/s (llama-server)
- Mistral 7B: 45-65 tok/s (llama-server)
oMLX and Ollama (0.19+) both use Apple's MLX framework for Apple Silicon inference.
- Agent runs as your user (not root)
- Models stored in `/tmp/llmkube-models` (configurable)
- Processes bind to localhost only
- Service endpoints use ClusterIP (not exposed externally)
- GitHub Issues: https://github.com/defilantech/llmkube/issues
- Documentation: https://github.com/defilantech/llmkube#metal-support