## Problem
Currently, when a Model or InferenceService is deleted and recreated, the model file is re-downloaded from the source URL. For large models (13B-70B), this means:
- 26-40GB+ downloads each time
- 10-30+ minutes waiting for downloads
- Wasted bandwidth and potential rate limiting
- Poor benchmarking experience - can't iterate quickly
## Proposed Solution
Implement persistent model storage using Kubernetes PersistentVolumeClaims (PVCs):
### Option 1: Shared Model Cache PVC
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llmkube-model-cache
  namespace: llmkube-system
spec:
  accessModes:
    - ReadWriteMany # NFS or similar for shared access
  resources:
    requests:
      storage: 100Gi
```
Models are downloaded once to the cache, and pods mount it read-only.
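As a sketch of the read-only mount, an inference pod spec could reference the cache PVC like this (the `model-cache` volume name and `/cache/models` mount path are illustrative, not part of the proposal):

```yaml
# Pod spec fragment (illustrative): mount the shared cache read-only
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: llmkube-model-cache
      readOnly: true
containers:
  - name: inference
    volumeMounts:
      - name: model-cache
        mountPath: /cache/models
        readOnly: true
```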
### Option 2: Per-Model PVCs
Each Model resource gets its own PVC that persists across InferenceService deletions.
### Option 3: Node-local Cache

Use `hostPath` or `local` PVs to cache models on GPU nodes (faster, but node-specific).
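For the node-local variant, a pod could declare the cache as a `hostPath` volume directly (the path below is illustrative); unlike Options 1 and 2, this ties cached models to whichever GPU node the pod is scheduled on:

```yaml
# Pod spec fragment (illustrative): node-local hostPath cache
volumes:
  - name: model-cache
    hostPath:
      path: /var/lib/llmkube/models
      type: DirectoryOrCreate
```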
## Implementation Details
1. **Model Controller Changes:**
   - Check if the model exists in the cache before downloading
   - Store models in `<cache-pvc>/models/<model-hash>/model.gguf`
   - Use the SHA256 of the source URL as the cache key
2. **InferenceService Controller Changes:**
   - Mount the model cache as a read-only volume
   - Reference the cached model path instead of downloading
3. **CLI Changes:**
   - `llmkube cache list` - Show cached models
   - `llmkube cache clear` - Clear the model cache
   - `llmkube cache preload <model-id>` - Pre-download a model to the cache
4. **Helm Chart Changes:**
   - Add a PVC template for the model cache
   - Configurable storage class and size
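The cache-lookup step in the Model controller (steps 1 above) can be sketched as follows. The `<cache-pvc>/models/<model-hash>/model.gguf` layout and the SHA256-of-URL cache key come from this proposal; the function names and the `/cache/models` mount point are illustrative assumptions, not the actual controller code (which would be Go, not Python).

```python
import hashlib
import os

CACHE_ROOT = "/cache/models"  # assumed mount point of the cache PVC


def cache_key(source_url: str) -> str:
    """Derive the cache key as the SHA256 hex digest of the source URL."""
    return hashlib.sha256(source_url.encode("utf-8")).hexdigest()


def cached_model_path(source_url: str) -> str:
    """Path where a cached model would live: <root>/<model-hash>/model.gguf."""
    return os.path.join(CACHE_ROOT, cache_key(source_url), "model.gguf")


def needs_download(source_url: str) -> bool:
    """The controller skips the download when the file is already cached."""
    return not os.path.exists(cached_model_path(source_url))
```

Keying on the URL hash rather than the model name means two Models pointing at the same source file share one cache entry, while any change to the URL naturally produces a fresh entry.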
## Benefits
- **Faster iteration** - Deploy/delete/redeploy in seconds
- **Bandwidth savings** - Download once, use many times
- **Better benchmarking** - Quick model switching
- **Cost reduction** - Less egress from HuggingFace
## Related
- Roadmap Q1 2026: "Persistent model storage (stop re-downloading!)"
- Supports air-gapped deployments (pre-populate cache)
- Enables `llmkube benchmark --catalog` to run efficiently
## Success Criteria

- Deleting and recreating a Model or InferenceService reuses the cached file instead of re-downloading it
- Redeploying a cached large model takes seconds rather than 10-30+ minutes
- `llmkube cache list`, `clear`, and `preload` behave as described above