
Persistent model storage - avoid re-downloading models #52

@Defilan

Description

Problem

Currently, when a Model or InferenceService is deleted and recreated, the model file is re-downloaded from its source URL. For large models (13B-70B parameters), this means:

  • 26-40GB+ downloads each time
  • 10-30+ minutes waiting for downloads
  • Wasted bandwidth and potential rate limiting
  • Poor benchmarking experience - can't iterate quickly

Proposed Solution

Implement persistent model storage using Kubernetes PersistentVolumeClaims (PVCs):

Option 1: Shared Model Cache PVC

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llmkube-model-cache
  namespace: llmkube-system
spec:
  accessModes:
    - ReadWriteMany  # NFS or similar for shared access
  resources:
    requests:
      storage: 100Gi
```

Models are downloaded once to the cache, and pods mount it read-only.

Option 2: Per-Model PVCs

Each Model resource gets its own PVC that persists across InferenceService deletions.
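A minimal sketch of what a per-model claim could look like, assuming the controller derives the PVC name and label from the Model resource (the name, label key, and size below are illustrative, not part of the current API):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # Illustrative name derived from the Model resource. Owned by the Model,
  # not the InferenceService, so it survives InferenceService deletion.
  name: llmkube-model-llama-3-70b
  namespace: llmkube-system
  labels:
    llmkube.io/model: llama-3-70b  # hypothetical label for cache lookup
spec:
  accessModes:
    - ReadWriteOnce  # one writer per model is sufficient
  resources:
    requests:
      storage: 50Gi  # sized to the individual model file
```

Compared to Option 1, this avoids the need for an RWX-capable storage class, at the cost of one PVC per model.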

Option 3: Node-local Cache

Use hostPath volumes or local PersistentVolumes to cache models on GPU nodes (fastest access, but the cache is tied to a specific node).
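One way this could be provisioned is a local PersistentVolume per GPU node; the storage class, host path, and node name below are assumptions for illustration:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llmkube-model-cache-gpu-node-1  # illustrative; one PV per node
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain  # keep cached models if the PVC goes away
  storageClassName: local-storage        # assumed pre-created class
  local:
    path: /var/lib/llmkube/models        # hypothetical host directory
  nodeAffinity:                          # local PVs must be pinned to a node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["gpu-node-1"]     # illustrative node name
```

Pods scheduled onto other nodes would not see this cache, which is the trade-off Option 3 accepts for local-disk speed.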

Implementation Details

  1. Model Controller Changes:

    • Check if model exists in cache before downloading
    • Store models in <cache-pvc>/models/<model-hash>/model.gguf
    • Use SHA256 of source URL as cache key
  2. InferenceService Controller Changes:

    • Mount model cache as read-only volume
    • Reference cached model path instead of downloading
  3. CLI Changes:

    • llmkube cache list - Show cached models
    • llmkube cache clear - Clear model cache
    • llmkube cache preload <model-id> - Pre-download model to cache
  4. Helm Chart Changes:

    • Add PVC template for model cache
    • Configurable storage class and size
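For item 2, the read-only mount might look like the following fragment of a generated pod spec (volume name, claim name, mount path, and flag are illustrative):

```yaml
# Sketch of the volume wiring an InferenceService controller could emit.
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: llmkube-model-cache
      readOnly: true
containers:
  - name: inference
    # ...image, GPU resources, etc.
    volumeMounts:
      - name: model-cache
        mountPath: /models
        readOnly: true
    args:
      # Point the server at the cached file instead of downloading.
      - "--model=/models/<model-hash>/model.gguf"
```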
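For item 4, a hypothetical `values.yaml` fragment for the Helm chart could expose the storage class and size (all keys below are assumptions, not existing chart values):

```yaml
# Hypothetical model-cache configuration for the llmkube chart.
modelCache:
  enabled: true
  storageClassName: nfs-client  # illustrative; any RWX-capable class for Option 1
  accessModes:
    - ReadWriteMany
  size: 100Gi
  existingClaim: ""             # set to reuse a pre-provisioned PVC instead
```

An `existingClaim` escape hatch is a common chart convention that would also support air-gapped deployments, where the cache is pre-populated out of band.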

Benefits

  • Faster iteration - Deploy/delete/redeploy in seconds
  • Bandwidth savings - Download once, use many times
  • Better benchmarking - Quick model switching
  • Cost reduction - Less egress from HuggingFace

Related

  • Roadmap Q1 2026: "Persistent model storage (stop re-downloading!)"
  • Supports air-gapped deployments (pre-populate cache)
  • Enables llmkube benchmark --catalog to run efficiently

Success Criteria

  • Model downloaded only once per unique source URL
  • Deleting InferenceService preserves cached model
  • Cache survives controller restarts
  • CLI commands for cache management
  • Documentation for cache configuration
