feat: add KV cache type configuration and extraArgs escape hatch by Defilan · Pull Request #256 · defilantech/LLMKube

Defilan · 2026-04-01T03:17:00Z

Summary

Add cacheTypeK and cacheTypeV fields to the InferenceService CRD to configure the KV cache quantization type (maps to llama.cpp --cache-type-k / --cache-type-v)
Add an extraArgs field as an escape hatch for passing arbitrary llama-server flags
CLI flags: --cache-type-k, --cache-type-v, --extra-args
Controller helpers follow the existing appendContextSizeArgs/appendJinjaArgs pattern

Motivated by TurboQuant benchmarking, where we had to build custom wrapper images to inject the cache type flags. With these CRD fields, users can configure KV cache quantization directly:

spec:
  modelRef: my-model
  cacheTypeK: q4_0
  cacheTypeV: q4_0
  extraArgs:
    - "--seed"
    - "42"

Non-breaking: all fields are optional with empty defaults.

Test plan

make test passes (324 lines added, 6 new test cases)
make manifests generate runs clean
Deploy with --cache-type-k q4_0 --cache-type-v q4_0, verify args in pod spec
Deploy without new flags, verify no change in behavior
Deploy with --extra-args="--seed,42", verify args appended

Fixes #252, #253

Add cacheTypeK and cacheTypeV fields to InferenceService CRD for configuring llama.cpp KV cache quantization (--cache-type-k/v flags). Supports f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1, and iq4_nl types.

Add extraArgs field as an escape hatch for passing arbitrary llama-server flags not yet exposed as typed CRD fields.

Both features include CLI flags (--cache-type-k, --cache-type-v, --extra-args), controller arg helpers, and full test coverage.

Non-breaking: all fields are optional with empty defaults. Existing deployments are unaffected.

Fixes #252, #253

Signed-off-by: Christopher Maher <chris@mahercode.io>

Moves deploy summary printing into its own function to bring runDeploy cyclomatic complexity below the gocyclo threshold of 30.

Signed-off-by: Christopher Maher <chris@mahercode.io>

Defilan added 2 commits March 31, 2026 20:16

refactor: extract printDeploySummary to reduce runDeploy complexity

22103fd


Defilan merged commit 7a4b855 into main Apr 1, 2026
16 checks passed

Defilan deleted the feat/cache-type-extra-args branch April 1, 2026 03:45

github-actions bot mentioned this pull request Apr 1, 2026

chore: release 0.5.3 #255

Merged

Defilan mentioned this pull request Apr 1, 2026

feat: add extraArgs escape hatch for arbitrary llama-server flags #253

Closed

Defilan merged 2 commits into main from feat/cache-type-extra-args
