feat: add KV cache type configuration and extraArgs escape hatch #256

Merged
Defilan merged 2 commits into main from feat/cache-type-extra-args on Apr 1, 2026
Defilan (Member) commented Apr 1, 2026

Summary

  • Add cacheTypeK and cacheTypeV fields to the InferenceService CRD to configure KV cache quantization types (they map to llama.cpp --cache-type-k / --cache-type-v)
  • Add an extraArgs field as an escape hatch for arbitrary llama-server flags
  • Add CLI flags: --cache-type-k, --cache-type-v, --extra-args
  • Controller helpers follow the existing appendContextSizeArgs/appendJinjaArgs patterns

Motivated by TurboQuant benchmarking where we had to build custom wrapper images to inject cache type flags. With these CRD fields, users can configure KV cache quantization directly:

spec:
  modelRef: my-model
  cacheTypeK: q4_0
  cacheTypeV: q4_0
  extraArgs:
    - "--seed"
    - "42"

Non-breaking: all fields are optional with empty defaults.
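
As a rough illustration of why optional fields with empty defaults are non-breaking, the sketch below declares the new fields with omitempty tags and shows that a spec which sets none of them serializes exactly as before. The field names come from this PR; the struct layout, JSON tags, and +optional markers are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InferenceServiceSpec sketches how the new optional fields might be declared.
// The tags and markers below are assumptions, not the merged type definition.
type InferenceServiceSpec struct {
	ModelRef string `json:"modelRef"`

	// CacheTypeK maps to llama.cpp --cache-type-k.
	// +optional
	CacheTypeK string `json:"cacheTypeK,omitempty"`

	// CacheTypeV maps to llama.cpp --cache-type-v.
	// +optional
	CacheTypeV string `json:"cacheTypeV,omitempty"`

	// ExtraArgs holds arbitrary llama-server flags.
	// +optional
	ExtraArgs []string `json:"extraArgs,omitempty"`
}

// marshalSpec returns the JSON form of a spec as a string.
func marshalSpec(spec InferenceServiceSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// A spec written before this change sets none of the new fields.
	out, err := marshalSpec(InferenceServiceSpec{ModelRef: "my-model"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints {"modelRef":"my-model"}
}
```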

Test plan

  • make test passes (324 lines added, 6 new test cases)
  • make manifests generate clean
  • Deploy with --cache-type-k q4_0 --cache-type-v q4_0, verify args in pod spec
  • Deploy without new flags, verify no change in behavior
  • Deploy with --extra-args="--seed,42", verify args appended
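
The last test-plan item passes extra arguments as a single comma-separated value. One plausible way to turn that value into individual arguments is sketched below; splitExtraArgs is a hypothetical helper, and the actual CLI may instead rely on its flag library's slice parsing.

```go
package main

import (
	"fmt"
	"strings"
)

// splitExtraArgs is a hypothetical helper that splits a comma-separated
// flag value into individual arguments, trimming whitespace and dropping
// empty entries.
func splitExtraArgs(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitExtraArgs("--seed,42")) // prints ["--seed" "42"]
}
```

A limitation of this approach is that an argument value containing a comma cannot be expressed; the extraArgs list in the CRD does not have that restriction.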

Fixes #252, #253

Defilan added 2 commits March 31, 2026 20:16
Add cacheTypeK and cacheTypeV fields to InferenceService CRD for
configuring llama.cpp KV cache quantization (--cache-type-k/v flags).
Supports f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1, and iq4_nl types.

Add extraArgs field as an escape hatch for passing arbitrary
llama-server flags not yet exposed as typed CRD fields.

Both features include CLI flags (--cache-type-k, --cache-type-v,
--extra-args), controller arg helpers, and full test coverage.

Non-breaking: all fields are optional with empty defaults. Existing
deployments are unaffected.

Fixes #252, #253

Signed-off-by: Christopher Maher <chris@mahercode.io>

Moves deploy summary printing into its own function to bring
runDeploy cyclomatic complexity below the gocyclo threshold of 30.

Signed-off-by: Christopher Maher <chris@mahercode.io>
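
The cache types listed in the first commit message could be checked with a simple lookup before any flags are emitted. This is only a sketch: validateCacheType and supportedCacheTypes are hypothetical names, and the merged change may enforce the list differently (for example through CRD schema validation).

```go
package main

import "fmt"

// supportedCacheTypes mirrors the list given in the commit message.
var supportedCacheTypes = map[string]struct{}{
	"f16": {}, "f32": {}, "q8_0": {}, "q4_0": {},
	"q4_1": {}, "q5_0": {}, "q5_1": {}, "iq4_nl": {},
}

// validateCacheType is a hypothetical check. An empty value is accepted
// because both fields are optional.
func validateCacheType(field, value string) error {
	if value == "" {
		return nil
	}
	if _, ok := supportedCacheTypes[value]; !ok {
		return fmt.Errorf("%s: unsupported cache type %q", field, value)
	}
	return nil
}

func main() {
	fmt.Println(validateCacheType("cacheTypeK", "q4_0"))
	fmt.Println(validateCacheType("cacheTypeV", "not-a-type"))
}
```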
Defilan merged commit 7a4b855 into main on Apr 1, 2026
16 checks passed
Defilan deleted the feat/cache-type-extra-args branch on April 1, 2026 03:45
github-actions bot mentioned this pull request on Apr 1, 2026
Development

Successfully merging this pull request may close these issues.

feat: add KV cache type configuration to InferenceService CRD