Skip to content

add kueue components as an option#490

Draft
kannon92 wants to merge 2 commits intoNVIDIA:mainfrom
kannon92:issue-486-kueue-component
Draft

add kueue components as an option#490
kannon92 wants to merge 2 commits intoNVIDIA:mainfrom
kannon92:issue-486-kueue-component

Conversation

@kannon92
Copy link
Copy Markdown

@kannon92 kannon92 commented Apr 4, 2026

Summary

Add kueue recipe.

Motivation / Context

Fixes: #486
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@kannon92 kannon92 requested a review from a team as a code owner April 4, 2026 03:27
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link
Copy Markdown
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Kevin, thanks for adding this. A couple things to fix before this is ready.

healthCheck:
assertFile: checks/kueue/health-check.yaml
helm:
defaultRepository: oci://registry.k8s.io/kueue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path is wrong. The bundler constructs {Repository}/{ChartName}, so this resolves to oci://registry.k8s.io/kueue/kueue but the chart actually lives at oci://registry.k8s.io/kueue/charts/kueue.

Should be:

defaultRepository: oci://registry.k8s.io/kueue/charts

You can verify with:

helm show chart oci://registry.k8s.io/kueue/charts/kueue --version v0.17.0


controllerManager:
tolerations:
- operator: Exists
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth checking what the upstream chart defaults are for resource requests and security context. If they look good, a short comment here saying so would help (see kai-scheduler/values.yaml for an example of documenting why values are minimal). If not, we should pin them explicitly.

@kannon92
Copy link
Copy Markdown
Author

kannon92 commented Apr 6, 2026

I'm having a hard time figuring out how to test this on a kind cluster.

I was thinking that I could have a bundle and verify simple kind cluster for Kueue but can't figure out how to do that.

Signed-off-by: Kevin Hannon <kehannon@redhat.com>
@kannon92 kannon92 force-pushed the issue-486-kueue-component branch from 9a20673 to 3f702e0 Compare April 7, 2026 00:30
@ArangoGutierrez
Copy link
Copy Markdown
Contributor

I'm having a hard time figuring out how to test this on a kind cluster.

I was thinking that I could have a bundle and verify simple kind cluster for Kueue but can't figure out how to do that.

In response to this comment, I have created #508 , with it, is easier for contributors to fully test the component in a controlled environment, even if the contributor doesn't have access to a GPU by using the NVML-Mock project.

Hope this helps @kannon92

@yuanchen8911
Copy link
Copy Markdown
Contributor

Cross-Review Summary for PR #490

Reviewers: Codex, CodeRabbit + Integration Analysis
Rounds: 1 + Codex follow-up
Consensus reached: Yes — no confirmed defects

Confirmed Issues

None. The PR is correct as written.

Positive Observations

  • Registry entry follows established patterns — displayName, version quoting, OCI URL, nodeScheduling structure all consistent with existing components
  • Health check mirrors kubeflow-trainer/kai-scheduler pattern (deployment exists + no unhealthy pods in namespace)
  • controllerManager.tolerations: [{operator: Exists}] matches kai-scheduler/nvsentinel/network-operator convention
  • Kueue webhook runs in the same controller-manager pod (verified against upstream chart v0.17.0), so wiring only controllerManager.* scheduling paths is correct
  • All 3 files automatically embedded via existing //go:embed directives — no Go code changes needed
  • No existing kueue references in codebase — clean integration

Cross-review by Claude Code + Codex + CodeRabbit

Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: overlay support for Kueue

4 participants