fix(mock-backend): write toolkit-ready marker so operand pods unblock (RUN-38195) #198
Open
eliranw wants to merge 1 commit into
7b87562 to
5a9a4c4
GPU Operator operand DaemonSets (`nvidia-device-plugin-daemonset`, `gpu-feature-discovery`, `nvidia-operator-validator`) ship with a hardcoded `toolkit-validation` init container that shell-polls for `/run/nvidia/validations/toolkit-ready` indefinitely. On real-toolkit clusters that marker is written by gpu-operator's validator DS after `exec nvidia-smi` succeeds. On mock-NVML nodes the validator's `toolkit-validation` init container runs in an isolated mount namespace without `nvidia-smi` access, so it never writes the marker. Operand pods stay at `Init:0/1` forever, `nvidia.com/gpu` is never advertised, and no workload schedules. Verified empirically on KIND.

This wraps the per-pool nvml-mock DS's entrypoint with a marker write after `setup.sh` succeeds, with symmetric removal in the preStop hook. It is an interim measure while upstream NVIDIA/k8s-test-infra#346 adds the same write to nvml-mock's `setup.sh`; drop this wrapper when that lands and a new nvml-mock image is published.

Also documents the recommended `gpu-operator` subchart values for mock pools (`toolkit.env: [CREATE_DEVICE_NODES=none]`, `gfd.enabled: false`) and the residual cosmetic `ClusterPolicy NotReady` due to the validator's unfixable-from-our-side `exec nvidia-smi` requirement.

RUN-38195

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
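For readers unfamiliar with the gate, the blocking behaviour described in the commit message can be sketched in Go. This is only an analogue: the real `toolkit-validation` init container is a shell loop, and `waitForMarker`, `errMarkerTimeout`, and `demoMarkerGate` are hypothetical names introduced here for illustration. The demo uses a temporary directory instead of `/run/nvidia/validations` and a short timeout instead of waiting forever.

```go
package main

import (
	"errors"
	"os"
	"path/filepath"
	"time"
)

// errMarkerTimeout is returned when the marker does not appear in time.
var errMarkerTimeout = errors.New("marker file did not appear before timeout")

// waitForMarker polls until path exists or timeout elapses.
// A timeout of 0 means poll forever, which is what the init container does.
func waitForMarker(path string, interval, timeout time.Duration) error {
	start := time.Now()
	for {
		_, err := os.Stat(path)
		if err == nil {
			return nil // marker present: the pod may leave Init
		}
		if !errors.Is(err, os.ErrNotExist) {
			return err // unexpected filesystem error
		}
		if timeout > 0 && time.Since(start) >= timeout {
			return errMarkerTimeout
		}
		time.Sleep(interval)
	}
}

// demoMarkerGate shows the gate blocking before the marker is written
// and passing afterwards. It works in a temp dir, not on the host path.
func demoMarkerGate() (blockedBefore, unblockedAfter bool, err error) {
	dir, err := os.MkdirTemp("", "validations")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	marker := filepath.Join(dir, "toolkit-ready")

	// No marker yet: the wait times out, like a pod stuck at Init:0/1.
	waitErr := waitForMarker(marker, 5*time.Millisecond, 30*time.Millisecond)
	blockedBefore = errors.Is(waitErr, errMarkerTimeout)

	// Equivalent of `touch marker` in the nvml-mock wrapper.
	f, err := os.Create(marker)
	if err != nil {
		return blockedBefore, false, err
	}
	f.Close()

	unblockedAfter = waitForMarker(marker, 5*time.Millisecond, 30*time.Millisecond) == nil
	return blockedBefore, unblockedAfter, nil
}
```

Calling `demoMarkerGate()` from a `main` function or a test should report the gate as blocked before the write and unblocked after it, which is the transition this PR enables on mock-NVML nodes.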
5a9a4c4 to
b87a089
iris-shain-runai
approved these changes
May 18, 2026
Summary

Without `/run/nvidia/validations/toolkit-ready` on the host, every gpu-operator operand DaemonSet (`nvidia-device-plugin-daemonset`, `gpu-feature-discovery`, `nvidia-operator-validator`) sits at `Init:0/1` forever on mock-NVML nodes. `nvidia.com/gpu` is never advertised. No workload schedules. The mock backend is functionally broken without this marker.

This PR makes the per-pool `nvml-mock` DaemonSet write the marker after its `setup.sh` succeeds, with symmetric removal in `preStop`.

Why we can't fix this in gpu-operator instead
On real-toolkit clusters that marker is written by gpu-operator's validator DS after `exec nvidia-smi` succeeds. On mock-NVML the validator's `toolkit-validation` init container can't `exec nvidia-smi`: its container has no host mount for `/run/nvidia/driver` and no CDI injection (it's an init container that runs before kubelet allocates devices). The validator state in gpu-operator is hardcoded to `return true` in `controllers/state_manager.go`, so there is no values-level disable.

So either upstream nvml-mock writes the marker (filed as NVIDIA/k8s-test-infra#346), or we write it ourselves in the per-pool DS spec. This PR does the latter as an interim until #346 lands.
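A minimal sketch of what "write it ourselves in the per-pool DS spec" could look like, assuming the wrapper is built as a plain `/bin/sh -c` command. This is not the code from `resources.go`: `wrappedCommand` and `preStopCommand` are hypothetical helper names, and the `/scripts/setup.sh` and `/scripts/cleanup.sh` paths are assumptions (only `/scripts/entrypoint.sh` and the marker path appear in this PR). It also assumes the marker directory already exists in the container.

```go
package main

import "fmt"

const (
	// markerPath is the file the toolkit-validation init containers poll for.
	markerPath = "/run/nvidia/validations/toolkit-ready"
	// Script locations are assumptions made for this sketch.
	setupScript   = "/scripts/setup.sh"
	cleanupScript = "/scripts/cleanup.sh"
)

// wrappedCommand returns a container command that runs setup, writes the
// marker only if setup succeeded (&& short-circuits on failure), and then
// idles so the DaemonSet pod stays Running.
func wrappedCommand() []string {
	script := fmt.Sprintf("%s && touch %s && exec sleep infinity", setupScript, markerPath)
	return []string{"/bin/sh", "-c", script}
}

// preStopCommand removes the marker before running the upstream cleanup,
// so operand pods do not see a stale marker once the mock is torn down.
func preStopCommand() []string {
	script := fmt.Sprintf("rm -f %s; %s", markerPath, cleanupScript)
	return []string{"/bin/sh", "-c", script}
}
```

The `&&` chain is the design point: the marker is written only after `setup.sh` exits successfully, and `rm -f` in the preStop command mirrors that write so the marker's lifetime tracks the mock pod's. A regression test can assert on the third element of each slice to check that both halves stay in place.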
Diff

- Controller (`internal/status-updater/controllers/mock/resources.go`): the nvml-mock DS container's `Command` changes from `/scripts/entrypoint.sh` to a shell wrapper that runs `setup.sh && touch marker && exec sleep infinity`. PreStop adds `rm -f` of the marker before invoking upstream `cleanup.sh`. ~15 lines.
- Test: new `TestBuildDaemonSet_WritesToolkitReadyMarker` verifying the wrapping is in place + preStop symmetry.
- Docs (`docs/mock-backend.md`): adds a Recommended gpu-operator subchart values section documenting `toolkit.env: [CREATE_DEVICE_NODES=none]` and `gfd.enabled: false`, plus a Known limitation section explaining the residual validator failure is cosmetic.
- CHANGELOG under `[Unreleased]/Fixed`.

Empirical verification
KIND cluster, single mock pool, gpu-operator subchart enabled with the documented values + this PR's marker write:

- `nvml-mock-mock-a` DS pod: `1/1 Running`, marker file present at `/run/nvidia/validations/toolkit-ready`
- `nvidia-device-plugin-daemonset`: `1/1 Running` (was previously stuck at `Init:0/1`)
- `nvidia.com/gpu` allocatable: 8
- `nvidia.com/gpu: 1` + `runtimeClassName: nvidia`
- `nvidia-smi` inside workload container: Mock NVIDIA A100-SXM4-40GB, driver 550.163.01
- `gpu-feature-discovery`
- `nvidia-operator-validator`: `CrashLoopBackOff`, the documented cosmetic ClusterPolicy NotReady (unfixable from our side)

When this should be removed
When NVIDIA/k8s-test-infra#346 (or equivalent) lands and a new nvml-mock image is published that writes the marker in `setup.sh`, the wrapper here becomes redundant. Replace it with the upstream entrypoint and bump the chart's `nvmlMock.image.tag`. The test (`TestBuildDaemonSet_WritesToolkitReadyMarker`) serves as a regression check until then.

Test plan
- `go test ./internal/status-updater/controllers/mock/...`: new spec + existing specs all pass
- `make lint`: 0 issues
- `nvidia-smi` output)

Links