fix(mock-backend): write toolkit-ready marker so operand pods unblock (RUN-38195)#198

Open
eliranw wants to merge 1 commit into main from eliranw/RUN-38195-mock-toolkit-ready-marker

eliranw commented May 17, 2026

Summary

Without /run/nvidia/validations/toolkit-ready on the host, every gpu-operator operand DaemonSet (nvidia-device-plugin-daemonset, gpu-feature-discovery, nvidia-operator-validator) sits at Init:0/1 forever on mock-NVML nodes. As a result, nvidia.com/gpu is never advertised and no workload schedules, so the mock backend is functionally broken without this marker.

This PR makes the per-pool nvml-mock DaemonSet write the marker after its setup.sh succeeds, with symmetric removal in preStop.

Why we can't fix this in gpu-operator instead

On real-toolkit clusters, that marker is written by gpu-operator's validator DS after exec nvidia-smi succeeds. On mock-NVML, the validator's toolkit-validation init container can't exec nvidia-smi: its container has no host mount for /run/nvidia/driver and no CDI injection (it is an init container, so it runs before kubelet allocates devices). The validator state in gpu-operator is hardcoded to return true in controllers/state_manager.go, so there is no values-level way to disable it.

So either upstream nvml-mock writes the marker (filed as NVIDIA/k8s-test-infra#346), or we write it ourselves in the per-pool DS spec. This PR does the latter as an interim until #346 lands.

Diff

CHANGELOG.md                                                       | 12 ++++++
docs/mock-backend.md                                               | 39 ++++++++++++++++++
internal/status-updater/controllers/mock/resources.go              | 20 +++++++--
internal/status-updater/controllers/mock/resources_test.go         | 24 ++++++++++

Controller (internal/status-updater/controllers/mock/resources.go): the nvml-mock DS container's Command changes from /scripts/entrypoint.sh to a shell wrapper that runs setup.sh && touch marker && exec sleep infinity. PreStop adds rm -f of the marker before invoking upstream cleanup.sh. ~15 lines.
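
The wrapper described above could be sketched roughly as follows. This is not the code from resources.go: the helper names `startCommand` and `preStopCommand` and the script paths /scripts/setup.sh and /scripts/cleanup.sh are assumptions for illustration, and the sketch assumes the marker's parent directory is already available in the container through a host mount.

```go
package main

import (
	"fmt"
	"strings"
)

// toolkitReadyMarker is the host path the operand init containers wait for.
const toolkitReadyMarker = "/run/nvidia/validations/toolkit-ready"

// Hypothetical script locations; the PR text does not give the full paths.
const (
	setupScript   = "/scripts/setup.sh"
	cleanupScript = "/scripts/cleanup.sh"
)

// startCommand returns a container command that runs setup, writes the
// marker only if setup succeeded, and then keeps the container alive.
func startCommand(setup, marker string) []string {
	script := strings.Join([]string{
		setup,
		"touch " + marker,
		"exec sleep infinity",
	}, " && ")
	return []string{"/bin/sh", "-c", script}
}

// preStopCommand removes the marker first, then runs the upstream cleanup.
// The ';' means cleanup still runs even if the removal reports an error.
func preStopCommand(cleanup, marker string) []string {
	return []string{"/bin/sh", "-c", "rm -f " + marker + "; " + cleanup}
}

func main() {
	fmt.Println(strings.Join(startCommand(setupScript, toolkitReadyMarker), " "))
	fmt.Println(strings.Join(preStopCommand(cleanupScript, toolkitReadyMarker), " "))
}
```

Chaining with && is what ties the marker to a successful setup.sh: if setup fails, the shell exits before touch runs and the operand pods stay blocked instead of starting against a half-configured node.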

Test: new TestBuildDaemonSet_WritesToolkitReadyMarker verifies that the wrapper is in place and that preStop symmetrically removes the marker.

Docs (docs/mock-backend.md): adds a Recommended gpu-operator subchart values section documenting toolkit.env: [CREATE_DEVICE_NODES=none] and gfd.enabled: false, plus a Known limitation section explaining the residual validator failure is cosmetic.

CHANGELOG under [Unreleased]/Fixed.

Empirical verification

KIND cluster, single mock pool, gpu-operator subchart enabled with the documented values + this PR's marker write:

| Check | Result |
| --- | --- |
| nvml-mock-mock-a DS pod | 1/1 Running, marker file present at /run/nvidia/validations/toolkit-ready |
| nvidia-device-plugin-daemonset | 1/1 Running (previously stuck at Init:0/1) |
| Worker nvidia.com/gpu allocatable | 8 |
| Workload pod with nvidia.com/gpu: 1 and runtimeClassName: nvidia | Scheduled, Completed, exit 0 |
| nvidia-smi inside workload container | Reports Mock NVIDIA A100-SXM4-40GB, driver 550.163.01 |
| gpu-feature-discovery | Disabled per recommended values (FGO's status-exporter covers labeling) |
| nvidia-operator-validator | Still CrashLoopBackOff; documented cosmetic ClusterPolicy NotReady (unfixable from our side) |

When this should be removed

When NVIDIA/k8s-test-infra#346 (or equivalent) lands and a new nvml-mock image is published that writes the marker in setup.sh, the wrapper here becomes redundant. At that point, replace it with the upstream entrypoint and bump the chart's nvmlMock.image.tag. Until then, the test (TestBuildDaemonSet_WritesToolkitReadyMarker) serves as a regression check.

Test plan

  • go test ./internal/status-updater/controllers/mock/... — new spec + existing specs all pass
  • make lint — 0 issues
  • Live KIND verification of the full mock-backend pipeline (smoke test pod with nvidia-smi output)


eliranw requested a review from a team as a code owner May 17, 2026 09:44
eliranw force-pushed the eliranw/RUN-38195-mock-toolkit-ready-marker branch 4 times, most recently from 7b87562 to 5a9a4c4, May 17, 2026 11:40
GPU Operator operand DaemonSets (`nvidia-device-plugin-daemonset`,
`gpu-feature-discovery`, `nvidia-operator-validator`) ship with a hardcoded
`toolkit-validation` init container that shell-polls for
`/run/nvidia/validations/toolkit-ready` indefinitely.

On real-toolkit clusters that marker is written by gpu-operator's validator
DS after `exec nvidia-smi` succeeds. On mock-NVML nodes the validator's
toolkit-validation init container runs in an isolated mount namespace
without nvidia-smi access, so it never writes the marker. Operand pods stay
at `Init:0/1` forever, `nvidia.com/gpu` is never advertised, and no
workload schedules. Verified empirically on KIND.

This wraps the per-pool nvml-mock DS's entrypoint with a marker write
after `setup.sh` succeeds, with symmetric removal in the preStop hook.
Interim while upstream NVIDIA/k8s-test-infra#346 adds the same write to
nvml-mock's setup.sh — drop this wrapper when that lands and a new
nvml-mock image is published.

Also documents the recommended `gpu-operator` subchart values for mock
pools (`toolkit.env: [CREATE_DEVICE_NODES=none]`, `gfd.enabled: false`)
and the residual cosmetic `ClusterPolicy NotReady` due to the validator's
unfixable-from-our-side `exec nvidia-smi` requirement.

RUN-38195

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
eliranw force-pushed the eliranw/RUN-38195-mock-toolkit-ready-marker branch from 5a9a4c4 to b87a089, May 17, 2026 11:58