fix(mock-backend): write toolkit-ready marker so operand pods unblock (RUN-38195) by eliranw · Pull Request #198 · run-ai/fake-gpu-operator

eliranw · 2026-05-17T09:44:28Z

Summary

Without /run/nvidia/validations/toolkit-ready on the host, every gpu-operator operand DaemonSet (nvidia-device-plugin-daemonset, gpu-feature-discovery, nvidia-operator-validator) sits at Init:0/1 forever on mock-NVML nodes. As a result, nvidia.com/gpu is never advertised and no GPU workload schedules, so the mock backend is functionally broken without this marker.

This PR makes the per-pool nvml-mock DaemonSet write the marker after its setup.sh succeeds, with symmetric removal in preStop.

Why we can't fix this in gpu-operator instead

On real-toolkit clusters that marker is written by gpu-operator's validator DS after exec nvidia-smi succeeds. On mock-NVML the validator's toolkit-validation init container can't exec nvidia-smi: its container has no host mount for /run/nvidia/driver and no CDI injection (it runs as an init container, before kubelet allocates devices). The validator state in gpu-operator is hardcoded to return true in controllers/state_manager.go, so there is no values-level way to disable it.

So either upstream nvml-mock writes the marker (filed as NVIDIA/k8s-test-infra#346), or we write it ourselves in the per-pool DS spec. This PR does the latter as an interim until #346 lands.

Diff

CHANGELOG.md                                                       | 12 ++++++
docs/mock-backend.md                                               | 39 ++++++++++++++++++
internal/status-updater/controllers/mock/resources.go              | 20 +++++++--
internal/status-updater/controllers/mock/resources_test.go         | 24 ++++++++++

Controller (internal/status-updater/controllers/mock/resources.go): the nvml-mock DS container's Command changes from /scripts/entrypoint.sh to a shell wrapper that runs setup.sh && touch marker && exec sleep infinity. PreStop adds rm -f of the marker before invoking upstream cleanup.sh. ~15 lines.
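
A rough sketch of how such a wrapper and preStop command could be assembled is shown here. It is not the code in resources.go: the function names, the `/scripts/setup.sh` and `/scripts/cleanup.sh` locations, the use of `/bin/sh -c`, and the `;` separator in preStop are assumptions. Only the marker path and the setup, touch, sleep ordering come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// Marker path as stated in the PR description.
const toolkitReadyMarker = "/run/nvidia/validations/toolkit-ready"

// Assumed script locations; the PR names setup.sh and cleanup.sh but not
// their full paths.
const (
	setupScript   = "/scripts/setup.sh"
	cleanupScript = "/scripts/cleanup.sh"
)

// wrapperCommand (hypothetical) returns a container command that runs setup,
// writes the marker only if setup succeeded, then keeps the container alive.
// It assumes the marker's parent directory already exists in the container.
func wrapperCommand() []string {
	script := strings.Join([]string{
		setupScript,
		"touch " + toolkitReadyMarker,
		"exec sleep infinity",
	}, " && ")
	return []string{"/bin/sh", "-c", script}
}

// preStopCommand (hypothetical) removes the marker first, then runs the
// upstream cleanup regardless of whether the removal succeeded.
func preStopCommand() []string {
	script := "rm -f " + toolkitReadyMarker + "; " + cleanupScript
	return []string{"/bin/sh", "-c", script}
}

func main() {
	fmt.Println(strings.Join(wrapperCommand(), " "))
	fmt.Println(strings.Join(preStopCommand(), " "))
}
```

Chaining with `&&` means the marker is never written when setup fails, so operand pods stay blocked rather than starting against a half-configured node.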

Test: new TestBuildDaemonSet_WritesToolkitReadyMarker verifying the wrapping is in place + preStop symmetry.
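
The kind of check such a test might perform can be sketched as follows. This is an illustration rather than the contents of resources_test.go: `checkMarkerSymmetry` is a hypothetical helper, and the example script strings (including the `/scripts/` prefix) are assumed.

```go
package main

import (
	"fmt"
	"strings"
)

// Marker path as stated in the PR description.
const marker = "/run/nvidia/validations/toolkit-ready"

// checkMarkerSymmetry (hypothetical) returns an error unless the start script
// touches the marker after setup.sh and the preStop script removes the marker
// before cleanup.sh.
func checkMarkerSymmetry(startScript, preStopScript string) error {
	setupIdx := strings.Index(startScript, "setup.sh")
	touchIdx := strings.Index(startScript, "touch "+marker)
	if setupIdx < 0 || touchIdx < 0 || touchIdx < setupIdx {
		return fmt.Errorf("start script must touch %s after setup.sh", marker)
	}
	rmIdx := strings.Index(preStopScript, "rm -f "+marker)
	cleanupIdx := strings.Index(preStopScript, "cleanup.sh")
	if rmIdx < 0 || cleanupIdx < 0 || rmIdx > cleanupIdx {
		return fmt.Errorf("preStop must remove %s before cleanup.sh", marker)
	}
	return nil
}

func main() {
	// Assumed script strings, shaped like the wrapper described in this PR.
	start := "/scripts/setup.sh && touch " + marker + " && exec sleep infinity"
	stop := "rm -f " + marker + "; /scripts/cleanup.sh"
	if err := checkMarkerSymmetry(start, stop); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("marker write and removal are symmetric")
}
```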

Docs (docs/mock-backend.md): adds a Recommended gpu-operator subchart values section documenting toolkit.env: [CREATE_DEVICE_NODES=none] and gfd.enabled: false, plus a Known limitation section explaining the residual validator failure is cosmetic.

CHANGELOG under [Unreleased]/Fixed.

Empirical verification

KIND cluster, single mock pool, gpu-operator subchart enabled with the documented values + this PR's marker write:

| Check | Result |
|---|---|
| `nvml-mock-mock-a` DS pod | `1/1 Running`, marker file present at `/run/nvidia/validations/toolkit-ready` |
| `nvidia-device-plugin-daemonset` | `1/1 Running` (previously stuck at `Init:0/1`) |
| Worker `nvidia.com/gpu` allocatable | `8` |
| Workload pod with `nvidia.com/gpu: 1` + `runtimeClassName: nvidia` | Scheduled, Completed, exit 0 |
| `nvidia-smi` inside workload container | Reports `Mock NVIDIA A100-SXM4-40GB`, driver 550.163.01 |
| `gpu-feature-discovery` | Disabled per recommended values (FGO's status-exporter covers labeling) |
| `nvidia-operator-validator` | Still `CrashLoopBackOff`: documented cosmetic ClusterPolicy NotReady (unfixable from our side) |

When this should be removed

When NVIDIA/k8s-test-infra#346 (or equivalent) lands and a new nvml-mock image is published that writes the marker in setup.sh, the wrapper here becomes redundant. At that point, replace it with the upstream entrypoint and bump the chart's nvmlMock.image.tag. Until then, the test (TestBuildDaemonSet_WritesToolkitReadyMarker) serves as a regression check.

Test plan

go test ./internal/status-updater/controllers/mock/... — new spec + existing specs all pass
make lint — 0 issues
Live KIND verification of the full mock-backend pipeline (smoke test pod with nvidia-smi output)

Links

Upstream PR: NVIDIA/k8s-test-infra#346
Epic: RUN-38195

GPU Operator operand DaemonSets (`nvidia-device-plugin-daemonset`, `gpu-feature-discovery`, `nvidia-operator-validator`) ship with a hardcoded `toolkit-validation` init container that shell-polls for `/run/nvidia/validations/toolkit-ready` indefinitely. On real-toolkit clusters that marker is written by gpu-operator's validator DS after `exec nvidia-smi` succeeds. On mock-NVML nodes the validator's toolkit-validation init container runs in an isolated mount namespace without nvidia-smi access, so it never writes the marker. Operand pods stay at `Init:0/1` forever, `nvidia.com/gpu` is never advertised, no workload schedules. Verified empirically on KIND.

This wraps the per-pool nvml-mock DS's entrypoint with a marker write after `setup.sh` succeeds, with symmetric removal in the preStop hook. Interim while upstream NVIDIA/k8s-test-infra#346 adds the same write to nvml-mock's setup.sh; drop this wrapper when that lands and a new nvml-mock image is published.

Also documents the recommended `gpu-operator` subchart values for mock pools (`toolkit.env: [CREATE_DEVICE_NODES=none]`, `gfd.enabled: false`) and the residual cosmetic `ClusterPolicy NotReady` due to the validator's unfixable-from-our-side `exec nvidia-smi` requirement.

RUN-38195

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>

eliranw requested a review from a team as a code owner May 17, 2026 09:44

eliranw force-pushed the eliranw/RUN-38195-mock-toolkit-ready-marker branch 4 times, most recently from 7b87562 to 5a9a4c4, May 17, 2026 11:40

eliranw force-pushed the eliranw/RUN-38195-mock-toolkit-ready-marker branch from 5a9a4c4 to b87a089, May 17, 2026 11:58

iris-shain-runai approved these changes May 18, 2026

eliranw wants to merge 1 commit into main from eliranw/RUN-38195-mock-toolkit-ready-marker
