[Feature]: Leverage kubernetes' live reload functionality for the enterprise license secret #2279

@wesleymerrick

Summary

When the Kubernetes secret referenced by licensingConfig is updated (e.g. a rotated NLS token delivered via ExternalSecret), running nvidia-driver-daemonset pods never receive the updated files, and a manual pod restart is required. The cause is the use of subPath in the volume mounts for both gridd.conf and client_configuration_token.tok.

This feature request asks for automatic reloading of the client token and gridd.conf files without restarting the driver DaemonSet.

Kubernetes version

Any

GPU Operator version

Confirmed on v25.10.0; all subsequent versions likely affected (no changelog notes found up through v26.3.0 at the time of writing)

Current behavior

The operator renders the driver DaemonSet with these volume mounts:

volumes:
  - name: licensing-config-vol
    secret:
      secretName: <Values.driver.licensingConfig.secretName>
      items:
        - key: gridd.conf
          path: gridd.conf
        - key: client_configuration_token.tok
          path: client_configuration_token.tok

volumeMounts:
  - name: licensing-config-vol
    mountPath: /drivers/gridd.conf
    subPath: gridd.conf
    readOnly: true
  - name: licensing-config-vol
    mountPath: /drivers/ClientConfigToken/client_configuration_token.tok
    subPath: client_configuration_token.tok
    readOnly: true

Obstacle to secret propagation

This is a documented Kubernetes limitation. The Kubernetes documentation notes: "A container using a Secret as a subPath volume mount does not receive automated Secret updates." The files stay frozen at the values present at pod startup, even if the referenced secret is updated during the lifetime of the pod.

If the mounts did not use subPath, the kubelet would update the mounted secret files atomically, after a short propagation delay.
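To make the difference concrete, here is a simplified Python model of how a Secret volume publishes updates: files are written into a fresh revision directory, a `..data` symlink is atomically repointed to it, and the user-visible file names resolve through `..data`. This is a sketch, not kubelet code. The file name comes from this issue; the revision names (`rev1`, `rev2`) and the helper functions are made up for illustration, and resolving the path once with `realpath()` only approximates what a subPath bind mount does.

```python
import os
import tempfile
from typing import Dict, Tuple


def write_revision(volume_dir: str, revision: str, files: Dict[str, str]) -> None:
    """Publish files into a new revision directory, then atomically repoint `..data`."""
    rev_dir = os.path.join(volume_dir, revision)
    os.mkdir(rev_dir)
    for name, content in files.items():
        with open(os.path.join(rev_dir, name), "w") as fh:
            fh.write(content)

    # Create the new symlink under a temporary name, then rename it over
    # `..data`. The rename replaces the old symlink in a single atomic step.
    tmp_link = os.path.join(volume_dir, "..data_tmp")
    os.symlink(revision, tmp_link)
    os.replace(tmp_link, os.path.join(volume_dir, "..data"))

    # User-visible names are stable symlinks that resolve through `..data`.
    for name in files:
        visible = os.path.join(volume_dir, name)
        if not os.path.lexists(visible):
            os.symlink(os.path.join("..data", name), visible)


def read_text(path: str) -> str:
    with open(path) as fh:
        return fh.read()


def demo() -> Tuple[str, str, str]:
    name = "client_configuration_token.tok"
    with tempfile.TemporaryDirectory() as volume_dir:
        write_revision(volume_dir, "rev1", {name: "old-token"})
        token_path = os.path.join(volume_dir, name)

        # Approximation of a subPath mount: the file is resolved once, at
        # "container start", and never re-resolved afterwards.
        pinned = os.path.realpath(token_path)
        before = read_text(token_path)

        write_revision(volume_dir, "rev2", {name: "new-token"})
        via_directory_mount = read_text(token_path)  # follows `..data` to rev2
        via_pinned_file = read_text(pinned)          # still the rev1 file
        return before, via_directory_mount, via_pinned_file
```

In this model, a reader that goes through the directory sees the new token after the swap, while a reader pinned to the originally resolved file keeps seeing the old one, which matches the behavior described above.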

Impact

Any workflow that rotates the NLS license token (e.g. via ExternalSecrets, Vault, or a manual kubectl patch) requires a rollout restart of nvidia-driver-daemonset on every affected cluster before the new token takes effect. This is disruptive and could be avoided with the changes proposed below.

Proposed change

Replace the two subPath file mounts with a single directory mount of the entire secret, then update the driver container startup script to read from the mounted directory.

Volume:

volumes:
  - name: licensing-config-vol
    secret:
      secretName: <Values.driver.licensingConfig.secretName>

Mounts:

volumeMounts:
  - name: licensing-config-vol
    mountPath: /drivers/licensing-config  # or whatever name is most appropriate
    readOnly: true

This mounts both gridd.conf and client_configuration_token.tok at /drivers/licensing-config/, and Kubernetes will automatically propagate secret updates to those paths.

The driver container startup logic would need a corresponding update:

  • Read gridd.conf from /drivers/licensing-config/gridd.conf (instead of /drivers/gridd.conf)
  • Read client_configuration_token.tok from /drivers/licensing-config/client_configuration_token.tok (instead of /drivers/ClientConfigToken/client_configuration_token.tok)
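
The lookup change could be sketched as follows. This is only an illustration in Python, not the driver container's actual startup code, and the fallback to the legacy locations is an assumption added here for backward compatibility, not part of the proposal. `resolve_licensing_file` and the two constants are hypothetical names; the paths are the ones listed in this issue.

```python
import os
from typing import Dict, Optional

# Proposed directory mount from this issue (the final name may differ).
NEW_DIR = "/drivers/licensing-config"

# Current subPath mount targets, kept as an assumed fallback.
LEGACY_PATHS = {
    "gridd.conf": "/drivers/gridd.conf",
    "client_configuration_token.tok":
        "/drivers/ClientConfigToken/client_configuration_token.tok",
}


def resolve_licensing_file(
    name: str,
    new_dir: str = NEW_DIR,
    legacy_paths: Dict[str, str] = LEGACY_PATHS,
) -> Optional[str]:
    """Return the path to read `name` from, preferring the directory mount."""
    candidate = os.path.join(new_dir, name)
    # isfile() follows symlinks, so it works with the symlinked files of a
    # Secret volume.
    if os.path.isfile(candidate):
        return candidate
    legacy = legacy_paths.get(name)
    if legacy is not None and os.path.isfile(legacy):
        return legacy
    return None
```

Returning the path rather than the contents leaves it to the caller to re-read the file later, which matters once the file can change during the lifetime of the pod.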

Alternative suggestion: move gridd.conf back to a ConfigMap

This would allow mounting two volumes: one from a Secret for the token and another from a ConfigMap for gridd.conf. However, without subPath to project individual files into place, new mount target directories would likely still be needed, at least for gridd.conf, assuming a ConfigMap mount cannot replace the entire contents of /drivers/.

Symlinks could then link these new mount targets back to the original locations, /drivers/gridd.conf and /drivers/ClientConfigToken/client_configuration_token.tok.

Additional implementation of file watcher in the nvidia-gridd process

Changing the mount in this way is not sufficient for zero-restart token rotation unless nvidia-gridd already watches these files for changes (via inotify or polling). After Kubernetes propagates the updated file, nvidia-gridd still needs to re-read it. One way to add this would be a file-watcher sidecar container or process that signals or restarts nvidia-gridd when it detects changed files in the mounted volume.
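
A minimal sketch of such a watcher, in Python, assuming a polling approach: it hashes the file contents on an interval and invokes a callback when they change. Polling on content is used here because a Secret volume update replaces symlinks rather than writing to the files in place, so a watch placed on the file itself may not fire. The names `fingerprint` and `watch` are hypothetical, and what `on_change` should do (send a signal, restart nvidia-gridd, something else) is left open, since this issue does not establish how nvidia-gridd can be told to reload.

```python
import hashlib
import time
from typing import Callable, Iterable, Optional


def fingerprint(paths: Iterable[str]) -> str:
    """Return a SHA-256 digest over the names and contents of the given files."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(path.encode())
        digest.update(b"\0")
        with open(path, "rb") as fh:
            digest.update(fh.read())
        digest.update(b"\0")
    return digest.hexdigest()


def watch(
    paths: Iterable[str],
    on_change: Callable[[], None],
    interval_seconds: float = 30.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the files and call on_change() each time their contents differ
    from the previous poll. Returns the number of changes detected.

    max_polls=None polls forever; a finite value is useful for testing.
    """
    paths = list(paths)
    last = fingerprint(paths)
    changes = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(interval_seconds)
        polls += 1
        current = fingerprint(paths)
        if current != last:
            last = current
            changes += 1
            on_change()
    return changes
```

A sidecar could call `watch` with the two files under the mounted directory and an `on_change` that performs whatever reload action nvidia-gridd turns out to support.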

Metadata

Assignees

No one assigned

Labels

feature, lifecycle/frozen, needs-triage
