nvmeof: fix CSI node plugin crash on immutable Linux distributions #6165
Pull request overview
Adjusts NVMe-oF node plugin initialization to avoid crashing on immutable Linux distributions (e.g., Talos) where NVMe components may be built into the kernel and not available as loadable .ko modules.
Changes:
- Stop explicitly modprobing nvme_fabrics; only ensure nvme_tcp is loaded.
- Validate that the NVMe fabrics framework is operational by checking for /dev/nvme-fabrics.
On distributions like Talos Linux, NVMe modules are compiled directly into the kernel (CONFIG_NVME_TCP=y, CONFIG_NVME_FABRICS=y) instead of being loadable .ko files. Explicitly calling modprobe on nvme_fabrics fails on these systems since there is no .ko file present, causing the CSI node plugin to crash on startup.

Removing nvme_fabrics from the module load list is safe because it is always a dependency of nvme_tcp. On normal distributions, modprobe loads it automatically as part of the nvme_tcp dependency chain. On immutable distributions, it is already baked into the kernel.

We verify the fabrics framework is functional after loading nvme_tcp by checking that /dev/nvme-fabrics exists. This device node is created by the kernel on init regardless of whether nvme_fabrics was loaded as a module or compiled in, making it a reliable indicator that NVMe-oF TCP is ready to use.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
Related issues
Fixes: #6158