nvmeof: fix CSI node plugin crash on immutable Linux distributions #6165

Merged
mergify[bot] merged 1 commit into ceph:devel from gadididi:nvmeof/fix_loading_nvme_kmod
Mar 10, 2026

Conversation

@gadididi gadididi commented Mar 9, 2026

Contributor

On distributions such as Talos Linux, the NVMe modules are compiled directly into the kernel (CONFIG_NVME_TCP=y, CONFIG_NVME_FABRICS=y) rather than shipped as loadable .ko files.
Explicitly calling modprobe on nvme_fabrics fails on these systems because no .ko file is present,
which causes the CSI node plugin to crash on startup.
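
To illustrate the built-in versus loadable distinction, here is a minimal sketch of one way to tell whether a module is compiled in: scanning the contents of the kernel's `modules.builtin` list (normally found under `/lib/modules/<release>/`). This is not what the PR does. The helper names `normalizeModuleName` and `isBuiltin` are hypothetical, and the file may not be visible from inside the node plugin container at all.

```go
package main

import (
	"bufio"
	"fmt"
	"path"
	"strings"
)

// normalizeModuleName maps dashes to underscores, because the kernel treats
// "nvme-fabrics" and "nvme_fabrics" as the same module name.
// Hypothetical helper, not part of ceph-csi.
func normalizeModuleName(name string) string {
	return strings.ReplaceAll(name, "-", "_")
}

// isBuiltin reports whether module appears in the given modules.builtin
// content, which lists one object path per line, for example
// "kernel/drivers/nvme/host/nvme-fabrics.ko".
// Hypothetical helper, not part of ceph-csi.
func isBuiltin(builtinList, module string) bool {
	want := normalizeModuleName(module)
	scanner := bufio.NewScanner(strings.NewReader(builtinList))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		// Reduce the path to the bare module name before comparing.
		base := strings.TrimSuffix(path.Base(line), ".ko")
		if normalizeModuleName(base) == want {
			return true
		}
	}
	return false
}

func main() {
	// Sample content; a real caller would read the file from the host.
	sample := "kernel/drivers/nvme/host/nvme-core.ko\n" +
		"kernel/drivers/nvme/host/nvme-fabrics.ko\n" +
		"kernel/drivers/nvme/host/nvme-tcp.ko\n"

	fmt.Println(isBuiltin(sample, "nvme_fabrics"))
	fmt.Println(isBuiltin(sample, "nvme_rdma"))
}
```

A check like this depends on the host's module directory being mounted into the container, which is one reason a runtime probe of the device node can be the more robust option.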

Removing nvme_fabrics from the module load list is safe because nvme_tcp always depends on it.
On conventional distributions, modprobe loads it automatically as part of the nvme_tcp dependency chain; on immutable distributions, it is already built into the kernel.

After loading nvme_tcp, we verify that the fabrics framework is functional by checking that /dev/nvme-fabrics exists.
The kernel creates this device node when nvme_fabrics initializes, regardless of whether it was loaded as a module or compiled in, which makes it a reliable indicator that NVMe-oF TCP is ready to use.
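
The load-then-verify flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: `moduleLoader`, `ensureFabricsReady`, and `fabricsDevice` are hypothetical names, and the loader is injected so the sketch does not depend on how the plugin actually invokes modprobe.

```go
package main

import (
	"fmt"
	"os"
)

// fabricsDevice is the device node named in the PR description.
const fabricsDevice = "/dev/nvme-fabrics"

// moduleLoader abstracts the module loading step (for example a modprobe
// call). Hypothetical type, used only for this sketch.
type moduleLoader func(module string) error

// ensureFabricsReady loads only nvme_tcp and then checks that the fabrics
// device node exists. nvme_fabrics is deliberately not loaded explicitly.
// Hypothetical function, not the ceph-csi implementation.
func ensureFabricsReady(load moduleLoader, devicePath string) error {
	if err := load("nvme_tcp"); err != nil {
		return fmt.Errorf("failed to load nvme_tcp: %w", err)
	}
	// The device node is the readiness signal, whether nvme_fabrics was
	// loaded as a module or compiled into the kernel.
	if _, err := os.Stat(devicePath); err != nil {
		return fmt.Errorf("NVMe fabrics device %q is not available: %w", devicePath, err)
	}
	return nil
}

func main() {
	// A stand-in loader that only logs; a real one would run modprobe.
	logOnly := func(module string) error {
		fmt.Println("would load module:", module)
		return nil
	}

	if err := ensureFabricsReady(logOnly, fabricsDevice); err != nil {
		fmt.Println("NVMe-oF TCP not ready:", err)
		return
	}
	fmt.Println("NVMe-oF TCP ready")
}
```

Returning an error instead of exiting leaves the decision about whether to abort startup to the caller.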

Related issues

Fixes: #6158

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next major release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@gadididi gadididi self-assigned this Mar 9, 2026
@gadididi gadididi added the component/nvme-of Issues and PRs related to NVMe-oF. label Mar 9, 2026
@gadididi gadididi requested review from Copilot and nixpanic March 9, 2026 11:39
@mergify mergify Bot added the bug Something isn't working label Mar 9, 2026

Copilot AI left a comment

Pull request overview

Adjusts NVMe-oF node plugin initialization to avoid crashing on immutable Linux distributions (e.g., Talos) where NVMe components may be built into the kernel and not available as loadable .ko modules.

Changes:

  • Stop explicitly modprobing nvme_fabrics; only ensure nvme_tcp is loaded.
  • Validate that the NVMe fabrics framework is operational by checking for /dev/nvme-fabrics.

Comment thread internal/nvmeof/nvmeof_initiator.go Outdated
Comment thread internal/nvmeof/nvmeof_initiator.go Outdated
Comment thread internal/nvmeof/nvmeof_initiator.go Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@nixpanic nixpanic requested a review from a team March 10, 2026 08:38
@nixpanic

Member

@mergifyio rebase, wait for the rebase to happen and add the ok-to-test label after #6136 is merged.

@nixpanic nixpanic added DNM DO NOT MERGE and removed DNM DO NOT MERGE labels Mar 10, 2026
@nixpanic

Member

@Mergifyio rebase

On distributions like Talos Linux,
NVMe modules are compiled directly into the kernel
(CONFIG_NVME_TCP=y, CONFIG_NVME_FABRICS=y) instead of
being loadable .ko files.
Explicitly calling modprobe on nvme_fabrics fails on
these systems since there is no .ko file present,
causing the CSI node plugin to crash on startup.

Removing nvme_fabrics from the module load
list is safe because it is always a dependency
of nvme_tcp.
On normal distributions modprobe loads it automatically
as part of the nvme_tcp dependency chain. On immutable
distributions it is already baked into the kernel.

We verify the fabrics framework is functional
after loading nvme_tcp by checking that
/dev/nvme-fabrics exists.
This device node is created by the kernel on init
regardless of whether nvme_fabrics was loaded as a
module or compiled in, making it a
reliable indicator that NVMe-oF TCP is ready to use.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
@ceph-csi-bot ceph-csi-bot force-pushed the nvmeof/fix_loading_nvme_kmod branch from 818ef22 to 94fefd7 Compare March 10, 2026 12:26
@nixpanic nixpanic added the ok-to-test Label to trigger E2E tests label Mar 10, 2026
@mergify

mergify Bot commented Mar 10, 2026

Contributor

rebase

✅ Branch has been successfully rebased

@ceph-csi-bot

Collaborator

/test ci/centos/upgrade-tests-cephfs

@ceph-csi-bot

Collaborator

/test ci/centos/k8s-e2e-external-storage/1.35

@ceph-csi-bot

Collaborator

/test ci/centos/k8s-e2e-external-storage/1.34

@ceph-csi-bot

Collaborator

/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot

Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.35

@ceph-csi-bot

Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.34

@ceph-csi-bot

Collaborator

/test ci/centos/k8s-e2e-external-storage/1.33

@ceph-csi-bot

Collaborator

/test ci/centos/mini-e2e/k8s-1.35

@ceph-csi-bot

Collaborator

/test ci/centos/mini-e2e/k8s-1.34

@ceph-csi-bot

Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.33

@ceph-csi-bot

Collaborator

/test ci/centos/mini-e2e/k8s-1.33

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Mar 10, 2026
@nixpanic nixpanic added the ci/skip/multi-arch-build skip building on multiple architectures label Mar 10, 2026
@mergify mergify Bot merged commit 992d720 into ceph:devel Mar 10, 2026
40 of 43 checks passed
@mergify

mergify Bot commented Mar 10, 2026

Contributor

Merge Queue Status

  • Entered queue: 2026-03-10 16:16 UTC · Rule: default
  • Checks passed · in-place
  • Merged: 2026-03-10 16:16 UTC · at 94fefd7dd92769e5622e9f4c04fe9e27e63527bd

This pull request spent 10 seconds in the queue, with no time running CI.

Labels

  • backport-to-release-v3.16
  • bug: Something isn't working
  • ci/skip/multi-arch-build: skip building on multiple architectures
  • component/nvme-of: Issues and PRs related to NVMe-oF.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NVMe-oF nodeplugin fails to start on Talos Linux due to hardcoded modprobe (modules are built-in)

5 participants