Feature request: Add API to Query NVLink Remote Link ID #166

@XRFXLP

Problem statement

Current situation:

When querying NVLink topology, go-nvml provides:

device.GetNvLinkRemotePciInfo(linkID) 

which returns the remote device's PciInfo, but this is incomplete because:

  • A GPU can have multiple links connecting to the same remote device (NVSwitch or peer GPU)
  • Each link connects to a different port on that remote device
  • The PCI address alone doesn't tell us which port

For instance:

GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-81..bcd)
	 Link 0: Remote Device 00000008:00:00.0: Link 32
	 Link 1: Remote Device 00000008:00:00.0: Link 33

What's available:

remotePCI0, _ := device.GetNvLinkRemotePciInfo(0) // PciInfo for "00000008:00:00.0"
remotePCI1, _ := device.GetNvLinkRemotePciInfo(1) // PciInfo for "00000008:00:00.0" (same)

What's missing:

remoteLink, _ := device.GetNvLinkRemoteLinkId(0) 
// Should return: 32 for link 0, 33 for link 1

Use Case:

NVSwitch SXid errors show up in the kernel log like this:

nvidia-nvswitch3: SXid (0008:00:00.0: 20034, Fatal, Link 29 LTSSM Fault Up

Q: Which GPU is affected by this NVSwitch Link 29 error?

Required mapping: (NVSwitch_PCI, Remote_Link) -> (GPU_ID, Local_Link)
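One practical wrinkle in that mapping: the SXid line above prints the switch as 0008:00:00.0, while nvidia-smi prints the same device as 00000008:00:00.0. The following is a minimal sketch, not part of go-nvml, of how the (NVSwitch_PCI, Remote_Link) pair could be pulled out of a log line and the bus ID normalized to the 8-digit domain form. The regular expression is written only against the single sample line in this issue; `parseSXid` and `normalizeBusID` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sxidRe matches the sample line from this issue. Other driver versions may
// format SXid messages differently, so treat this pattern as an assumption.
var sxidRe = regexp.MustCompile(`SXid \(([0-9A-Fa-f:.]+?)\)?: (\d+), \w+, Link (\d+)`)

// normalizeBusID rewrites "domain:bus:device.function" with an 8-digit
// upper-case domain, the form nvidia-smi prints.
func normalizeBusID(s string) (string, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 3 {
		return "", fmt.Errorf("unexpected bus ID %q", s)
	}
	devFn := strings.Split(parts[2], ".")
	if len(devFn) != 2 {
		return "", fmt.Errorf("unexpected bus ID %q", s)
	}
	var v [4]uint64
	for i, f := range []string{parts[0], parts[1], devFn[0], devFn[1]} {
		n, err := strconv.ParseUint(f, 16, 32)
		if err != nil {
			return "", fmt.Errorf("bus ID %q: %w", s, err)
		}
		v[i] = n
	}
	return fmt.Sprintf("%08X:%02X:%02X.%X", v[0], v[1], v[2], v[3]), nil
}

// parseSXid extracts the normalized switch bus ID and the switch-side link
// number from one log line.
func parseSXid(line string) (busID string, link int, err error) {
	m := sxidRe.FindStringSubmatch(line)
	if m == nil {
		return "", 0, fmt.Errorf("not an SXid link message: %q", line)
	}
	if busID, err = normalizeBusID(m[1]); err != nil {
		return "", 0, err
	}
	link, err = strconv.Atoi(m[3]) // m[2] is the SXid code, unused here
	return busID, link, err
}

func main() {
	line := "nvidia-nvswitch3: SXid (0008:00:00.0: 20034, Fatal, Link 29 LTSSM Fault Up"
	busID, link, err := parseSXid(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(busID, link) // prints "00000008:00:00.0 29"
}
```

Whichever bus ID form is chosen, the topology map and the log parser have to agree on it, otherwise lookups silently miss.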

Example reverse lookup map:

topology["0008:00:00.0"][29] = {GPUID: 0, LocalLink: 0}
topology["0008:00:00.0"][28] = {GPUID: 0, LocalLink: 1}
topology["0008:00:00.0"][32] = {GPUID: 5, LocalLink: 0}
topology["0008:00:00.0"][33] = {GPUID: 5, LocalLink: 1}

Current workaround

Right now, we depend on nvidia-smi nvlink -R to get output like:

$ nvidia-smi nvlink -R
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-0b..bf)
	 Link 0: Remote Device 00000008:00:00.0: Link 29
	 Link 1: Remote Device 00000008:00:00.0: Link 28
	 Link 2: Remote Device 00000005:00:00.0: Link 34
	 Link 3: Remote Device 00000005:00:00.0: Link 35
	 Link 4: Remote Device 00000007:00:00.0: Link 34
	 Link 5: Remote Device 00000007:00:00.0: Link 35
	 Link 6: Remote Device 0000000A:00:00.0: Link 26
	 Link 7: Remote Device 0000000A:00:00.0: Link 27
	 Link 8: Remote Device 00000006:00:00.0: Link 8
	 Link 9: Remote Device 00000006:00:00.0: Link 9
	 Link 10: Remote Device 00000009:00:00.0: Link 12
	 Link 11: Remote Device 00000009:00:00.0: Link 13

Problems with this approach:

  • Forces otherwise pure-Go applications to shell out to an external binary
  • Requires parsing human-readable output, which is fragile
  • Not portable since it needs nvidia-smi in $PATH
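To make the fragility concrete, here is a rough sketch of the kind of parser the workaround needs. It assumes the output keeps exactly the shape shown above (a `GPU N:` header followed by indented `Link N: Remote Device <bus ID>: Link M` lines); any change in wording or layout breaks it. The names `parseNvlinkR`, `remotePort`, and `gpuLink` are invented for this example, and the sample input is taken from the output quoted in this issue.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

type remotePort struct {
	BusID string
	Link  int
}

type gpuLink struct {
	GPUID     int
	LocalLink int
}

var (
	gpuRe  = regexp.MustCompile(`^GPU (\d+):`)
	linkRe = regexp.MustCompile(`^\s*Link (\d+): Remote Device ([0-9A-Fa-f:.]+): Link (\d+)\s*$`)
)

// parseNvlinkR builds the reverse map (remote port -> GPU link) from the
// text printed by `nvidia-smi nvlink -R`. Lines it does not recognize are skipped.
func parseNvlinkR(out string) (map[remotePort]gpuLink, error) {
	topo := make(map[remotePort]gpuLink)
	gpu := -1 // no GPU header seen yet
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := sc.Text()
		if m := gpuRe.FindStringSubmatch(line); m != nil {
			gpu, _ = strconv.Atoi(m[1]) // digits guaranteed by the pattern
			continue
		}
		m := linkRe.FindStringSubmatch(line)
		if m == nil || gpu < 0 {
			continue
		}
		local, _ := strconv.Atoi(m[1])
		remote, _ := strconv.Atoi(m[3])
		topo[remotePort{BusID: m[2], Link: remote}] = gpuLink{GPUID: gpu, LocalLink: local}
	}
	return topo, sc.Err()
}

const sample = "GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-0b..bf)\n" +
	"\t Link 0: Remote Device 00000008:00:00.0: Link 29\n" +
	"\t Link 1: Remote Device 00000008:00:00.0: Link 28\n" +
	"GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-81..bcd)\n" +
	"\t Link 0: Remote Device 00000008:00:00.0: Link 32\n" +
	"\t Link 1: Remote Device 00000008:00:00.0: Link 33\n"

func main() {
	topo, err := parseNvlinkR(sample)
	if err != nil {
		panic(err)
	}
	end := topo[remotePort{BusID: "00000008:00:00.0", Link: 29}]
	fmt.Printf("GPU %d, local link %d\n", end.GPUID, end.LocalLink) // GPU 0, local link 0
}
```

In a real tool the input would come from running nvidia-smi as a subprocess, which is exactly the dependency this request is trying to remove.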

Potential solution

Add a new method to the Device interface:

type Device interface {
    // Existing methods
    GetNvLinkRemotePciInfo(int) (PciInfo, Return)
    GetNvLinkRemoteDeviceType(int) (IntNvLinkDeviceType, Return)
    GetNvLinkState(int) (EnableState, Return)
    
    // NEW: Get the link/port number on the remote device
    GetNvLinkRemoteLinkId(linkID int) (uint, Return)
}

Expected usage:

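device, _ := nvml.DeviceGetHandleByIndex(0)

// Assumed for this sketch: type GPULink struct{ GPUID, LocalLink int }
// and topology := map[string]map[uint]GPULink{}
for localLink := 0; localLink < nvml.NVLINK_MAX_LINKS; localLink++ {
    remotePCI, ret := device.GetNvLinkRemotePciInfo(localLink)
    if ret != nvml.SUCCESS {
        continue // link not present or not active
    }
    remoteLink, ret := device.GetNvLinkRemoteLinkId(localLink)  // ← NEW API
    if ret != nvml.SUCCESS {
        continue
    }

    // Build complete topology map, keyed by bus ID in the SXid log's format
    busID := fmt.Sprintf("%04X:%02X:%02X.0", remotePCI.Domain, remotePCI.Bus, remotePCI.Device)
    if topology[busID] == nil {
        topology[busID] = map[uint]GPULink{}
    }
    topology[busID][remoteLink] = GPULink{GPUID: 0, LocalLink: localLink}
}

// Query: Which GPU is affected by NVSwitch 0008:00:00.0 Link 29 error?
affectedGPU := topology["0008:00:00.0"][29]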
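Since the proposed method does not exist yet, the loop above cannot run as written. The sketch below shows the same idea across several GPUs against a deliberately tiny interface with a fake implementation, so it compiles without NVML or a GPU. Everything here (`linkQuerier`, `fakeGPU`, `buildTopology`, and the boolean-returning method shapes) is hypothetical and only models the proposal; it is not the go-nvml Device interface. The sample data comes from the nvidia-smi output quoted in this issue.

```go
package main

import "fmt"

// linkQuerier is a trimmed-down stand-in for the two Device calls the
// topology loop needs. RemoteLinkID models the proposed GetNvLinkRemoteLinkId.
type linkQuerier interface {
	RemoteBusID(link int) (string, bool)
	RemoteLinkID(link int) (uint, bool)
}

type remotePort struct {
	BusID string
	Link  uint
}

type gpuLink struct {
	GPUID     int
	LocalLink int
}

// fakeGPU maps local link -> remote port, standing in for real hardware.
type fakeGPU map[int]remotePort

func (g fakeGPU) RemoteBusID(link int) (string, bool) {
	r, ok := g[link]
	return r.BusID, ok
}

func (g fakeGPU) RemoteLinkID(link int) (uint, bool) {
	r, ok := g[link]
	return r.Link, ok
}

// buildTopology walks every local link of every GPU and records which
// remote port it lands on, giving topo[remoteBusID][remoteLink].
func buildTopology(gpus map[int]linkQuerier, maxLinks int) map[string]map[uint]gpuLink {
	topo := make(map[string]map[uint]gpuLink)
	for id, gpu := range gpus {
		for local := 0; local < maxLinks; local++ {
			busID, ok := gpu.RemoteBusID(local)
			if !ok {
				continue // link absent or inactive
			}
			remote, ok := gpu.RemoteLinkID(local)
			if !ok {
				continue
			}
			if topo[busID] == nil {
				topo[busID] = make(map[uint]gpuLink)
			}
			topo[busID][remote] = gpuLink{GPUID: id, LocalLink: local}
		}
	}
	return topo
}

// sampleGPUs mirrors links 0 and 1 of GPU 0 and GPU 5 from this issue.
func sampleGPUs() map[int]linkQuerier {
	return map[int]linkQuerier{
		0: fakeGPU{0: {"00000008:00:00.0", 29}, 1: {"00000008:00:00.0", 28}},
		5: fakeGPU{0: {"00000008:00:00.0", 32}, 1: {"00000008:00:00.0", 33}},
	}
}

func main() {
	topo := buildTopology(sampleGPUs(), 12)
	end := topo["00000008:00:00.0"][29]
	fmt.Printf("GPU %d, local link %d\n", end.GPUID, end.LocalLink) // GPU 0, local link 0
}
```

Programming against a small interface like this also keeps the topology code unit-testable on machines without NVLink hardware.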

Since nvidia-smi already reports this information, the underlying C library probably exposes it already, and most likely only the Go binding is missing.

Labels: enhancement
