Skip to content

Comments

Fix removal of GPUs from nvlink partitions#337

Merged
tmcroberts97 merged 2 commits intoNVIDIA:mainfrom
tmcroberts97:fix/partition-removal
Feb 23, 2026
Merged

Fix removal of GPUs from nvlink partitions#337
tmcroberts97 merged 2 commits intoNVIDIA:mainfrom
tmcroberts97:fix/partition-removal

Conversation

@tmcroberts97
Copy link
Contributor

Description

Refactor the nvlink partition monitor so that it iterates through the nvlink_info of every instance, instead of the instance gpu config. The gpu config array will not be populated when a user attempts to detach GPUs, which meant the monitor had nothing to iterate over, so didn't remove the GPUs from the partition (or take any action at all in fact).

Iterating over the nvlink_info array of GPUs fixes this, since they should always be present.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@tmcroberts97 tmcroberts97 requested a review from a team as a code owner February 19, 2026 17:16
Refactor the nvlink partition monitor so that it iterates through the
nvlink_info of every instance, instead of the instance gpu config. The
gpu config array will not be populated when a user attempts to detach
GPUs, which meant the monitor had nothing to iterate over, so didn't
remove the GPUs from the partition (or take any action at all in fact).

Iterating over the nvlink_info array of GPUs fixes this, since they
should always be present.

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
Copy link
Contributor

@kensimon kensimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some readability nits and some potential simplification, otherwise lgtm.

gpu_config.logical_partition_id;
if db_logical_partition_id.is_none() {
// How can this happen?
tracing::error!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case the GPU is in an inconsistent state, and if tenant applies nvlink_config, we will never be able to sync it.
Should we take a corrective action of removing the GPU from the partition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. I can make another MR with this change.

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
@tmcroberts97
Copy link
Contributor Author

@kensimon thanks for the review! All changes made.

@tmcroberts97 tmcroberts97 merged commit fe05143 into NVIDIA:main Feb 23, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants