feat(trainer): Add GPU passthrough support for container backend #219

Open
muzzlol wants to merge 1 commit into kubeflow:main from muzzlol:feat/container-gpu-passthrough

Conversation

@muzzlol muzzlol commented Dec 30, 2025

What this PR does / why we need it:

Fixes #159

Changes

  • BaseContainerClientAdapter: Added gpu_count parameter to create_and_start_container
  • DockerClientAdapter: Configures GPU access via device_requests
  • PodmanClientAdapter: Configures GPU access via devices parameter
  • ContainerBackend: Extracts GPU count from resources_per_node, handles macOS warning
  • Tests: Added 5 test cases covering GPU passthrough scenarios
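
The two adapter changes above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names `docker_device_requests` and `podman_cdi_devices` are hypothetical, the dict keys mirror the fields of `docker.types.DeviceRequest`, and the `nvidia.com/gpu=<index>` strings follow the CDI device naming used by the NVIDIA Container Toolkit. How the PR maps `gpu_count` to specific devices is an assumption here.

```python
from typing import Optional


def docker_device_requests(gpu_count: Optional[int]) -> list:
    """Build a device_requests-style payload for Docker (hypothetical helper).

    Each dict mirrors the fields of docker.types.DeviceRequest.
    """
    if not gpu_count:
        # No GPUs requested: pass nothing to the runtime.
        return []
    return [{"driver": "nvidia", "count": gpu_count, "capabilities": [["gpu"]]}]


def podman_cdi_devices(gpu_count: Optional[int]) -> list:
    """Build CDI device names for Podman's devices option (hypothetical helper).

    Assumption: the first gpu_count GPUs are selected by index.
    """
    if not gpu_count:
        return []
    return [f"nvidia.com/gpu={index}" for index in range(gpu_count)]


if __name__ == "__main__":
    print(docker_device_requests(2))
    print(podman_cdi_devices(2))
```

Keeping `gpu_count` optional and returning an empty list when it is unset matches the backward-compatibility note in the checklist below: callers that never pass a GPU count get the same container configuration as before.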

Checklist:

  • All tests pass (make test-python)
  • Linting passes (make verify)
  • Documentation added (docstrings for new parameters)
  • Tests cover new functionality
  • No breaking changes to public APIs (backward compatible - gpu_count is optional)
  • Tested locally with both runtimes (docker, podman) on NVIDIA GPU (Turing arch)

Things to mention in the docs regarding GPU compatibility in containers:

  • Docker: Install NVIDIA Container Toolkit (nvidia-ctk runtime configure --runtime=docker)
  • Podman: CDI (configured by default with NVIDIA Container Toolkit v1.12.0+)
  • Linux and Windows via WSL2 only: macOS users will see a warning that GPU passthrough is unavailable
  • RHEL/Fedora users with SELinux enforcing may need to run with --security-opt=label=disable if they encounter NVML: Insufficient Permissions errors.
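
A documentation page could pair these notes with a small preflight check. The sketch below is an illustration only and is not part of the PR: `missing_prerequisites` and `has_cdi_spec` are hypothetical names, and the check assumes that CDI specs live in `/etc/cdi` or `/var/run/cdi` and that the NVIDIA Container Toolkit puts `nvidia-ctk` on the PATH.

```python
import shutil
from pathlib import Path

# Assumption: the usual CDI spec locations.
CDI_SPEC_DIRS = ("/etc/cdi", "/var/run/cdi")


def has_cdi_spec(spec_dirs=CDI_SPEC_DIRS) -> bool:
    """Return True if any directory contains a .yaml or .json CDI spec file."""
    for directory in spec_dirs:
        path = Path(directory)
        if path.is_dir() and any(
            entry.suffix in {".yaml", ".json"} for entry in path.iterdir()
        ):
            return True
    return False


def missing_prerequisites(runtime: str, which=shutil.which, spec_dirs=CDI_SPEC_DIRS) -> list:
    """List missing GPU prerequisites for 'docker' or 'podman' (hypothetical helper)."""
    missing = []
    if which("nvidia-ctk") is None:
        missing.append("nvidia-ctk")
    if runtime == "podman" and not has_cdi_spec(spec_dirs):
        missing.append("cdi-spec")
    return missing


if __name__ == "__main__":
    for runtime in ("docker", "podman"):
        print(runtime, missing_prerequisites(runtime))
```

The `which` and `spec_dirs` parameters exist only so the check can be exercised without a GPU host.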

Relevant sources:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
https://podman-desktop.io/docs/podman/gpu

  - Add gpu_count parameter to BaseContainerClientAdapter interface
  - Implement NVIDIA GPU support in Docker adapter via device_requests
  - Implement NVIDIA GPU support in Podman adapter via CDI devices
  - Add macOS detection with warning (GPU passthrough unsupported)
  - Add unit tests for GPU passthrough functionality

Signed-off-by: muzzlol <muzxmmilkhxn@gmail.com>
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@muzzlol
Author

muzzlol commented Dec 30, 2025

@Fiona-Waters
I couldn’t find any existing SDK docs, so I included the relevant sources and references for maintainers to use or place as needed at the end of the PR desc.

Contributor

@Fiona-Waters Fiona-Waters left a comment

Thank you for this @muzzlol ! I have left one question, and also we definitely need to add docs. There is an issue related to where docs should go here. @kramaranya in the meantime should we create a follow on issue for these docs?

"To use GPUs, run on a Linux machine with NVIDIA drivers "
"and the NVIDIA Container Toolkit installed."
)
gpu_count = None # Don't attempt GPU passthrough on macOS
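
For readers without the full diff, the guard quoted above might look roughly like this when pulled out into a standalone function. This is a sketch under assumptions: the function name `resolve_gpu_count`, the use of `warnings.warn`, and the first sentence of the message are illustrative. Only the second sentence of the warning text and the `gpu_count = None` fallback come from the quoted snippet.

```python
import platform
import warnings
from typing import Optional


def resolve_gpu_count(gpu_count: Optional[int], system: Optional[str] = None) -> Optional[int]:
    """Return the GPU count to pass to the container runtime (hypothetical helper).

    On macOS the request is dropped with a warning, because GPU passthrough
    to containers is not available there.
    """
    system = system or platform.system()
    if gpu_count and system == "Darwin":
        warnings.warn(
            "GPU passthrough is unavailable on macOS. "
            "To use GPUs, run on a Linux machine with NVIDIA drivers "
            "and the NVIDIA Container Toolkit installed.",
            stacklevel=2,
        )
        return None  # Don't attempt GPU passthrough on macOS
    return gpu_count


if __name__ == "__main__":
    print(resolve_gpu_count(2, system="Linux"))
    print(resolve_gpu_count(2, system="Darwin"))
```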

Contributor

@muzzlol Did you look into PyTorch MPS for macOS?

Author

Yep. GPU passthrough for containers is still not supported on macOS (apple/container#62) so PyTorch MPS is a no-go unfortunately.
There is a discussion around utilizing PyTorch's Vulkan backend here, but afaik that backend supports mobile/inference-only, not training.

Member

Looking at the comment, ML training is also supported, isn't it?
Have you tried to build PyTorch with Vulkan backend as suggested here: https://docs.pytorch.org/tutorials/unstable/vulkan_workflow.html#building-pytorch-with-vulkan-backend ?

Author

Looking at the comment, ML training is also supported, isn't it?

The comment seems to be a bit misleading. check

Have you tried to build PyTorch with Vulkan backend as suggested here:
https://docs.pytorch.org/tutorials/unstable/vulkan_workflow.html#building-pytorch-with-vulkan-backend ?

No, I have not. I currently do not have access to my Mac, but I'm happy to try it out tomorrow if required.

Member

I think it would be nice to try this out and see whether simple PyTorch training examples (e.g. MNIST) work.

Author

I think it would be nice to try this out and see whether simple PyTorch training examples (e.g. MNIST) work.

I tested it, and the Vulkan backend does not have autograd support. Shall I proceed with #219 (comment)?

@kramaranya
Contributor

/ok-to-test

@coveralls

Pull Request Test Coverage Report for Build 20597706575

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 29 of 38 (76.32%) changed or added relevant lines in 4 files are covered.
  • 2 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.09%) to 66.457%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| kubeflow/trainer/backends/container/backend_test.py | 22 | 24 | 91.67% |
| kubeflow/trainer/backends/container/adapters/podman.py | 0 | 3 | 0.0% |
| kubeflow/trainer/backends/container/adapters/docker.py | 0 | 4 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| kubeflow/trainer/backends/container/adapters/docker.py | 1 | 21.62% |
| kubeflow/trainer/backends/container/adapters/podman.py | 1 | 16.67% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 20580259246: | 0.09% |
| Covered Lines: | 2540 |
| Relevant Lines: | 3822 |

💛 - Coveralls

@kramaranya
Contributor

Thank you for this @muzzlol ! I have left one question, and also we definitely need to add docs. There is an issue related to where docs should go here. @kramaranya in the meantime should we create a follow on issue for these docs?

@muzzlol you can open PRs to update the website and trainer examples

@muzzlol
Author

muzzlol commented Jan 5, 2026

Thank you for this @muzzlol ! I have left one question, and also we definitely need to add docs. There is an issue related to where docs should go here. @kramaranya in the meantime should we create a follow on issue for these docs?

@muzzlol you can open PRs to update the website and trainer examples

👍 on it.

muzzlol added a commit to muzzlol/trainer that referenced this pull request Jan 6, 2026
Add runtime NVIDIA GPU detection to automatically enable GPU passthrough when
available.

Relates to kubeflow/sdk#219

Signed-off-by: muzzlol <muzxmmilkhxn@gmail.com>
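
The referenced commit's detection logic is not shown in this conversation. One plausible approach, sketched here purely as an assumption, is to count the devices reported by `nvidia-smi -L`, which prints one `GPU <index>: ...` line per device. The function names are hypothetical.

```python
import shutil
import subprocess


def count_gpus_from_listing(listing: str) -> int:
    """Count devices in `nvidia-smi -L` output (one 'GPU N: ...' line per device)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def detect_nvidia_gpu_count() -> int:
    """Return the number of NVIDIA GPUs visible on the host, or 0 if undetectable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # Driver missing, command failed, or timed out: treat as no GPUs.
        return 0
    return count_gpus_from_listing(result.stdout)


if __name__ == "__main__":
    print(detect_nvidia_gpu_count())
```

Returning 0 on any failure lets a caller fall back to CPU-only containers instead of raising an error on hosts without NVIDIA drivers.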

Development

Successfully merging this pull request may close these issues.

Enable GPU Support in the Kubeflow SDK Container Backend

5 participants