feat(trainer): Add GPU passthrough support for container backend by muzzlol · Pull Request #219 · kubeflow/sdk

muzzlol · 2025-12-30T13:33:02Z

What this PR does / why we need it:

Fixes #159

Changes

BaseContainerClientAdapter: Added gpu_count parameter to create_and_start_container
DockerClientAdapter: Configures GPU access via device_requests
PodmanClientAdapter: Configures GPU access via devices parameter
ContainerBackend: Extracts GPU count from resources_per_node, handles macOS warning
Tests: Added 5 test cases covering GPU passthrough scenarios
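
The adapter code itself is not shown in this description, so the following is only a minimal sketch of how a GPU count could be mapped onto each runtime's arguments. The helper names and the resource keys (`nvidia.com/gpu`, `gpu`) are hypothetical, and the per-index CDI mapping for Podman is an assumption. The Docker payload mirrors the Docker Engine API's DeviceRequest fields, and the Podman entries use the CDI device naming from the NVIDIA Container Toolkit.

```python
from typing import Any, Optional


def extract_gpu_count(resources_per_node: Optional[dict[str, Any]]) -> Optional[int]:
    """Return the requested GPU count, or None when no GPU is requested.

    The key names checked here are assumptions made for illustration.
    """
    if not resources_per_node:
        return None
    for key in ("nvidia.com/gpu", "gpu"):
        if key in resources_per_node:
            count = int(resources_per_node[key])
            return count if count > 0 else None
    return None


def docker_gpu_kwargs(gpu_count: Optional[int]) -> dict[str, Any]:
    """Build a device_requests entry shaped like a Docker DeviceRequest."""
    if gpu_count is None:
        return {}
    request = {"Driver": "nvidia", "Count": gpu_count, "Capabilities": [["gpu"]]}
    return {"device_requests": [request]}


def podman_gpu_kwargs(gpu_count: Optional[int]) -> dict[str, Any]:
    """Build CDI device names, one per GPU index (an assumed mapping)."""
    if gpu_count is None:
        return {}
    return {"devices": [f"nvidia.com/gpu={i}" for i in range(gpu_count)]}
```

Returning an empty dict when `gpu_count` is `None` keeps the parameter optional, which matches the backward-compatibility note in the checklist.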

Checklist:

All tests pass (make test-python)
Linting passes (make verify)
Documentation added (docstrings for new parameters)
Tests cover new functionality
No breaking changes to public APIs (backward compatible - gpu_count is optional)
Tested locally with both runtimes (docker, podman) on NVIDIA GPU (Turing arch)

Things to mention in docs relating to GPU compatibility in containers:

Docker: Install NVIDIA Container Toolkit (nvidia-ctk runtime configure --runtime=docker)
Podman: CDI (configured by default with NVIDIA Container Toolkit v1.12.0+)
Linux and Windows (via WSL2) only: macOS users will see a warning that GPU passthrough is unavailable
RHEL/Fedora users with SELinux enforcing may need to run with --security-opt=label=disable if they encounter NVML: Insufficient Permissions errors.
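
To check the prerequisites above before using the SDK, one option is a quick `nvidia-smi` smoke test in a container. The sketch below only builds the command line and does not run it. The helper name and the default `ubuntu` image are assumptions. The `--gpus all` flag (Docker) and the `nvidia.com/gpu=all` CDI device (Podman) follow the NVIDIA Container Toolkit documentation linked under the sources.

```python
def gpu_smoke_test_argv(
    runtime: str,
    image: str = "ubuntu",  # assumed image; any image works once the toolkit injects nvidia-smi
    selinux_workaround: bool = False,
) -> list[str]:
    """Return an argv list that lists the GPUs visible inside a container."""
    if runtime == "docker":
        argv = ["docker", "run", "--rm", "--gpus", "all"]
    elif runtime == "podman":
        argv = ["podman", "run", "--rm", "--device", "nvidia.com/gpu=all"]
    else:
        raise ValueError(f"unsupported runtime: {runtime!r}")
    if selinux_workaround:
        # Only needed on SELinux-enforcing hosts that hit NVML permission errors.
        argv.append("--security-opt=label=disable")
    return argv + [image, "nvidia-smi", "-L"]
```

The resulting list can be passed to `subprocess.run` on a machine that has the runtime installed.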

Relevant sources:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
https://podman-desktop.io/docs/podman/gpu

- Add gpu_count parameter to BaseContainerClientAdapter interface
- Implement NVIDIA GPU support in Docker adapter via device_requests
- Implement NVIDIA GPU support in Podman adapter via CDI devices
- Add macOS detection with warning (GPU passthrough unsupported)
- Add unit tests for GPU passthrough functionality

Signed-off-by: muzzlol <muzxmmilkhxn@gmail.com>

google-oss-prow · 2025-12-30T13:33:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2025-12-30T13:33:13Z

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Slack: Join our #kubeflow-ml-experience and #kubeflow-trainer Slack channels
Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly meetings

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

muzzlol · 2025-12-30T13:38:31Z

@Fiona-Waters
I couldn’t find any existing SDK docs, so I included the relevant sources and references for maintainers to use or place as needed at the end of the PR desc.

Fiona-Waters

Thank you for this @muzzlol ! I have left one question, and also we definitely need to add docs. There is an issue related to where docs should go here. @kramaranya in the meantime should we create a follow on issue for these docs?

Fiona-Waters · 2026-01-05T15:47:53Z

kubeflow/trainer/backends/container/backend.py

+                            "To use GPUs, run on a Linux machine with NVIDIA drivers "
+                            "and the NVIDIA Container Toolkit installed."
+                        )
+                        gpu_count = None  # Don't attempt GPU passthrough on macOS


@muzzlol Did you look into PyTorch MPS for macOS?

muzzlol · Jan 5, 2026

Yep. GPU passthrough for containers is still not supported on macOS (apple/container#62), so PyTorch MPS is a no-go, unfortunately.
There is a discussion around utilizing PyTorch's Vulkan backend here, but afaik that backend supports mobile/inference only, not training.

andreyvelich · Jan 5, 2026

Looking at the comment, ML training is also supported, isn't it?
Have you tried to build PyTorch with the Vulkan backend as suggested here: https://docs.pytorch.org/tutorials/unstable/vulkan_workflow.html#building-pytorch-with-vulkan-backend ?

muzzlol · Jan 5, 2026

> Looking at the comment, ML training is also supported, isn't it?

The comment seems to be a bit misleading. check

> Have you tried to build PyTorch with the Vulkan backend as suggested here:
> https://docs.pytorch.org/tutorials/unstable/vulkan_workflow.html#building-pytorch-with-vulkan-backend ?

No, I have not. I currently do not have access to my Mac, but I'm happy to try it out tomorrow if required.

andreyvelich · Jan 5, 2026

I think it would be nice to try this out and see whether simple PyTorch training examples (e.g. MNIST) work.

muzzlol · Jan 6, 2026

> I think it would be nice to try this out and see whether simple PyTorch training examples (e.g. MNIST) work.

I tested, and the Vulkan backend does not have autograd support. Shall I proceed with #219 (comment)?
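
For reference, the macOS guard in the hunk under review could be sketched roughly as follows. Only the second sentence of the warning and the `gpu_count = None` reset come from the diff. The function name, the `system` parameter (added so the check can be exercised off macOS), and the first sentence of the message are assumptions.

```python
import logging
import platform
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_gpu_count(gpu_count: Optional[int], system: Optional[str] = None) -> Optional[int]:
    """Drop the GPU request on macOS, where container GPU passthrough is unavailable."""
    system = system or platform.system()
    if gpu_count and system == "Darwin":
        logger.warning(
            "GPU passthrough is not available for containers on macOS. "  # assumed wording
            "To use GPUs, run on a Linux machine with NVIDIA drivers "
            "and the NVIDIA Container Toolkit installed."
        )
        return None  # Don't attempt GPU passthrough on macOS
    return gpu_count
```

On Linux the requested count passes through unchanged, so the guard only affects macOS hosts that actually asked for GPUs.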

kramaranya · 2026-01-05T16:04:02Z

/ok-to-test

coveralls · 2026-01-05T16:06:52Z

Pull Request Test Coverage Report for Build 20597706575

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

29 of 38 (76.32%) changed or added relevant lines in 4 files are covered.
2 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.09%) to 66.457%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
kubeflow/trainer/backends/container/backend_test.py	22	24	91.67%
kubeflow/trainer/backends/container/adapters/podman.py	0	3	0.0%
kubeflow/trainer/backends/container/adapters/docker.py	0	4	0.0%

Files with Coverage Reduction	New Missed Lines	%
kubeflow/trainer/backends/container/adapters/docker.py	1	21.62%
kubeflow/trainer/backends/container/adapters/podman.py	1	16.67%

Totals
Change from base Build 20580259246:	0.09%
Covered Lines:	2540
Relevant Lines:	3822

💛 - Coveralls

kramaranya · 2026-01-05T16:08:18Z

> Thank you for this @muzzlol ! I have left one question, and also we definitely need to add docs. There is an issue related to where docs should go here. @kramaranya in the meantime should we create a follow on issue for these docs?

@muzzlol you can open PRs to update the website and trainer examples

muzzlol · 2026-01-05T18:20:30Z

> > Thank you for this @muzzlol ! I have left one question, and also we definitely need to add docs. There is an issue related to where docs should go here. @kramaranya in the meantime should we create a follow on issue for these docs?
>
> @muzzlol you can open PRs to update the website and trainer examples

👍 on it.

Add runtime NVIDIA GPU detection to automatically enable GPU passthrough when available.

Relates to kubeflow/sdk#219

Signed-off-by: muzzlol <muzxmmilkhxn@gmail.com>
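
The commit above does not include its code here, so the following is only one possible sketch of runtime NVIDIA GPU detection: it counts the `GPU N: ...` lines printed by `nvidia-smi -L`. The function name and the injectable `which`/`run` parameters are hypothetical and exist so the logic can be tested on a machine without a GPU.

```python
import shutil
import subprocess
from typing import Callable, Optional


def detect_nvidia_gpu_count(
    which: Callable[[str], Optional[str]] = shutil.which,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Return the number of NVIDIA GPUs reported by nvidia-smi, or 0 if none are found."""
    if which("nvidia-smi") is None:
        return 0  # Driver tools are not installed or not on PATH.
    try:
        result = run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return 0
    if result.returncode != 0:
        return 0
    # Each physical GPU is listed on a line that starts with "GPU <index>:".
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))
```

A caller could request GPU passthrough only when this returns a positive count and otherwise fall back to CPU-only execution.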

google-oss-prow bot requested review from andreyvelich, astefanutti and szaher December 30, 2025 13:33

google-oss-prow bot added the size/L label Dec 30, 2025

Fiona-Waters reviewed Jan 5, 2026

google-oss-prow bot added the ok-to-test label Jan 5, 2026

This was referenced Jan 6, 2026

chore(examples): add GPU passthrough support to container backend example kubeflow/trainer#3075

Open

trainer: Add GPU passthrough documentation for local execution mode kubeflow/website#4275

Open

Fiona-Waters Jan 5, 2026
muzzlol Jan 5, 2026
andreyvelich Jan 5, 2026
muzzlol Jan 5, 2026
andreyvelich Jan 5, 2026
muzzlol Jan 6, 2026

5 participants