feat(trainer): Add GPU passthrough support for container backend #219
muzzlol wants to merge 1 commit into kubeflow:main
Conversation
- Add gpu_count parameter to BaseContainerClientAdapter interface
- Implement NVIDIA GPU support in Docker adapter via device_requests
- Implement NVIDIA GPU support in Podman adapter via CDI devices
- Add macOS detection with warning (GPU passthrough unsupported)
- Add unit tests for GPU passthrough functionality

Signed-off-by: muzzlol <muzxmmilkhxn@gmail.com>
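As a rough illustration of the two mechanisms named in the commit message, the sketch below maps a GPU count to a Docker-style device request and to Podman-style CDI device names. The function names are hypothetical and are not taken from the PR. The dictionary keys follow the Docker Engine API `DeviceRequests` fields, and the `nvidia.com/gpu=<index>` names follow the NVIDIA Container Toolkit CDI convention. How the real adapters pass these values to their clients is an assumption left out here.

```python
def docker_device_requests(gpu_count):
    """Build Docker Engine API style device requests for NVIDIA GPUs.

    Hypothetical helper: returns an empty list when no GPUs are requested.
    """
    if not gpu_count:
        return []
    # One request asking the "nvidia" driver for `gpu_count` GPU devices.
    return [{"Driver": "nvidia", "Count": gpu_count, "Capabilities": [["gpu"]]}]


def podman_cdi_devices(gpu_count):
    """Build CDI device names (nvidia.com/gpu=<index>) for Podman.

    Hypothetical helper: assumes GPUs are selected by index starting at 0.
    """
    if not gpu_count:
        return []
    return [f"nvidia.com/gpu={index}" for index in range(gpu_count)]


if __name__ == "__main__":
    print(docker_device_requests(2))
    print(podman_cdi_devices(2))
```

Keeping both helpers pure (no client calls) makes the mapping easy to unit test without a container runtime or GPU present.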
@Fiona-Waters
Fiona-Waters left a comment:
Thank you for this, @muzzlol! I have left one question, and we definitely need to add docs. There is an issue related to where docs should go here. @kramaranya, in the meantime, should we create a follow-up issue for these docs?
        "To use GPUs, run on a Linux machine with NVIDIA drivers "
        "and the NVIDIA Container Toolkit installed."
    )
    gpu_count = None  # Don't attempt GPU passthrough on macOS
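For readers without the full diff, here is a minimal, self-contained sketch of the kind of check the snippet above belongs to. The function name `resolve_gpu_count` and its `system` parameter are hypothetical, and the first sentence of the warning is invented for illustration. Only the "To use GPUs..." sentence and the `gpu_count = None` fallback come from the diff.

```python
import platform
import warnings


def resolve_gpu_count(gpu_count, system=None):
    """Return the GPU count to pass to the container, or None on macOS.

    `system` defaults to platform.system(); it is a parameter only so the
    behavior can be exercised on any host (an assumption of this sketch).
    """
    system = system or platform.system()
    if gpu_count and system == "Darwin":
        warnings.warn(
            "GPU passthrough is not supported for containers on macOS. "
            "To use GPUs, run on a Linux machine with NVIDIA drivers "
            "and the NVIDIA Container Toolkit installed.",
            stacklevel=2,
        )
        return None  # Don't attempt GPU passthrough on macOS
    return gpu_count


if __name__ == "__main__":
    print(resolve_gpu_count(1))
```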
@muzzlol Did you look into PyTorch MPS for macOS?
Yep. GPU passthrough for containers is still not supported on macOS (apple/container#62) so PyTorch MPS is a no-go unfortunately.
There is a discussion around utilizing PyTorch's Vulkan backend here, but afaik that backend only supports mobile inference, not training.
Looking at the comment, ML training is also supported, isn't it?
Have you tried to build PyTorch with Vulkan backend as suggested here: https://docs.pytorch.org/tutorials/unstable/vulkan_workflow.html#building-pytorch-with-vulkan-backend ?
> Looking at the comment, ML training is also supported, isn't it?

The comment seems to be a bit misleading. check

> Have you tried to build PyTorch with Vulkan backend as suggested here:
> https://docs.pytorch.org/tutorials/unstable/vulkan_workflow.html#building-pytorch-with-vulkan-backend ?

No, I have not. I currently do not have access to my Mac, but I'm happy to try it out tomorrow if required.
I think it would be nice to try this out and see whether simple PyTorch training examples (e.g. MNIST) work.
> I think it would be nice to try this out and see whether simple PyTorch training examples (e.g. MNIST) work.

I tested, and the Vulkan backend does not have autograd support. Shall I proceed with #219 (comment)?
/ok-to-test
Pull Request Test Coverage Report for Build 20597706575
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
@muzzlol you can open PRs to update the website and trainer examples
👍 on it.
Add runtime NVIDIA GPU detection to automatically enable GPU passthrough when available. Relates to kubeflow/sdk#219 Signed-off-by: muzzlol <muzxmmilkhxn@gmail.com>
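One plausible way to do this kind of runtime detection is to look for `nvidia-smi` on the PATH and count the devices listed by `nvidia-smi -L`. The sketch below is an assumption about the approach, not the code from the referenced commit, and both function names are hypothetical.

```python
import shutil
import subprocess


def count_gpus_in_listing(listing):
    """Count GPU entries in `nvidia-smi -L` style output.

    Each physical GPU is listed on a line beginning with "GPU <index>:".
    """
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def detect_nvidia_gpu_count():
    """Return the number of NVIDIA GPUs visible on this host, or 0 if none.

    Falls back to 0 when nvidia-smi is missing, fails, or times out.
    """
    if shutil.which("nvidia-smi") is None:
        return 0
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return 0
    return count_gpus_in_listing(result.stdout)


if __name__ == "__main__":
    print(f"Detected NVIDIA GPUs: {detect_nvidia_gpu_count()}")
```

A caller could enable GPU passthrough only when the detected count is greater than zero, so the same example still runs on CPU-only machines.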
What this PR does / why we need it:
Fixes #159
Changes
- Add gpu_count parameter to create_and_start_container
- Docker adapter: device_requests
- Podman adapter: devices parameter (CDI)
- resources_per_node, handles macOS warning

Checklist:
Things to mention in docs relating to GPU compatibility in containers:
Relevant sources:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
https://podman-desktop.io/docs/podman/gpu
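
The Changes list above mentions `resources_per_node` alongside the macOS warning. A minimal sketch of how a GPU count might be read from such a mapping follows. The function name and the accepted keys (`"nvidia.com/gpu"` and `"gpu"`) are assumptions for illustration. The PR's actual parsing is not shown in this conversation.

```python
def gpu_count_from_resources(resources_per_node):
    """Extract a positive GPU count from a resources mapping, else None.

    Hypothetical helper: the key names checked here are assumptions.
    """
    if not resources_per_node:
        return None
    for key in ("nvidia.com/gpu", "gpu"):
        if key in resources_per_node:
            count = int(resources_per_node[key])
            # Treat zero or negative values as "no GPUs requested".
            return count if count > 0 else None
    return None


if __name__ == "__main__":
    print(gpu_count_from_resources({"cpu": 4, "gpu": 2}))
```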