
Commit 39b751d

Update on "[slimtensor] Add CUDA Storage with DeviceTraits and memory allocation"
This diff adds CUDA storage infrastructure to SlimTensor, enabling GPU memory allocation and management.

**Key changes:**

1. **`cuda/Guard.h`** - CUDAGuard RAII class:
   - Saves the current CUDA device on construction, restores it on destruction
   - Exception-safe device context switching
   - Constructors accept a device index or a Device object
2. **`core/Storage.h`** - Extended for CUDA support:
   - Added `DeviceTraits<DeviceType::CUDA>` specialization with:
     - `allocate()` - Uses cudaMalloc with CUDAGuard for device selection
     - `free()` - Uses cudaFree, with a warning on error
     - `memcpy()` - Supports Host↔Device and Device↔Device copies
   - Added `DEFAULT_CUDA_DEVICE` constant
   - Updated the `MaybeOwningStorage` constructor to handle CUDA devices
   - Stub implementation when `CUDA_AVAILABLE` is not defined (throws an error)

Differential Revision: [D91202899](https://our.internmc.facebook.com/intern/diff/D91202899/)

[ghstack-poisoned]
2 parents 27ea3ee + 4327c43 commit 39b751d
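
For orientation, here is a rough C++ sketch of the RAII guard and allocation flow described in the commit message. It is not the actual SlimTensor code: apart from the names mentioned above (`CUDAGuard`, `allocate()`, `free()`, `memcpy()`, `DeviceTraits`), all class/struct names, signatures, and error handling are illustrative assumptions.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>

// Illustrative sketch of the CUDAGuard RAII pattern: save the current
// device on construction, restore it on destruction.
class CUDAGuard {
 public:
  explicit CUDAGuard(int device_index) {
    cudaGetDevice(&prev_device_);
    if (device_index != prev_device_) {
      cudaSetDevice(device_index);
    }
  }
  ~CUDAGuard() {
    cudaSetDevice(prev_device_);  // restore even if an exception unwinds the stack
  }
  CUDAGuard(const CUDAGuard&) = delete;
  CUDAGuard& operator=(const CUDAGuard&) = delete;

 private:
  int prev_device_ = 0;
};

// Hypothetical stand-in for the DeviceTraits<DeviceType::CUDA> specialization.
struct CudaDeviceTraitsSketch {
  static void* allocate(std::size_t nbytes, int device_index) {
    CUDAGuard guard(device_index);  // cudaMalloc allocates on the current device
    void* ptr = nullptr;
    if (cudaMalloc(&ptr, nbytes) != cudaSuccess) {
      throw std::runtime_error("cudaMalloc failed");
    }
    return ptr;
  }

  static void free(void* ptr) {
    if (ptr != nullptr && cudaFree(ptr) != cudaSuccess) {
      // Per the summary, the real implementation only warns here, since
      // freeing can happen during destruction where throwing is unsafe.
    }
  }

  static void memcpy(void* dst, const void* src, std::size_t nbytes) {
    // cudaMemcpyDefault lets the driver infer Host<->Device vs Device<->Device.
    cudaMemcpy(dst, src, nbytes, cudaMemcpyDefault);
  }
};
```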

File tree

12 files changed: +260 −27 lines

backends/qualcomm/README.md

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ Please check `generate_qnn_executorch_compiler_spec()` in
 - Snapdragon 8 Elite Gen 5
 - SA8295
 - SA8255
+- SA8797 (also used by SA8397)
 - SSG2115P
 - SSG2125P
 - SXR1230P

backends/qualcomm/serialization/qc_compiler_spec.fbs

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ table HtpInfo {
 enum QcomChipset: int {
   UNKNOWN_SM = 0,
   SA8295 = 39,
+  SA8797 = 72,
   SM8350 = 30,
   SM8450 = 36,
   SM8475 = 42,

backends/qualcomm/serialization/qc_schema.py

Lines changed: 2 additions & 0 deletions
@@ -40,6 +40,7 @@ class HtpInfo:
 class QcomChipset(IntEnum):
     UNKNOWN_SM = 0
     SA8295 = 39  # v68
+    SA8797 = 72  # v81
     SM8350 = 30  # v68
     SM8450 = 36  # v69
     SM8475 = 42  # v69
@@ -68,6 +69,7 @@ class SocInfo:

 _soc_info_table = {
     QcomChipset.SA8295: SocInfo(QcomChipset.SA8295, HtpInfo(HtpArch.V68, 8)),
+    QcomChipset.SA8797: SocInfo(QcomChipset.SA8797, HtpInfo(HtpArch.V81, 16)),
     QcomChipset.SM8350: SocInfo(QcomChipset.SM8350, HtpInfo(HtpArch.V68, 4)),
     QcomChipset.SM8450: SocInfo(QcomChipset.SM8450, HtpInfo(HtpArch.V69, 8)),
     QcomChipset.SM8475: SocInfo(QcomChipset.SM8475, HtpInfo(HtpArch.V69, 8)),

backends/qualcomm/utils/utils.py

Lines changed: 2 additions & 0 deletions
@@ -1144,6 +1144,7 @@ def generate_qnn_executorch_compiler_spec(
 def get_soc_to_arch_map():
     return {
         "SA8295": HtpArch.V68,
+        "SA8797": HtpArch.V81,
         "SM8350": HtpArch.V68,
         "SM8450": HtpArch.V69,
         "SM8475": HtpArch.V69,
@@ -1168,6 +1169,7 @@ def get_soc_to_arch_map():
 def get_soc_to_chipset_map():
     return {
         "SA8295": QcomChipset.SA8295,
+        "SA8797": QcomChipset.SA8797,
         "SM8350": QcomChipset.SM8350,
         "SM8450": QcomChipset.SM8450,
         "SM8475": QcomChipset.SM8475,

docs/source/backends-overview.md

Lines changed: 8 additions & 6 deletions
@@ -18,12 +18,13 @@ Backends are the bridge between your exported model and the hardware it runs on.

 ## Choosing a Backend

-| Backend                                                       | Platform(s) | Hardware Type | Typical Use Case                |
-|---------------------------------------------------------------|-------------|---------------|---------------------------------|
-| [XNNPACK](backends/xnnpack/xnnpack-overview.md)               | All         | CPU           | General-purpose, fallback       |
-| [Core ML](/backends/coreml/coreml-overview.md)                | iOS, macOS  | NPU/GPU/CPU   | Apple devices, high performance |
-| [Metal Performance Shaders](/backends/mps/mps-overview.md)    | iOS, macOS  | GPU           | Apple GPU acceleration          |
-| [Vulkan ](/backends/vulkan/vulkan-overview.md)                | Android     | GPU           | Android GPU acceleration        |
+| Backend                                                       | Platform(s)   | Hardware Type | Typical Use Case                |
+|---------------------------------------------------------------|---------------|---------------|---------------------------------|
+| [XNNPACK](backends/xnnpack/xnnpack-overview.md)               | All           | CPU           | General-purpose, fallback       |
+| [CUDA](/backends/cuda/cuda-overview.md)                       | Linux/Windows | GPU           | NVIDIA GPU acceleration         |
+| [Core ML](/backends/coreml/coreml-overview.md)                | iOS, macOS    | NPU/GPU/CPU   | Apple devices, high performance |
+| [Metal Performance Shaders](/backends/mps/mps-overview.md)    | iOS, macOS    | GPU           | Apple GPU acceleration          |
+| [Vulkan ](/backends/vulkan/vulkan-overview.md)                | Android       | GPU           | Android GPU acceleration        |
 | [Qualcomm](backends-qualcomm)                                 | Android       | NPU           | Qualcomm SoCs                   |
 | [MediaTek](backends-mediatek)                                 | Android       | NPU           | MediaTek SoCs                   |
 | [Arm Ethos-U](/backends/arm-ethos-u/arm-ethos-u-overview.md)  | Embedded      | NPU           | Arm MCUs                        |
@@ -51,6 +52,7 @@ Backends are the bridge between your exported model and the hardware it runs on.
 :caption: Backend Overview

 backends-xnnpack
+backends/cuda/cuda-overview
 backends/coreml/coreml-overview
 backends-mps
 backends-vulkan

docs/source/backends-qualcomm.md

Lines changed: 1 addition & 0 deletions
@@ -61,6 +61,7 @@ For more details and troubleshooting, refer to the official Microsoft WSL instal
 ### Hardware:
 You will need an Android / Linux device with adb-connected running on one of below Qualcomm SoCs:
 - SA8295
+- SA8797 (also used by SA8397)
 - SM8450 (Snapdragon 8 Gen 1)
 - SM8475 (Snapdragon 8 Gen 1+)
 - SM8550 (Snapdragon 8 Gen 2)

docs/source/backends/cuda/cuda-overview.md

Lines changed: 122 additions & 0 deletions (new file)

# CUDA Backend

The CUDA backend is the ExecuTorch solution for running models on NVIDIA GPUs. It leverages the [AOTInductor](https://pytorch.org/docs/stable/torch.compiler_aot_inductor.html) compiler to generate optimized CUDA kernels with libtorch-free execution, and uses [Triton](https://triton-lang.org/) for high-performance GPU kernel generation.

## Features

- **Optimized GPU Execution**: Uses AOTInductor to generate highly optimized CUDA kernels for model operators
- **Triton Kernel Support**: Leverages Triton for GEMM (General Matrix Multiply), convolution, and SDPA (Scaled Dot-Product Attention) kernels
- **Quantization Support**: INT4 weight quantization with tile-packed format for improved performance and reduced memory footprint
- **Cross-Platform**: Supports both Linux and Windows platforms
- **Multiple Model Support**: Works with various models including LLMs, vision-language models, and audio models

## Target Requirements

Below are the requirements for running a CUDA-delegated ExecuTorch model:

- **Hardware**: NVIDIA GPU with CUDA compute capability
- **CUDA Toolkit**: CUDA 11.x or later (CUDA 12.x recommended)
- **Operating System**: Linux or Windows
- **Drivers**: PyTorch-compatible NVIDIA GPU drivers installed

## Development Requirements

To develop and export models using the CUDA backend:

- **Python**: Python 3.8+
- **PyTorch**: PyTorch with CUDA support
- **ExecuTorch**: Install ExecuTorch with CUDA backend support

## Using the CUDA Backend

### Exporting Models with Python API

The CUDA backend uses the `CudaBackend` and `CudaPartitioner` classes to export models. Here is a complete example:

```python
import torch
from executorch.backends.cuda.cuda_backend import CudaBackend
from executorch.backends.cuda.cuda_partitioner import CudaPartitioner
from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower
from executorch.extension.export_util.utils import save_pte_program
from torch._inductor.decomposition import conv1d_to_conv2d

# Configure edge compilation
edge_compile_config = EdgeCompileConfig(
    _check_ir_validity=False,
    _skip_dim_order=True,
)

# Define your model
model = YourModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
model_name = "my_model"  # placeholder name used to label the exported program

# Export the model using torch.export
exported_program = torch.export.export(model, example_inputs)

# Create the CUDA partitioner
partitioner = CudaPartitioner(
    [CudaBackend.generate_method_name_compile_spec(model_name)]
)

# Add decompositions for Triton to generate kernels
exported_program = exported_program.run_decompositions({
    torch.ops.aten.conv1d.default: conv1d_to_conv2d,
})

# Lower to ExecuTorch with CUDA backend
et_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[partitioner],
    compile_config=edge_compile_config,
)

# Convert to executable program and save
exec_program = et_program.to_executorch()
save_pte_program(exec_program, model_name, "./output_dir")
```

This generates `.pte` and `.ptd` files that can be executed on CUDA devices.

For a complete working example, see the [CUDA export script](https://github.com/pytorch/executorch/blob/main/examples/cuda/scripts/export.py).

----

## Runtime Integration

To run the model on device, use the standard ExecuTorch runtime APIs. See [Running on Device](getting-started.md#running-on-device) for more information.

When building from source, pass `-DEXECUTORCH_BUILD_CUDA=ON` when configuring the CMake build to compile the CUDA backend.

```
# CMakeLists.txt
add_subdirectory("executorch")
...
target_link_libraries(
    my_target
    PRIVATE executorch
            extension_module_static
            extension_tensor
            aoti_cuda_backend)
```

No additional steps are necessary to use the backend beyond linking the target. CUDA-delegated `.pte` and `.ptd` files will automatically run on the registered backend.
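
As orientation only, below is a minimal, hedged C++ sketch of the standard `Module` flow referenced above. The file name, input shape, and zero-filled input data are placeholders, and the sketch mirrors the generic getting-started flow rather than anything CUDA-specific; consult the example runners for how the companion `.ptd` weights file is handled.

```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

#include <vector>

using executorch::extension::Module;
using executorch::extension::make_tensor_ptr;

int main() {
  // Load the exported program; "my_model.pte" is a placeholder path.
  Module module("my_model.pte");

  // Build an example input matching the exported signature (zero-filled here).
  auto input = make_tensor_ptr({1, 3, 224, 224},
                               std::vector<float>(1 * 3 * 224 * 224, 0.0f));

  // Run the default "forward" method; outputs come back as EValues.
  const auto result = module.forward(input);
  if (result.ok()) {
    const auto output = result->at(0).toTensor();
    (void)output;  // consume the first output tensor
  }
  return 0;
}
```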

----

## Examples

For complete end-to-end examples of exporting and running models with the CUDA backend, see:

- [Whisper](https://github.com/pytorch/executorch/blob/main/examples/models/whisper/README.md) — Audio transcription model with CUDA support
- [Voxtral](https://github.com/pytorch/executorch/blob/main/examples/models/voxtral/README.md) — Audio multimodal model with CUDA support
- [Gemma3](https://github.com/pytorch/executorch/blob/main/examples/models/gemma3/README.md) — Vision-language model with CUDA support

These examples demonstrate the full workflow including model export, quantization options, building runners, and runtime execution.

ExecuTorch provides Makefile targets for building these example runners:

```bash
make whisper-cuda   # Build Whisper runner with CUDA
make voxtral-cuda   # Build Voxtral runner with CUDA
make gemma3-cuda    # Build Gemma3 runner with CUDA
```

examples/cuda/scripts/export.py

Lines changed: 6 additions & 15 deletions
@@ -21,8 +21,6 @@
 from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower

 from executorch.extension.export_util.utils import save_pte_program
-from torch._inductor.decomposition import conv1d_to_conv2d
-from torch.nn.attention import SDPBackend

 # Script to export a model with CUDA delegation.

@@ -88,24 +86,17 @@ def main():
         kwargs=example_kwargs,
         dynamic_shapes=dynamic_shapes,
     )
-    print(exported_programs)

     partitioner = CudaPartitioner(
         [CudaBackend.generate_method_name_compile_spec(args.model_name)]
     )
-    # Add decompositions for triton to generate kernels.
-    exported_programs = exported_programs.run_decompositions(
-        {
-            torch.ops.aten.conv1d.default: conv1d_to_conv2d,
-        }
+
+    et_prog = to_edge_transform_and_lower(
+        exported_programs,
+        partitioner=[partitioner],
+        compile_config=_EDGE_COMPILE_CONFIG,
+        generate_etrecord=args.generate_etrecord,
     )
-    with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]):
-        et_prog = to_edge_transform_and_lower(
-            exported_programs,
-            partitioner=[partitioner],
-            compile_config=_EDGE_COMPILE_CONFIG,
-            generate_etrecord=args.generate_etrecord,
-        )
     exec_program = et_prog.to_executorch()
     save_pte_program(exec_program, args.model_name, args.output_dir)
     if args.generate_etrecord:

examples/demo-apps/react-native/rnllama/package.json

Lines changed: 2 additions & 1 deletion
@@ -55,7 +55,8 @@
   "private": true,
   "resolutions": {
     "cookie": ">=0.7.0",
-    "glob": "^10.5.0"
+    "glob": "^10.5.0",
+    "lodash": ">=4.17.23"
   },
   "packageManager": "yarn@1.22.22+sha512.a6b2f7906b721bba3d67d4aff083df04dad64c399707841b7acf00f6b133b7ac24255f2652fa22ae3534329dc6180534e98d17432037ff6fd140556e2bb3137e"
 }

examples/demo-apps/react-native/rnllama/yarn.lock

Lines changed: 3 additions & 3 deletions
@@ -4887,9 +4887,9 @@ lodash.throttle@^4.1.1:
   integrity sha512-wIkUCfVKpVsWo3JSZlc+8MB5it+2AN5W8J7YVMST30UrvcQNZ1Okbj+rbVniijTWE6FGYy4XJq/rHkas8qJMLQ==

 lodash@^4.17.19, lodash@^4.17.21:
-  version "4.17.21"
-  resolved "https://registry.yarnpkg.com/lodash/-/lodash-4.17.21.tgz#679591c564c3bffaae8454cf0b3df370c3d6911c"
-  integrity sha512-v2kDEe57lecTulaDIuNTPy3Ry4gLGJ6Z1O3vE1krgXZNrsQ+LFTGHVxVjcXPs17LhbZVGedAJv8XZ1tvj5FvSg==
+  version "4.17.23"
+  resolved "https://registry.yarnpkg.com/lodash/-/lodash-4.17.23.tgz#f113b0378386103be4f6893388c73d0bde7f2c5a"
+  integrity sha512-LgVTMpQtIopCi79SJeDiP0TfWi5CNEc/L/aRdTh3yIvmZXTnheWpKjSZhnvMl8iXbC1tFg9gdHHDMLoV7CnG+w==

 log-symbols@^2.2.0:
   version "2.2.0"
