# CUDA Backend

The CUDA backend is the ExecuTorch solution for running models on NVIDIA GPUs. It leverages the [AOTInductor](https://pytorch.org/docs/stable/torch.compiler_aot_inductor.html) compiler to generate optimized CUDA kernels with libtorch-free execution, and uses [Triton](https://triton-lang.org/) for high-performance GPU kernel generation.

## Features

- **Optimized GPU Execution**: Uses AOTInductor to generate highly optimized CUDA kernels for model operators
- **Triton Kernel Support**: Leverages Triton for GEMM (General Matrix Multiply), convolution, and SDPA (Scaled Dot-Product Attention) kernels
- **Quantization Support**: INT4 weight quantization with tile-packed format for improved performance and reduced memory footprint
- **Cross-Platform**: Supports both Linux and Windows platforms
- **Multiple Model Support**: Works with various models including LLMs, vision-language models, and audio models

## Target Requirements

Below are the requirements for running a CUDA-delegated ExecuTorch model:

- **Hardware**: NVIDIA GPU with a supported CUDA compute capability
- **CUDA Toolkit**: CUDA 11.x or later (CUDA 12.x recommended)
- **Operating System**: Linux or Windows
- **Drivers**: PyTorch-compatible NVIDIA GPU drivers installed
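
A quick way to sanity-check these requirements on the target machine is to query the driver and PyTorch's view of the GPU (a minimal check, assuming PyTorch is already installed):

```bash
# Check that the NVIDIA driver is installed and the GPU is visible
nvidia-smi

# Check that PyTorch was built with CUDA support and can see the device
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```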

## Development Requirements

To develop and export models using the CUDA backend:

- **Python**: Python 3.8+
- **PyTorch**: PyTorch with CUDA support
- **ExecuTorch**: ExecuTorch built with CUDA backend support
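
As a rough setup sketch (the CUDA version in the index URL is illustrative, and the exact ExecuTorch installation steps may differ; see the ExecuTorch installation instructions for the authoritative commands):

```bash
# Install a CUDA-enabled PyTorch build; pick the index URL matching your CUDA toolkit
pip install torch --index-url https://download.pytorch.org/whl/cu126

# Build and install ExecuTorch from source so the CUDA backend can be enabled
git clone https://github.com/pytorch/executorch.git
cd executorch
./install_executorch.sh
```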

## Using the CUDA Backend

### Exporting Models with Python API

The CUDA backend uses the `CudaBackend` and `CudaPartitioner` classes to export models. Here is a complete example:

```python
import torch
from executorch.backends.cuda.cuda_backend import CudaBackend
from executorch.backends.cuda.cuda_partitioner import CudaPartitioner
from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower
from executorch.extension.export_util.utils import save_pte_program

# Configure edge compilation
edge_compile_config = EdgeCompileConfig(
    _check_ir_validity=False,
    _skip_dim_order=True,
)

# Define your model
model = YourModel().eval()
model_name = "your_model"
example_inputs = (torch.randn(1, 3, 224, 224),)

# Export the model using torch.export
exported_program = torch.export.export(model, example_inputs)

# Create the CUDA partitioner; the compile spec records the method name for the backend
partitioner = CudaPartitioner(
    [CudaBackend.generate_method_name_compile_spec(model_name)]
)

# Run decompositions so Triton can generate kernels for the remaining ops.
# conv1d_to_conv2d is a decomposition helper; see the export script linked
# below for a working definition.
exported_program = exported_program.run_decompositions({
    torch.ops.aten.conv1d.default: conv1d_to_conv2d,
})

# Lower to ExecuTorch with the CUDA backend
et_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[partitioner],
    compile_config=edge_compile_config,
)

# Convert to an executable program and save it
exec_program = et_program.to_executorch()
save_pte_program(exec_program, model_name, "./output_dir")
```

This generates `.pte` and `.ptd` files that can be executed on CUDA devices.

For a complete working example, see the [CUDA export script](https://github.com/pytorch/executorch/blob/main/examples/cuda/scripts/export.py).

----

## Runtime Integration

To run the model on device, use the standard ExecuTorch runtime APIs. See [Running on Device](getting-started.md#running-on-device) for more information.

When building from source, pass `-DEXECUTORCH_BUILD_CUDA=ON` when configuring the CMake build to compile the CUDA backend, then link the `aoti_cuda_backend` target into your application:

```cmake
# CMakeLists.txt
add_subdirectory("executorch")
...
target_link_libraries(
  my_target
  PRIVATE executorch
  extension_module_static
  extension_tensor
  aoti_cuda_backend)
```
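
A typical configure-and-build invocation for a source build might look like the following (directory names and any extra options are illustrative):

```bash
# Configure an out-of-source build with the CUDA backend enabled
cmake -S . -B cmake-out -DCMAKE_BUILD_TYPE=Release -DEXECUTORCH_BUILD_CUDA=ON

# Compile
cmake --build cmake-out --parallel
```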

No additional steps are necessary to use the backend beyond linking the target. CUDA-delegated `.pte` and `.ptd` files will automatically run on the registered backend.
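
As a minimal sketch of the runtime side, the `Module` and tensor extension APIs linked above can load and run the exported files roughly as follows (file names and the input shape are illustrative, and the exact `Module` constructor overloads may vary by ExecuTorch version):

```cpp
// Minimal sketch: load a CUDA-delegated program plus its weight data and run it.
// "model.pte" / "model.ptd" are illustrative names from the export step above.
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

#include <iostream>
#include <vector>

using executorch::extension::Module;
using executorch::extension::from_blob;

int main() {
  // The second argument points at the .ptd file holding the externally saved weights.
  Module module("model.pte", "model.ptd");

  // Build an input tensor matching the export-time shape (1, 3, 224, 224).
  std::vector<float> input(1 * 3 * 224 * 224, 1.0f);
  auto tensor = from_blob(input.data(), {1, 3, 224, 224});

  // Run the "forward" method; delegated portions execute on the CUDA backend.
  const auto result = module.forward(tensor);
  if (result.ok()) {
    const auto& output = result->at(0).toTensor();
    std::cout << "Output elements: " << output.numel() << std::endl;
  } else {
    std::cerr << "Inference failed" << std::endl;
  }
  return 0;
}
```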

----

## Examples

For complete end-to-end examples of exporting and running models with the CUDA backend, see:

- [Whisper](https://github.com/pytorch/executorch/blob/main/examples/models/whisper/README.md) — Audio transcription model with CUDA support
- [Voxtral](https://github.com/pytorch/executorch/blob/main/examples/models/voxtral/README.md) — Audio multimodal model with CUDA support
- [Gemma3](https://github.com/pytorch/executorch/blob/main/examples/models/gemma3/README.md) — Vision-language model with CUDA support

These examples demonstrate the full workflow including model export, quantization options, building runners, and runtime execution.

ExecuTorch provides Makefile targets for building these example runners:

```bash
make whisper-cuda  # Build Whisper runner with CUDA
make voxtral-cuda  # Build Voxtral runner with CUDA
make gemma3-cuda   # Build Gemma3 runner with CUDA
```