This tutorial series is an in-depth introduction to the Composable Kernel (CK) Tile programming model. It uses enhanced Python implementations that include C++ code correspondence, visualizations, and step-by-step explanations.
The CK Tile programming model is a high-performance abstraction for GPU kernel development that provides:
- Tile-based data distribution across GPU compute units
- Efficient memory access patterns with coalescing and vectorization
- Flexible tensor operations for complex algorithms
- Hardware-optimized implementations for AMD GPUs
Learn how tensor data is distributed across GPU processing elements.
Key Concepts:
- Coordinate systems (X, Y, P, R)
- Thread-to-data mapping
- Warp and block-level distribution
- Space-filling curves and swizzling
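The thread-to-data mapping can be made concrete with a small NumPy sketch. This is not CK's actual distribution encoding: the tile shape, the warp and block sizes, and the helper name `thread_to_tile_coords` are all assumptions chosen for illustration. P is modeled as (warp, lane), Y as the index inside the vector a thread owns, and X as the (row, col) position in the tile; replication (R) is not modeled.

```python
import numpy as np

# All sizes below are illustrative assumptions, not CK defaults.
TILE_ROWS, TILE_COLS = 16, 64   # X space: extent of the tile
WARP_SIZE, NUM_WARPS = 64, 4    # P space: lanes per warp, warps per block
VEC = 4                         # Y space: contiguous elements owned per thread


def thread_to_tile_coords(warp_id, lane_id, y):
    """Hypothetical mapping from (P, Y) = ((warp_id, lane_id), y) to X = (row, col)."""
    lanes_per_row = TILE_COLS // VEC            # 16 lanes cover one row
    rows_per_warp = WARP_SIZE // lanes_per_row  # each warp covers 4 rows
    row = warp_id * rows_per_warp + lane_id // lanes_per_row
    col = (lane_id % lanes_per_row) * VEC + y
    return row, col


# Record which flat thread id owns each tile element.
owner = np.full((TILE_ROWS, TILE_COLS), -1, dtype=np.int64)
for warp_id in range(NUM_WARPS):
    for lane_id in range(WARP_SIZE):
        for y in range(VEC):
            row, col = thread_to_tile_coords(warp_id, lane_id, y)
            assert owner[row, col] == -1  # no element is claimed twice
            owner[row, col] = warp_id * WARP_SIZE + lane_id
```

Because adjacent lanes own adjacent 4-element vectors along the fastest dimension, this layout is also the one that favors coalesced, vectorized access.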
Highlights:
- Interactive visualizations of distribution patterns
- C++ code snippets showing actual CK implementations
- Common patterns: GEMM, convolution, reduction
- Performance optimization strategies
Example Usage:
from pytensor.tile_distribution_tutorial import run_interactive_tutorial
run_interactive_tutorial()
Master the tile window abstraction for efficient tensor access.
Key Concepts:
- Window views into tensor memory
- Load/store operations
- Memory coalescing
- Boundary handling
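As a rough illustration of window views and boundary handling, the sketch below loads and stores a rectangular window of a 2D array and pads any part of the window that falls outside the tensor. The helpers `load_window` and `store_window` are hypothetical names for this example, not the pytensor or CK API, and the zero-padding policy is an assumption.

```python
import numpy as np


def load_window(tensor, origin, window_shape, pad_value=0.0):
    """Copy a window starting at `origin`; out-of-bounds elements get pad_value.

    Assumes a 2D tensor and a non-negative origin.
    """
    r0, c0 = origin
    h, w = window_shape
    tile = np.full((h, w), pad_value, dtype=tensor.dtype)
    r1 = min(r0 + h, tensor.shape[0])
    c1 = min(c0 + w, tensor.shape[1])
    if r1 > r0 and c1 > c0:
        tile[: r1 - r0, : c1 - c0] = tensor[r0:r1, c0:c1]
    return tile


def store_window(tensor, origin, tile):
    """Write `tile` back at `origin`, dropping elements that fall out of bounds."""
    r0, c0 = origin
    r1 = min(r0 + tile.shape[0], tensor.shape[0])
    c1 = min(c0 + tile.shape[1], tensor.shape[1])
    if r1 > r0 and c1 > c0:
        tensor[r0:r1, c0:c1] = tile[: r1 - r0, : c1 - c0]


a = np.arange(30, dtype=np.float32).reshape(5, 6)
# This window overhangs the bottom-right corner of `a`.
edge = load_window(a, origin=(4, 4), window_shape=(2, 4))

out = np.zeros_like(a)
store_window(out, (4, 4), edge + 1)
```

Only the two in-bounds elements of the window are read from `a` and written to `out`; the rest of the tile holds the padding value.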
Highlights:
- Step-by-step memory access visualization
- Vectorized I/O demonstrations
- Data layout impact on performance
- Complete GEMM example with tile windows
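For orientation, a tiled GEMM can be sketched in plain NumPy as below. This is a minimal host-side model of the tiling loop structure only; it does not use the tutorial's tile window classes, and the function name `tiled_gemm` and the tile sizes are assumptions. NumPy slicing clips at the array edge, which stands in for boundary handling here.

```python
import numpy as np


def tiled_gemm(a, b, tile_m=4, tile_n=4, tile_k=4):
    """Compute a @ b one output tile at a time, accumulating over K tiles."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            # Accumulator for one output tile (smaller at the edges).
            acc = np.zeros((min(tile_m, m - i), min(tile_n, n - j)), dtype=c.dtype)
            for p in range(0, k, tile_k):
                a_tile = a[i:i + tile_m, p:p + tile_k]  # window into A
                b_tile = b[p:p + tile_k, j:j + tile_n]  # window into B
                acc += a_tile @ b_tile
            c[i:i + tile_m, j:j + tile_n] = acc         # store the output tile
    return c


rng = np.random.default_rng(0)
a = rng.standard_normal((10, 7))
b = rng.standard_normal((7, 9))
c = tiled_gemm(a, b)
```

The matrix sizes are deliberately not multiples of the tile sizes, so the edge tiles exercise the partial-window case.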
Example Usage:
from pytensor.tile_window_tutorial import run_tile_window_tutorial
run_tile_window_tutorial()
Explore the complete set of tensor operations available in CK.
Key Concepts:
- load_tile / store_tile
- shuffle_tile (inter-thread communication)
- update_tile (element-wise operations)
- sweep_tile (reductions and scans)
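The sketch below shows how three of these operations might compose into a load, reduce, update, store sequence. The function names mirror the list above, but the signatures and semantics are assumptions made for illustration; they are not the CK or pytensor API. `shuffle_tile` is omitted because this single-threaded model has no inter-thread communication.

```python
import numpy as np


def load_tile(src, origin, shape):
    """Hypothetical stand-in: copy a tile out of `src`."""
    r, c = origin
    return src[r:r + shape[0], c:c + shape[1]].copy()


def store_tile(dst, origin, tile):
    """Hypothetical stand-in: write a tile into `dst`."""
    r, c = origin
    dst[r:r + tile.shape[0], c:c + tile.shape[1]] = tile


def update_tile(tile, fn):
    """Hypothetical stand-in: apply `fn` to the tile in place."""
    tile[...] = fn(tile)


def sweep_tile(tile, visit):
    """Hypothetical stand-in: call `visit(index, value)` for every element."""
    for idx in np.ndindex(tile.shape):
        visit(idx, tile[idx])


src = np.arange(16, dtype=np.float64).reshape(4, 4)
dst = np.zeros_like(src)

tile = load_tile(src, (0, 0), (2, 4))
row_sums = np.zeros(tile.shape[0])


def accumulate(idx, value):
    # Row-wise reduction built on top of the sweep.
    row_sums[idx[0]] += value


sweep_tile(tile, accumulate)
update_tile(tile, lambda t: t / row_sums[:, None])  # normalize each row
store_tile(dst, (0, 0), tile)
```

Keeping the tile in a local buffer between the reduction and the element-wise update is the same idea that operation fusion exploits: the data is loaded once and stored once.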
Highlights:
- Operation lifecycle visualization
- Fusion strategies for performance
- Real-world examples (GEMM, LayerNorm)
- Performance optimization checklist
Example Usage:
from pytensor.tensor_operations_tutorial import run_tensor_operations_tutorial
run_tensor_operations_tutorial()
- Start with tile_distribution_tutorial.py - Section 1 (Core Concepts)
- Move to tile_window_tutorial.py - Section 1 (Basic Operations)
- Try simple examples in tensor_operations_tutorial.py - Section 1 (Load/Store)
- Study GEMM distribution patterns in tile_distribution_tutorial.py
- Understand memory coalescing in tile_window_tutorial.py
- Learn operation fusion in tensor_operations_tutorial.py
- Master hierarchical tiling and swizzling patterns
- Optimize memory access with vectorization
- Implement custom kernels using the full operation set
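As one example of a swizzling pattern, the sketch below models an XOR swizzle, a common way to avoid shared-memory bank conflicts when threads read down a column. The bank count of 32, the row width, and the helper names are assumptions for illustration; actual bank counts and the swizzles CK uses depend on the hardware and the kernel.

```python
import numpy as np

NUM_BANKS = 32  # assumed bank count; real hardware may differ


def bank_linear(row, col, width=NUM_BANKS):
    """Bank index of element (row, col) in a plain row-major layout."""
    return (row * width + col) % NUM_BANKS


def swizzled_col(row, col):
    """XOR swizzle: permute columns within a row based on the row index."""
    return col ^ (row % NUM_BANKS)


def bank_swizzled(row, col, width=NUM_BANKS):
    """Bank index of element (row, col) after the XOR swizzle."""
    return (row * width + swizzled_col(row, col)) % NUM_BANKS


rows = np.arange(NUM_BANKS)
col = 5  # every thread reads the same logical column, one row each

linear_banks = {bank_linear(int(r), col) for r in rows}
swizzled_banks = {bank_swizzled(int(r), col) for r in rows}
```

In the plain layout every row of the column maps to the same bank, while the swizzled layout spreads the same 32 accesses over 32 distinct banks. Within any one row the swizzle is only a permutation of columns, so no data is lost or duplicated.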
Each tutorial module includes extensive C++ code snippets that show:
- Direct CK Library Usage
    // From actual CK headers
    template <typename TileDistribution>
    struct tile_window_with_static_distribution {
        // Implementation details with explanations
    };
- Kernel Implementation Patterns
    // Complete kernel examples
    template <typename TileShape>
    __global__ void gemm_kernel(...) {
        // Step-by-step implementation
    }
- Performance Optimizations
    // Vectorized access, shuffle operations, etc.
    using float4 = vector_type<float, 4>::type;
- Start with simple 2D examples
- Build up to complex 3D tensor operations
- Real-world kernel implementations
- Matplotlib-based diagrams
- Thread-to-memory mapping
- Performance comparisons
- Detailed docstrings
- Step-by-step execution traces
- Common pitfalls and solutions
- Bandwidth utilization analysis
- Optimization checklists
- Hardware-specific considerations
- Python 3.8+
- NumPy
- Matplotlib
- (Optional) ROCm SDK for running actual C++ code
- Basic understanding of GPU architecture
- Familiarity with parallel programming concepts
- C++ knowledge helpful but not required
- Import the tutorials:
    import sys
    sys.path.append('path/to/pythonck')
    from pytensor import tile_distribution_tutorial
    from pytensor import tile_window_tutorial
    from pytensor import tensor_operations_tutorial
- Run interactive tutorials:
    # Start with tile distribution
    tile_distribution_tutorial.run_interactive_tutorial()
- Explore specific concepts:
    # Deep dive into memory coalescing
    from pytensor.tile_window_tutorial import MemoryAccessPatterns
    MemoryAccessPatterns.demonstrate_coalescing()
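To build intuition for memory coalescing before running the tutorial, the following standalone sketch counts how many memory segments one wavefront touches under two access patterns. It is a simplified model, not what `demonstrate_coalescing` does internally: the 128-byte segment size, the 64-lane wavefront, and the helper `count_segments` are assumptions.

```python
import numpy as np


def count_segments(byte_addresses, segment_bytes=128):
    """Number of distinct memory segments touched by a set of byte addresses."""
    return len(np.unique(np.asarray(byte_addresses) // segment_bytes))


WAVE = 64   # assumed lanes per wavefront
ELEM = 4    # bytes per float32 element
lanes = np.arange(WAVE)

# Lane i reads element i: neighbouring lanes read neighbouring addresses.
contiguous = lanes * ELEM
# Lane i reads element i * 64: for example, walking down a column of a
# row-major matrix with 64 columns.
strided = lanes * 64 * ELEM

contiguous_segments = count_segments(contiguous)
strided_segments = count_segments(strided)
```

Under these assumptions the contiguous pattern touches 2 segments and the strided pattern touches 64, which is the gap that tile distributions and tile windows are designed to close.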
After completing the tutorials, you'll be able to:
- Design efficient tile distributions for your algorithms
- Implement high-performance kernels using CK abstractions
- Optimize memory access patterns
- Debug and profile GPU kernels
The Python tutorials directly correspond to C++ CK usage:
# Python tutorial code
dist = TileDistributionTutorial(...)
window = TileWindowTutorial(...)
# Corresponds to C++ CK code
tile_distribution<...> dist{...};
tile_window<...> window{...};
To extend these tutorials:
- Add New Operations
  - Implement in Python following the existing pattern
  - Include C++ correspondence
  - Add visualizations
- Create Domain-Specific Examples
  - Machine learning operations
  - Scientific computing kernels
  - Image processing algorithms
- Improve Visualizations
  - Add animation support
  - 3D visualizations for complex patterns
  - Performance profiling graphs
- GEMM optimization guides
- Tensor operation fusion strategies
- GPU performance analysis
These enhanced tutorials provide a comprehensive learning experience for the CK Tile programming model by:
- Bridging Theory and Practice - Python implementations with C++ code
- Visual Learning - Extensive visualizations and diagrams
- Hands-on Experience - Interactive examples and exercises
- Performance Focus - Optimization strategies and best practices
Start your journey with the tile distribution tutorial and progressively build your understanding of high-performance GPU kernel development with CK!
Note: These tutorials are designed to complement the official CK documentation and provide an accessible learning path for developers new to the CK programming model.