Skip to content

Latest commit

 

History

History
251 lines (192 loc) · 7.13 KB

File metadata and controls

251 lines (192 loc) · 7.13 KB

CK Tile Comprehensive Tutorial Index

This tutorial series provides an in-depth understanding of the Composable Kernel (CK) Tile programming model, with enhanced Python implementations that include C++ code correspondence, visualizations, and step-by-step explanations.

Overview

The CK Tile programming model is a high-performance abstraction for GPU kernel development that provides:

  • Tile-based data distribution across GPU compute units
  • Efficient memory access patterns with coalescing and vectorization
  • Flexible tensor operations for complex algorithms
  • Hardware-optimized implementations for AMD GPUs

Tutorial Structure

1. Tile Distribution Tutorial (tile_distribution_tutorial.py)

Learn how tensor data is distributed across GPU processing elements.

Key Concepts:

  • Coordinate systems (X, Y, P, R)
  • Thread-to-data mapping
  • Warp and block-level distribution
  • Space-filling curves and swizzling

Highlights:

  • Interactive visualizations of distribution patterns
  • C++ code snippets showing actual CK implementations
  • Common patterns: GEMM, convolution, reduction
  • Performance optimization strategies

Example Usage:

from pytensor.tile_distribution_tutorial import run_interactive_tutorial
run_interactive_tutorial()

2. Tile Window Tutorial (tile_window_tutorial.py)

Master the tile window abstraction for efficient tensor access.

Key Concepts:

  • Window views into tensor memory
  • Load/store operations
  • Memory coalescing
  • Boundary handling

Highlights:

  • Step-by-step memory access visualization
  • Vectorized I/O demonstrations
  • Data layout impact on performance
  • Complete GEMM example with tile windows

Example Usage:

from pytensor.tile_window_tutorial import run_tile_window_tutorial
run_tile_window_tutorial()

3. Tensor Operations Tutorial (tensor_operations_tutorial.py)

Explore the complete set of tensor operations available in CK.

Key Concepts:

  • load_tile / store_tile
  • shuffle_tile (inter-thread communication)
  • update_tile (element-wise operations)
  • sweep_tile (reductions and scans)

Highlights:

  • Operation lifecycle visualization
  • Fusion strategies for performance
  • Real-world examples (GEMM, LayerNorm)
  • Performance optimization checklist

Example Usage:

from pytensor.tensor_operations_tutorial import run_tensor_operations_tutorial
run_tensor_operations_tutorial()

Learning Path

Beginner Path

  1. Start with tile_distribution_tutorial.py - Section 1 (Core Concepts)
  2. Move to tile_window_tutorial.py - Section 1 (Basic Operations)
  3. Try simple examples in tensor_operations_tutorial.py - Section 1 (Load/Store)

Intermediate Path

  1. Study GEMM distribution patterns in tile_distribution_tutorial.py
  2. Understand memory coalescing in tile_window_tutorial.py
  3. Learn operation fusion in tensor_operations_tutorial.py

Advanced Path

  1. Master hierarchical tiling and swizzling patterns
  2. Optimize memory access with vectorization
  3. Implement custom kernels using the full operation set

C++ Integration

Each tutorial module includes extensive C++ code snippets that show:

  1. Direct CK Library Usage

    // From actual CK headers
    template <typename TileDistribution>
    struct tile_window_with_static_distribution {
        // Implementation details with explanations
    };
  2. Kernel Implementation Patterns

    // Complete kernel examples
    template <typename TileShape>
    __global__ void gemm_kernel(...) {
        // Step-by-step implementation
    }
  3. Performance Optimizations

    // Vectorized access, shuffle operations, etc.
    using float4 = vector_type<float, 4>::type;

Key Features of Enhanced Tutorials

1. Progressive Complexity

  • Start with simple 2D examples
  • Build up to complex 3D tensor operations
  • Real-world kernel implementations

2. Interactive Visualizations

  • Matplotlib-based diagrams
  • Thread-to-memory mapping
  • Performance comparisons

3. Comprehensive Explanations

  • Detailed docstrings
  • Step-by-step execution traces
  • Common pitfalls and solutions

4. Performance Focus

  • Bandwidth utilization analysis
  • Optimization checklists
  • Hardware-specific considerations

Prerequisites

Software Requirements

  • Python 3.8+
  • NumPy
  • Matplotlib
  • (Optional) ROCm SDK for running actual C++ code

Knowledge Requirements

  • Basic understanding of GPU architecture
  • Familiarity with parallel programming concepts
  • C++ knowledge helpful but not required

Getting Started

  1. Import the tutorials:

    import sys
    sys.path.append('path/to/pythonck')
    
    from pytensor import tile_distribution_tutorial
    from pytensor import tile_window_tutorial
    from pytensor import tensor_operations_tutorial
  2. Run interactive tutorials:

    # Start with tile distribution
    tile_distribution_tutorial.run_interactive_tutorial()
  3. Explore specific concepts:

    # Deep dive into memory coalescing
    from pytensor.tile_window_tutorial import MemoryAccessPatterns
    MemoryAccessPatterns.demonstrate_coalescing()

Advanced Topics

Custom Kernel Development

After completing the tutorials, you'll be able to:

  • Design efficient tile distributions for your algorithms
  • Implement high-performance kernels using CK abstractions
  • Optimize memory access patterns
  • Debug and profile GPU kernels

Integration with CK Library

The Python tutorials directly correspond to C++ CK usage:

# Python tutorial code
dist = TileDistributionTutorial(...)
window = TileWindowTutorial(...)

# Corresponds to C++ CK code
tile_distribution<...> dist{...};
tile_window<...> window{...};

Contributing

To extend these tutorials:

  1. Add New Operations

    • Implement in Python following the existing pattern
    • Include C++ correspondence
    • Add visualizations
  2. Create Domain-Specific Examples

    • Machine learning operations
    • Scientific computing kernels
    • Image processing algorithms
  3. Improve Visualizations

    • Add animation support
    • 3D visualizations for complex patterns
    • Performance profiling graphs

Resources

CK Documentation

Related Tutorials

  • GEMM optimization guides
  • Tensor operation fusion strategies
  • GPU performance analysis

Summary

These enhanced tutorials provide a comprehensive learning experience for the CK Tile programming model by:

  1. Bridging Theory and Practice - Python implementations with C++ code
  2. Visual Learning - Extensive visualizations and diagrams
  3. Hands-on Experience - Interactive examples and exercises
  4. Performance Focus - Optimization strategies and best practices

Start your journey with the tile distribution tutorial and progressively build your understanding of high-performance GPU kernel development with CK!


Note: These tutorials are designed to complement the official CK documentation and provide an accessible learning path for developers new to the CK programming model.