Conversation
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
* Implements two component histogram: SLM -> global memory with duplicate SLM bins to reduce atomic contention Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…ons, and fix slm test bug
This reverts commit 6dd2fb8.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
There was a problem hiding this comment.
Pull request overview
This PR adds a new SYCL implementation of the onesweep radix sort kernel template, refactors shared ESIMD/SYCL infrastructure, and unifies the KT test harness to run against either backend.
Changes:
- Introduces SYCL onesweep radix sort implementation and integrates it into the kernel templates public header.
- Refactors ESIMD radix sort internals into shared dispatcher/submitter/kernel components with tag-based dispatch.
- Updates KT tests and CMake generation to build/run both ESIMD and SYCL variants via a unified test source set.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| test/kt/single_pass_scan.cpp | Switches test include to unified radix sort KT test utilities header. |
| test/kt/radix_sort_utils.h | Adds backend namespace aliases and new backend-aware SLM sizing logic for test skipping. |
| test/kt/radix_sort_out_of_place.cpp | Updates tests to call backend-selected namespace (ESIMD vs SYCL) and removes local can_run_test. |
| test/kt/radix_sort_by_key_out_of_place.cpp | Updates by-key out-of-place tests to call backend-selected namespace. |
| test/kt/radix_sort_by_key.cpp | Updates by-key in-place tests to call backend-selected namespace. |
| test/kt/radix_sort.cpp | Updates key-only tests to backend-selected namespace and adds ESIMD-only deprecated-namespace coverage path. |
| test/kt/CMakeLists.txt | Generalizes sort test generation to support both ESIMD and SYCL variants with backend compile definitions. |
| include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort_one_wg.h | Extends subgroup radix sort to support explicit destination output range. |
| include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort.h | Updates callers to pass a destination range to one-work-group radix sort path. |
| include/oneapi/dpl/experimental/kt/sycl_radix_sort.h | Adds the public SYCL KT API surface for radix_sort and radix_sort_by_key (in-/out-of-place). |
| include/oneapi/dpl/experimental/kt/internal/sycl_radix_sort_kernels.h | Adds SYCL onesweep kernels (global histogram + onesweep reorder/lookback) implementation. |
| include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h | Updates sub-group scan helper to accept either lazy storage or plain types and changes backend include. |
| include/oneapi/dpl/experimental/kt/internal/radix_sort_utils.h | Adds shared SYCL scalar utilities, tags, and parameter validation used by both backends. |
| include/oneapi/dpl/experimental/kt/internal/radix_sort_submitters.h | Adds shared submitters/launch logic for ESIMD and SYCL kernels, including one-WG fallback. |
| include/oneapi/dpl/experimental/kt/internal/radix_sort_kernels.h | Adds shared forward declarations of kernel functors for tag-based compilation. |
| include/oneapi/dpl/experimental/kt/internal/radix_sort_dispatchers.h | Refactors dispatchers into shared namespace with tag dispatch for ESIMD vs SYCL. |
| include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_utils.h | Moves ESIMD internals into shared kt::gpu::__impl namespace for unified use. |
| include/oneapi/dpl/experimental/kt/internal/esimd_defs.h | Moves ESIMD constants into shared kt::gpu::__impl namespace for unified use. |
| include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_kernels.h | Converts ESIMD free functions into tagged kernel functors and unifies namespace/internals. |
| include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_submitters.h | Removes the now-replaced ESIMD-only submitters header (replaced by unified submitters). |
| include/oneapi/dpl/experimental/kt/esimd_radix_sort.h | Updates ESIMD public API to route through shared dispatchers with explicit __esimd_tag. |
| include/oneapi/dpl/experimental/kernel_templates | Exposes the new SYCL radix sort KT header in the kernel templates umbrella include. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| bool __is_leader = __this_round_rank == __this_round_count - 1; | ||
| if (__is_leader) | ||
| { | ||
| __slm_offset[__bin] = __rank_after + 1; | ||
| } |
There was a problem hiding this comment.
__slm_offset[__bin] is read by multiple work-items and written by one work-item without any sub-group synchronization or atomic semantics. Under SYCL’s memory model this is a data race (concurrent read/write to local memory), even if it “usually works” on lockstep implementations. A tangible fix is to separate the read and the leader write with sub-group barriers (so no reads overlap the write), or to use sycl::atomic_ref for the leader update and arrange the rank computation so readers don’t race with the writer.
| bool __is_leader = __this_round_rank == __this_round_count - 1; | |
| if (__is_leader) | |
| { | |
| __slm_offset[__bin] = __rank_after + 1; | |
| } | |
| bool __is_leader = __this_round_rank == __this_round_count - 1; | |
| // Ensure all work-items have completed their read of __slm_offset[__bin] | |
| // before any work-item performs the write, and that all see the updated | |
| // value before the next iteration. | |
| sycl::group_barrier(__sub_group); | |
| if (__is_leader) | |
| { | |
| __slm_offset[__bin] = __rank_after + 1; | |
| } | |
| sycl::group_barrier(__sub_group); |
| // ESIMD kernel uses simple reorder buffer | ||
| std::size_t slm_alloc_size = sizeof(KeyT) * param.data_per_workitem * param.workgroup_size; | ||
| if constexpr (!std::is_void_v<ValueT>) | ||
| slm_alloc_size += sizeof(ValueT) * param.data_per_workitem * param.workgroup_size; |
There was a problem hiding this comment.
The ESIMD SLM estimate here only accounts for the reorder buffer, but the ESIMD one-WG path explicitly initializes SLM with max(reorder, bin_hist + incoming_offsets) (see __one_wg_kernel in esimd_radix_sort_kernels.h). This can cause can_run_test to incorrectly allow configurations that actually exceed device local memory and trigger LLVM “SLM size exceeds target limits” errors. Update the ESIMD branch to mirror the kernel’s SLM requirement (at least the max(reorder, bin_hist + incoming) sizing, and include value storage where applicable).
| // ESIMD kernel uses simple reorder buffer | |
| std::size_t slm_alloc_size = sizeof(KeyT) * param.data_per_workitem * param.workgroup_size; | |
| if constexpr (!std::is_void_v<ValueT>) | |
| slm_alloc_size += sizeof(ValueT) * param.data_per_workitem * param.workgroup_size; | |
| // ESIMD kernel SLM layout (one-WG path) uses: | |
| // max(reorder_buffer, bin_hist + incoming_offsets) | |
| // | |
| // Reorder buffer for keys (and optionally values) | |
| std::size_t reorder_size = sizeof(KeyT) * param.data_per_workitem * param.workgroup_size; | |
| if constexpr (!std::is_void_v<ValueT>) | |
| reorder_size += sizeof(ValueT) * param.data_per_workitem * param.workgroup_size; | |
| // Model bin histogram and incoming offsets for 8-bit radix (256 bins). | |
| // This mirrors the minimum structure used in the ESIMD one-WG kernel. | |
| using _OffsetT = std::uint32_t; | |
| constexpr std::size_t bin_count = 1u << 8; // 8 radix bits | |
| const std::size_t bin_hist_size = bin_count * sizeof(_OffsetT); | |
| const std::size_t incoming_offsets_size = bin_count * sizeof(_OffsetT); | |
| const std::size_t slm_alloc_size = | |
| std::max(reorder_size, bin_hist_size + incoming_offsets_size); |
| set(_base_file_by_key_all "radix_sort_by_key" "radix_sort_by_key_out_of_place") | ||
|
|
||
| # Variant-specific configurations | ||
| if (${_variant} STREQUAL "sycl") |
There was a problem hiding this comment.
In CMake, dereferencing in if (${_variant} STREQUAL "sycl") can misbehave when the variable is empty or contains list separators. Prefer quoting the dereference (e.g., if("${_variant}" STREQUAL "sycl")) to avoid unexpected parsing and to align with common CMake best practices.
| if (${_variant} STREQUAL "sycl") | |
| if ("${_variant}" STREQUAL "sycl") |
| endforeach() | ||
| else() | ||
| foreach (_base_file ${_base_file_all}) | ||
| # ~1-2 test: (10/1000) * (16 * 6 * 2) |
There was a problem hiding this comment.
The comment describing the expected test count/probability still references (10/1000) but the actual probability passed is now 2. Please update the comment to match the new probability to avoid confusion when interpreting test generation coverage.
| # ~1-2 test: (10/1000) * (16 * 6 * 2) | |
| # ~1-2 test: (2/1000) * (16 * 6 * 2) |
This PR adds a SYCL KT based on Onesweep. It adapts the ESIMD implementation with the following changes:
sycl_ext_oneapi_forward_progressextension and iterative decoupled lookback is performed to guarantee hardware safety. This resolves catastrophic errors encountered during BMG stress testing and slightly improves performance for smaller multi-work group cases due to the removal of the work group atomic id counter.Other relevant details: