Implement a SYCL Onesweep Radix Sort KT by mmichel11 · Pull Request #2575 · uxlfoundation/oneDPL

mmichel11 · 2026-02-06T22:07:02Z

This PR adds a SYCL KT based on Onesweep. It adapts the ESIMD implementation with the following changes:

The histogram kernel has been optimized to reduce GRF usage in order to process all data in a single pass instead of two. An SLM atomic based approach with duplicate binning is used. The speedup over large inputs scales to be ~2x faster ESIMD.
Thread (sub-group) bincount offsets are removed from the onesweep kernel due to performance reasons and programming model differences. Sub-group ballot is used to count within a sub-group.
Cooperative groups are used via the SYCL sycl_ext_oneapi_forward_progress extension and iterative decoupled lookback is performed to guarantee hardware safety. This resolves catastrophic errors encountered during BMG stress testing and slightly improves performance for smaller multi-work group cases due to the removal of the work group atomic id counter.
For single work-group, oneDPL sort is used with some small sub-group size changes to avoid work-group size limitations on PVC (encountered at runtime).

Other relevant details:

The diff is very large in part due to the restructuring of the ESIMD kernels. Please note most of these changes are just indentation and the only "real" ESIMD changes are the conversions from functions to structs with tag dispatch and the unification of dispatchers / submitters to share code with the SYCL version.
Testing is unified between ESIMD / SYCL.

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

* Implements two component histogram: SLM -> global memory with duplicate SLM bins to reduce atomic contention Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

…ons, and fix slm test bug

…atch

This reverts commit 6dd2fb8.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

…pinning lookback

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Copilot

Pull request overview

This PR adds a new SYCL implementation of the onesweep radix sort kernel template, refactors shared ESIMD/SYCL infrastructure, and unifies the KT test harness to run against either backend.

Changes:

Introduces SYCL onesweep radix sort implementation and integrates it into the kernel templates public header.
Refactors ESIMD radix sort internals into shared dispatcher/submitter/kernel components with tag-based dispatch.
Updates KT tests and CMake generation to build/run both ESIMD and SYCL variants via a unified test source set.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
test/kt/single_pass_scan.cpp	Switches test include to unified radix sort KT test utilities header.
test/kt/radix_sort_utils.h	Adds backend namespace aliases and new backend-aware SLM sizing logic for test skipping.
test/kt/radix_sort_out_of_place.cpp	Updates tests to call backend-selected namespace (ESIMD vs SYCL) and removes local can_run_test.
test/kt/radix_sort_by_key_out_of_place.cpp	Updates by-key out-of-place tests to call backend-selected namespace.
test/kt/radix_sort_by_key.cpp	Updates by-key in-place tests to call backend-selected namespace.
test/kt/radix_sort.cpp	Updates key-only tests to backend-selected namespace and adds ESIMD-only deprecated-namespace coverage path.
test/kt/CMakeLists.txt	Generalizes sort test generation to support both ESIMD and SYCL variants with backend compile definitions.
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort_one_wg.h	Extends subgroup radix sort to support explicit destination output range.
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort.h	Updates callers to pass a destination range to one-work-group radix sort path.
include/oneapi/dpl/experimental/kt/sycl_radix_sort.h	Adds the public SYCL KT API surface for radix_sort and radix_sort_by_key (in-/out-of-place).
include/oneapi/dpl/experimental/kt/internal/sycl_radix_sort_kernels.h	Adds SYCL onesweep kernels (global histogram + onesweep reorder/lookback) implementation.
include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h	Updates sub-group scan helper to accept either lazy storage or plain types and changes backend include.
include/oneapi/dpl/experimental/kt/internal/radix_sort_utils.h	Adds shared SYCL scalar utilities, tags, and parameter validation used by both backends.
include/oneapi/dpl/experimental/kt/internal/radix_sort_submitters.h	Adds shared submitters/launch logic for ESIMD and SYCL kernels, including one-WG fallback.
include/oneapi/dpl/experimental/kt/internal/radix_sort_kernels.h	Adds shared forward declarations of kernel functors for tag-based compilation.
include/oneapi/dpl/experimental/kt/internal/radix_sort_dispatchers.h	Refactors dispatchers into shared namespace with tag dispatch for ESIMD vs SYCL.
include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_utils.h	Moves ESIMD internals into shared kt::gpu::__impl namespace for unified use.
include/oneapi/dpl/experimental/kt/internal/esimd_defs.h	Moves ESIMD constants into shared kt::gpu::__impl namespace for unified use.
include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_kernels.h	Converts ESIMD free functions into tagged kernel functors and unifies namespace/internals.
include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_submitters.h	Removes the now-replaced ESIMD-only submitters header (replaced by unified submitters).
include/oneapi/dpl/experimental/kt/esimd_radix_sort.h	Updates ESIMD public API to route through shared dispatchers with explicit __esimd_tag.
include/oneapi/dpl/experimental/kernel_templates	Exposes the new SYCL radix sort KT header in the kernel templates umbrella include.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

include/oneapi/dpl/experimental/kt/internal/sycl_radix_sort_kernels.h

Copilot · 2026-02-09T15:31:06Z

include/oneapi/dpl/experimental/kt/internal/sycl_radix_sort_kernels.h

+            bool __is_leader = __this_round_rank == __this_round_count - 1;
+            if (__is_leader)
+            {
+                __slm_offset[__bin] = __rank_after + 1;
+            }


__slm_offset[__bin] is read by multiple work-items and written by one work-item without any sub-group synchronization or atomic semantics. Under SYCL’s memory model this is a data race (concurrent read/write to local memory), even if it “usually works” on lockstep implementations. A tangible fix is to separate the read and the leader write with sub-group barriers (so no reads overlap the write), or to use sycl::atomic_ref for the leader update and arrange the rank computation so readers don’t race with the writer.

Suggested change

bool __is_leader = __this_round_rank == __this_round_count - 1;

if (__is_leader)

{

__slm_offset[__bin] = __rank_after + 1;

}

bool __is_leader = __this_round_rank == __this_round_count - 1;

// Ensure all work-items have completed their read of __slm_offset[__bin]

// before any work-item performs the write, and that all see the updated

// value before the next iteration.

sycl::group_barrier(__sub_group);

if (__is_leader)

{

__slm_offset[__bin] = __rank_after + 1;

}

sycl::group_barrier(__sub_group);

Copilot · 2026-02-09T15:31:06Z

test/kt/radix_sort_utils.h

+    // ESIMD kernel uses simple reorder buffer
    std::size_t slm_alloc_size = sizeof(KeyT) * param.data_per_workitem * param.workgroup_size;
    if constexpr (!std::is_void_v<ValueT>)
        slm_alloc_size += sizeof(ValueT) * param.data_per_workitem * param.workgroup_size;


The ESIMD SLM estimate here only accounts for the reorder buffer, but the ESIMD one-WG path explicitly initializes SLM with max(reorder, bin_hist + incoming_offsets) (see __one_wg_kernel in esimd_radix_sort_kernels.h). This can cause can_run_test to incorrectly allow configurations that actually exceed device local memory and trigger LLVM “SLM size exceeds target limits” errors. Update the ESIMD branch to mirror the kernel’s SLM requirement (at least the max(reorder, bin_hist + incoming) sizing, and include value storage where applicable).

Suggested change

// ESIMD kernel uses simple reorder buffer

std::size_t slm_alloc_size = sizeof(KeyT) * param.data_per_workitem * param.workgroup_size;

if constexpr (!std::is_void_v<ValueT>)

slm_alloc_size += sizeof(ValueT) * param.data_per_workitem * param.workgroup_size;

// ESIMD kernel SLM layout (one-WG path) uses:

// max(reorder_buffer, bin_hist + incoming_offsets)

//

// Reorder buffer for keys (and optionally values)

std::size_t reorder_size = sizeof(KeyT) * param.data_per_workitem * param.workgroup_size;

if constexpr (!std::is_void_v<ValueT>)

reorder_size += sizeof(ValueT) * param.data_per_workitem * param.workgroup_size;

// Model bin histogram and incoming offsets for 8-bit radix (256 bins).

// This mirrors the minimum structure used in the ESIMD one-WG kernel.

using _OffsetT = std::uint32_t;

constexpr std::size_t bin_count = 1u << 8; // 8 radix bits

const std::size_t bin_hist_size = bin_count * sizeof(_OffsetT);

const std::size_t incoming_offsets_size = bin_count * sizeof(_OffsetT);

const std::size_t slm_alloc_size =

std::max(reorder_size, bin_hist_size + incoming_offsets_size);

Copilot · 2026-02-09T15:31:06Z

test/kt/CMakeLists.txt

+    set(_base_file_by_key_all "radix_sort_by_key" "radix_sort_by_key_out_of_place")
+
+    # Variant-specific configurations
+    if (${_variant} STREQUAL "sycl")


In CMake, dereferencing in if (${_variant} STREQUAL "sycl") can misbehave when the variable is empty or contains list separators. Prefer quoting the dereference (e.g., if("${_variant}" STREQUAL "sycl")) to avoid unexpected parsing and to align with common CMake best practices.

Suggested change

if (${_variant} STREQUAL "sycl")

if ("${_variant}" STREQUAL "sycl")

Copilot · 2026-02-09T15:31:07Z

test/kt/CMakeLists.txt

                    endforeach()
                else()
                    foreach (_base_file ${_base_file_all})
                        # ~1-2 test: (10/1000) * (16 * 6 * 2)


The comment describing the expected test count/probability still references (10/1000) but the actual probability passed is now 2. Please update the comment to match the new probability to avoid confusion when interpreting test generation coverage.

Suggested change

# ~1-2 test: (10/1000) * (16 * 6 * 2)

# ~1-2 test: (2/1000) * (16 * 6 * 2)

danhoeflinger and others added 30 commits January 6, 2026 14:45

initial esimd impl

3aa3f66

refactor with shared infrastructure

e5c02a0

minor fix

4bb6832

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

filling in testing infrastructure for sycl impl

af70c7e

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

revert erroneous change

ba742e3

onesweep: implement SYCL histogram kernel (#2559)

3ea8511

* Implements two component histogram: SLM -> global memory with duplicate SLM bins to reduce atomic contention Signed-off-by: Matthew Michel <matthew.michel@intel.com>

onesweep feature branch: implement SYCL onesweep kernel (#2567)

b514104

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

initial cleanup, break-up onesweep into smaller understandable functi…

a4f71b1

…ons, and fix slm test bug

single work group with subgroup sort

9ddf218

extend single work group dpl sort to support out of place

c9fc350

remove in place template

be5c7b6

update checks on params

d954f24

make global_histogram in sycl a struct

a94c2ea

Specializations of operator() for submitters instead of function disp…

1e05453

…atch

remove obsolete todos

c781586

unified test structure

82988e3

test util class

f264928

remove duplicate tests

969a3c2

update includes

cbd4970

fix xe2 hangs with atomic fence after batches of relaxed loads

338e485

fix scan kt include

9c650af

workaround issue with work group size 1024 on xe

895848a

extend work group support

62c2421

fix unskipped test cases when slm requirement is too big

e82313f

cleanup

db5abfc

remove unrelated changes

e9bdfb7

more cleanup

74728de

::std -> std

6dd2fb8

Revert "::std -> std"

c27f6f1

This reverts commit 6dd2fb8.

limit wg size to 1024 due to observed crashes with 256

4cd27eb

mmichel11 added 14 commits February 4, 2026 17:09

::std -> std changes in diff only

c997796

restore change from main

6405d9a

remove redundant SYCL queries

887ff57

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

remove auto and restructuring

3a27616

coop prototype

e9dfd2f

Full development of cooperative impl with driver bug workaround

a590405

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Add work group sizes 256, 512 as tested

e453e27

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Better workaround for driver bug and fix suspected codegen issue in s…

6d7d20a

…pinning lookback

Formatting and work group restrictions

549e5bc

adjust slm approximation in testing

e9af095

Remove duplication in tests

47ed102

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Remove atomic id ptr allocation

54cec3e

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Remove work group size of 256 for now

319e59b

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

copyright updates

2ee20cf

mmichel11 added this to the 2022.12.0 milestone Feb 6, 2026

clang-format and minor cleanup

6f8c98e

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

mmichel11 changed the title ~~Implement a SYCL Onesweep KT~~ Implement a SYCL Onesweep Radix Sort KT Feb 6, 2026

mmichel11 marked this pull request as ready for review February 9, 2026 14:31

mmichel11 requested review from MikeDvorskiy, SergeyKopienko, adamfidel, akukanov, Copilot, danhoeflinger and dmitriy-sobolev February 9, 2026 15:26

Copilot AI reviewed Feb 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a SYCL Onesweep Radix Sort KT#2575

Implement a SYCL Onesweep Radix Sort KT#2575
mmichel11 wants to merge 45 commits intomainfrom
dev/mmichel11/onesweep_kt_coop

mmichel11 commented Feb 6, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI Feb 9, 2026

Uh oh!

Copilot AI Feb 9, 2026

Uh oh!

Copilot AI Feb 9, 2026

Uh oh!

Copilot AI Feb 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-    // ESIMD kernel uses simple reorder buffer
-    std::size_t slm_alloc_size = sizeof(KeyT) * param.data_per_workitem * param.workgroup_size;
-    if constexpr (!std::is_void_v<ValueT>)
-        slm_alloc_size += sizeof(ValueT) * param.data_per_workitem * param.workgroup_size;
+    // ESIMD kernel SLM layout (one-WG path) uses:
+    //   max(reorder_buffer, bin_hist + incoming_offsets)
+    //
+    // Reorder buffer for keys (and optionally values)
+    std::size_t reorder_size = sizeof(KeyT) * param.data_per_workitem * param.workgroup_size;
+    if constexpr (!std::is_void_v<ValueT>)
+        reorder_size += sizeof(ValueT) * param.data_per_workitem * param.workgroup_size;
+    // Model bin histogram and incoming offsets for 8-bit radix (256 bins).
+    // This mirrors the minimum structure used in the ESIMD one-WG kernel.
+    using _OffsetT = std::uint32_t;
+    constexpr std::size_t bin_count = 1u << 8; // 8 radix bits
+    const std::size_t bin_hist_size = bin_count * sizeof(_OffsetT);
+    const std::size_t incoming_offsets_size = bin_count * sizeof(_OffsetT);
+    const std::size_t slm_alloc_size =
+        std::max(reorder_size, bin_hist_size + incoming_offsets_size);

	if (${_variant} STREQUAL "sycl")
	if ("${_variant}" STREQUAL "sycl")

	# ~1-2 test: (10/1000) * (16 * 6 * 2)
	# ~1-2 test: (2/1000) * (16 * 6 * 2)

Conversation

mmichel11 commented Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mmichel11 commented Feb 6, 2026 •

edited

Loading