Releases: NVIDIA/cccl

v3.3.3

20 Apr 18:06
Immutable release. Only release title and notes can be modified.
af8cce4

What's Changed

🔄 Other Changes

  • Bump branch/3.3.x to 3.3.3. by @wmaxey in #8409
  • [Backport branch/3.3.x] [libcu++] Add missing braces suppression to other mempool types by @github-actions[bot] in #8166
  • [Backport branch/3.3.x] Fix order of _CCCL_API and CCCL_DEPRECATED by @github-actions[bot] in #8390
  • [backport 3.3] Fix family arch specific feature detection in <nv/target> (#8027) by @davebayer in #8294
  • [Backport branch/3.3.x] Fix codegen in 128bit atomic CAS by @github-actions[bot] in #8408
  • [Backport branch/3.3.x] [libcu++] Add missing bit_cast in the buffer construction (#8420) by @pciolkosz in #8425

Full Changelog: v3.3.2...v3.3.3

v3.3.2

14 Apr 15:19
8768676

What's Changed

🔄 Other Changes

  • Bump branch/3.3.x to 3.3.2. by @wmaxey in #7992
  • [Backport to 3.3]: Support non-copyable stream types in DeviceTransform (#7915) by @bernhardmgruber in #8011
  • [Backport branch/3.3.x] Support DLPack inclusion for both <dlpack/dlpack.h> and <dlpack.h> by @github-actions[bot] in #7910
  • [Backport branch/3.3.x] Add fallback for _CCCL_BUILTIN_EXPECT by @github-actions[bot] in #8049
  • [Backport 3.3] reformulate __as_type_list to avoid MSVC overload resolution bug (#7991) by @miscco in #8062
  • [Backport 3.3] Avoid deprecation warning with is_always_equal (#7674) by @miscco in #8078
  • [Backport branch/3.3.x] Fix use of EXPAND in token concatenation by @github-actions[bot] in #8077

Full Changelog: v3.3.1...v3.3.2

CCCL Python Libraries v0.6.0

09 Apr 13:27
318bef7

These are the release notes for the cuda-cccl Python package version 0.6.0, dated April 9th, 2026. The previous release was v0.5.1.

cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the install instructions here

API breaking changes

  • cuda.coop refactored to use maker factory functions (#7713)

Features

  • ShuffleIterator — New iterator type added to cuda.compute (#7721)
  • max_segment_size guarantee — Exposed in the public API (#8284)
  • LTO-IR support — Can now directly pass LTO-IR for custom operators (#7625)
  • Numba-optional install — Added a path to install cuda.compute without Numba as a dependency (#7633)

Performance

  • Faster TransformIterator construction (#7660)

Bug Fixes

  • Fix faulty pointer arithmetic in CUB dispatch (#7940)
  • Fix merge sort returning negative temp storage bytes (#7916)
  • Fix histogram build object caching when using privatized smem strategy (#7657)

v3.3.1

14 Apr 15:19
c262ef4

What's Changed

🔄 Other Changes

  • Bump 3.3.0 to 3.3.1. by @wmaxey in #7742
  • [Backport 3.3] #7787 and #7738 by @miscco in #7800
  • [Backport 3.3]: Avoid use of class static variable in device function (#7776) by @miscco in #7825
  • [Backport branch/3.3.x] Forward policy hub from dispatch_streaming_arg_reduce_t to reduce::dispatch by @github-actions[bot] in #7814
  • [Backport branch/3.3.x] cub: change {Lower,Upper}Bound to accept iterator and number of elements. by @github-actions[bot] in #7816
  • [Backport branch/3.3.x] Fix version guard for cudaDevAttrHostNumaMemoryPoolsSupported by @github-actions[bot] in #7842
  • [Backport 3.3] Buffer changes by @miscco in #7841
  • [Backport branch/3.3.x] [libcu++] Change default pool getters to return memory_pool_ref& by @github-actions[bot] in #7858
  • [Backport branch/3.3.x] Avoid compile issue with __iset by @github-actions[bot] in #7879
  • [Backport to 3.3] Require CUDA 12.9 for host numa implementation of pinned memory pool (#7856) by @pciolkosz in #7872
  • [Backport 3.3] Avoid GCC bug with dependent type template (#7857) by @miscco in #7860

Full Changelog: v3.3.0...v3.3.1

v3.3.0

27 Feb 22:39
09094af

What's Changed

📚 Libcudacxx

  • [libcudacxx] Fix a typo in the documentation by @caugonnet in #7330
  • Add a test for <nv/target> to validate old dialect support. by @wmaxey in #7241

🔄 Other Changes

Read more

v3.2.1

12 Feb 01:03
d84981c

What's Changed

🔄 Other Changes

  • Bump branch/3.2.x to 3.2.1. by @wmaxey in #7329
  • [Backport branch/3.2.x] Add accessor methods to shared_resource by @github-actions[bot] in #7322
  • [Backport branch/3.2.x] Fix clang warning about missing braces again by @github-actions[bot] in #7324
  • [Backport branch/3.2.x] part deux: make the abi of __basic_any compatible between c++17 and c++20 by @github-actions[bot] in #7421
  • [backport 3.2] Fix missing c2h symbol when compiling with clang-cuda (#7454) by @davebayer in #7600
  • [Backport branch/3.2.x] Remove recursion from __internal_is_address_from by @github-actions[bot] in #7573
  • [Backport branch/3.2.x] Fix ranges_overlap for nvc++ -cuda by @github-actions[bot] in #7598
  • [Backport branch/3.2.x] Fix cuda::device::current_arch_id by @github-actions[bot] in #7601
  • [Backport branch/3.2.x] Check for _GLIBCXX_USE_CXX11_ABI only when compiling with libstdc++ by @github-actions[bot] in #7630
  • [Backport branch/3.2.x] Fix cuda::barrier missing accounting of results in try_wait by @github-actions[bot] in #7634

Full Changelog: v3.2.0...v3.2.1

CCCL Python Libraries (v0.5.1)

07 Feb 10:24
37dc08c

These are the release notes for the cuda-cccl Python package version 0.5.1, dated February 6th, 2026. The previous release was v0.5.0.

cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the install instructions here

Features

Improvements

  • Restrict to numba-cuda less than 0.27 (#7529)

Bug Fixes

  • Fix caching of functions referencing numpy ufuncs (#7535)

CCCL Python Libraries (v0.5.0)

05 Feb 14:38
1836859

These are the release notes for the cuda-cccl Python package version 0.5.0, dated February 5th, 2026. The previous release was v0.4.5.

cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the install instructions here

⚠️ Breaking change

Object-based API requires passing operator to algorithm __call__ method

This API change affects only users of the object-based API (expert mode).

Previously, constructing an algorithm object required passing the operator as an argument, but invoking it did not:

# step 1: create algorithm object
transformer = cuda.compute.make_unary_transform(d_input, d_output, some_unary_op)

# step 2: invoke algorithm
transformer(d_in1, d_out1, num_items1)  # NOTE: not passing some_unary_op here

The new behaviour requires passing it in both places:

# step 1: create algorithm object
transformer = cuda.compute.make_unary_transform(d_input, d_output, some_unary_op)

# step 2: invoke algorithm
transformer(d_in1, d_out1, some_unary_op, num_items1)  # NOTE: need to pass some_unary_op here

This change is introduced because in many situations (such as in a loop), the operator itself and the globals/closures it references can change between construction and invocation (or between invocations).
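The caching consequence can be sketched in plain Python. This is a toy model of the object-based API shown above, not the cuda.compute implementation: because the operator is passed at call time, the algorithm object always sees the operator's current closure state.

```python
# Toy model (NOT the cuda.compute implementation) of an algorithm object
# that receives the operator at every invocation, so changed closure state
# is picked up between calls.

class UnaryTransform:
    def __init__(self):
        self._cache = {}  # per-operator "compiled" state, keyed on the callable

    def __call__(self, d_in, d_out, op, num_items):
        if op not in self._cache:          # "compile" on first sight
            self._cache[op] = f"compiled:{op.__name__}"
        for i in range(num_items):
            d_out[i] = op(d_in[i])

transform = UnaryTransform()
d_in, d_out = [1, 2, 3], [0, 0, 0]

scale = 10
def times_scale(x):
    return x * scale                       # closes over the global `scale`

transform(d_in, d_out, times_scale, 3)
assert d_out == [10, 20, 30]

scale = 100                                # closure state changed between calls
transform(d_in, d_out, times_scale, 3)
assert d_out == [100, 200, 300]
```

Had the operator been captured only at construction time, the second invocation could have reused a compilation of the stale closure; passing it on every call lets the library key its cache on the operator's current state.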

Features

Improvements

  • Avoid unnecessary recompilation of stateful operators (#7500)
  • Improved cache lookup performance (#7501)

Bug Fixes

  • Fix handling of boolean types in cuda.compute (#7389)

v3.2.0

05 Feb 21:55
477f8bc

The CCCL team is excited to announce the 3.2 release of the CUDA Core Compute Library (CCCL). Highlights include new modern CUDA C++ runtime APIs and new speed-of-light algorithms such as Top-K.

Modern CUDA C++ Runtime

CCCL 3.2 broadly introduces new, idiomatic C++ interfaces for core CUDA runtime and driver functionality.

If you’ve written CUDA C++ for a while, you’ve likely built (or adopted) some form of convenience wrapper around today’s C-style APIs such as cudaMalloc or cudaStreamCreate.

The new APIs added in CCCL 3.2 are meant to provide the productivity and safety benefits of C++ for core CUDA constructs so you can spend less time reinventing wrappers and more time writing kernels and algorithms.

Highlights:

  • New convenient vocabulary types for core CUDA concepts (cuda::stream, cuda::event, cuda::arch_traits)
  • Easier memory management with Memory Resources and cuda::buffer
  • More powerful and convenient kernel launch with cuda::launch

Example (vector add, revisited):

cuda::device_ref device = cuda::devices[0];
cuda::stream stream{device};
auto pool = cuda::device_default_memory_pool(device);

int num_elements = 1000;
auto A = cuda::make_buffer<float>(stream, pool, num_elements, 1.0);
auto B = cuda::make_buffer<float>(stream, pool, num_elements, 2.0);
auto C = cuda::make_buffer<float>(stream, pool, num_elements, cuda::no_init);

constexpr int threads_per_block = 256;
auto config = cuda::distribute<threads_per_block>(num_elements);
auto kernel = [] __device__ (auto config, cuda::std::span<const float> A, 
                                            cuda::std::span<const float> B, 
                                            cuda::std::span<float> C){
    auto tid = cuda::gpu_thread.rank(cuda::grid, config);
    if (tid < A.size())
        C[tid] = A[tid] + B[tid];
};
cuda::launch(stream, config, kernel, config, A, B, C);

(Try this example live on Compiler Explorer!)

A forthcoming blog post will go deeper into the details, the design goals, intended usage patterns, and how these new APIs fit alongside existing CUDA APIs.

New Algorithms

Top-K Selection

CCCL 3.2 introduces cub::DeviceTopK (for example, cub::DeviceTopK::MaxKeys) to select the K largest (or smallest) elements without sorting the entire input. For workloads where K is small, this can deliver up to 5X speedups over a full radix sort, and can reduce memory consumption when you don’t need sorted results.

Top‑K is an active area of ongoing work for CCCL: our roadmap includes planned segmented Top‑K as well as block‑scope and warp‑scope Top‑K variants. See what’s planned and tell us what Top‑K use cases matter most in CCCL GitHub issue #5673.

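The selection semantics can be sketched in plain Python via heapq. This mirrors what cub::DeviceTopK::MaxKeys computes; the actual CUB call signature and any output-ordering guarantees differ:

```python
import heapq

def top_k_max(keys, k):
    # Select the K largest keys without sorting the whole input;
    # a bounded heap does O(n log k) work instead of O(n log n).
    return heapq.nlargest(k, keys)

print(top_k_max([7, 42, 3, 19, 8, 56, 1], 3))  # [56, 42, 19]
```

Only the K selected elements are materialized, which is also why the device algorithm can cut memory consumption relative to a full radix sort.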

Fixed-size Segmented Reduction

CCCL 3.2 provides a new cub::DeviceSegmentedReduce variant that accepts a uniform segment_size, eliminating offset-iterator overhead in the common case where segments are fixed-size. This enables optimizations for both small segment sizes (up to 66x speedup) and large segment sizes (up to 14x).

// New API accepts fixed segment_size instead of per-segment begin/end offsets
cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, input, output,  
                                num_segments, segment_size); 
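To pin down the fixed-size semantics (a plain-Python sketch of the reduction being computed, not the CUB call signature): segment i covers the half-open slice [i * segment_size, (i + 1) * segment_size) of the input.

```python
def segmented_sum(values, num_segments, segment_size):
    # Uniform segments: segment i reduces values[i*segment_size:(i+1)*segment_size],
    # so no per-segment begin/end offset arrays are needed.
    return [sum(values[i * segment_size:(i + 1) * segment_size])
            for i in range(num_segments)]

print(segmented_sum([1, 2, 3, 4, 5, 6], num_segments=3, segment_size=2))  # [3, 7, 11]
```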

Additional New Algorithms in CCCL 3.2

Segmented Scan - cub::DeviceSegmentedScan provides a segmented version of a parallel scan that efficiently computes a scan operation over multiple independent segments.

Binary Search - cub::DeviceFind::LowerBound and cub::DeviceFind::UpperBound perform a parallel search for multiple values in an ordered sequence.

Search - cub::DeviceFind::FindIf searches the unordered input for the first element that satisfies a given condition. Thanks to its early-exit logic, it can be up to 7x faster than searching the entire sequence.
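The semantics of both search families, sketched in plain Python (bisect gives single-query lower/upper-bound behavior; the cub::DeviceFind APIs run such searches in parallel on the GPU):

```python
import bisect

haystack = [1, 3, 3, 5, 9]   # ordered input, as LowerBound/UpperBound require
queries = [3, 4]

lower = [bisect.bisect_left(haystack, q) for q in queries]   # first index with value >= q
upper = [bisect.bisect_right(haystack, q) for q in queries]  # first index with value > q
print(lower, upper)  # [1, 3] [3, 3]

# FindIf semantics: index of the first element satisfying a predicate in an
# unordered input -- finding an early match is what enables the early-exit speedup.
unordered = [9, 2, 7, 4]
first_even = next((i for i, v in enumerate(unordered) if v % 2 == 0), len(unordered))
print(first_even)  # 1
```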

Full Changelog: v3.1.4...v3.2.0

What's Changed

🚀 Thrust / CUB

libcu++

🤝 cuda.coop

  • Implement cuda.coop striped_to_blocked. by @tpn in #4662

🔄 Other Changes

  • Rework our signbit implementation to be potentially constexpr by @miscco in #5259
  • [CUDAX->libcu++] Move ensure_current_device to libcu++ and change the name to ensure_current_context by @pciolkosz in #5285
  • [Version] Update main to v3.2.0 by @github-actions[bot] in #5286
  • Rework our copysign implementation to be potentially constexpr by @miscco in #5287
  • Update NVBench by @bernhardmgruber in #5288
  • [CUDAX] Rename async_buffer::change_stream to set_stream and add a test by @pciolkosz in #5273
  • Extend and refactor transform overloads in CUDA system by @bernhardmgruber in #5238
  • Refactor c2h by @bernhardmgruber in #5205
  • Fix inplace_vector out of bounds access for at() by @Jacobfaib in #5295
  • Fix cudax test breaking main by @davebayer in #5301
  • [STF] Move occupancy calculation utility and support CUfunction by @caugonnet in #5236
  • [CUDAX->libcu++] Move stream and event from cudax to libcu++ by @pciolkosz in #5293
  • Port thrust::transform_input_output_iterator to cuda by @miscco in #5113
  • Implement format.arguments and format.context from standard formatting library by @davebayer in #5217
  • Initial migration of cuco hasher to cudax by @srinivasyadav18 in #4898
  • CUB - Add internal integer utils and tests (Split WarpReduce PR) by @fbusato in #5314
  • Skip zero values in fast_mod_div unit test by @fbusato in #5307
  • Fix cuda::static_for noexcept definition by @davebayer in #5303
  • Add sm90 tunings for RFA F32 by @srinivasyadav18 in #5269
  • Add and use new artifact/workflow functionality for CI scripts. by @alliepiper in #4861
  • Add gitlab devcontainers by @wmaxey in #5325
  • Remove mentions of CUDA experimental that sneaked into libcu++ by @pciolkosz in #5306
  • Add a macro to disable PDL by @bernhardmgruber in #5316
  • Move aligned_size_t, get_device_address and discard_memory to cuda/__memory/ by @davebayer in #5239
  • Adds tests for large number of items to DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #5251
  • [libcu++] Deprecate default stream_ref constructor an...
Read more

CCCL Python Libraries (v0.4.5)

23 Jan 16:12
0d8a35a

These are the release notes for the cuda-cccl Python package version 0.4.5, dated January 23rd, 2026. The previous release was v0.4.4.

cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the install instructions here

Features

  • Add cuda.compute APIs for upper_bound and lower_bound (#7250)
  • Support lambdas as operators in cuda.compute (#7058)

Improvements

  • Consolidate caching logic across cuda.compute algorithms (#7281)
  • Allow multiple uses of the same function in one compilation (#7072)
  • Make cuda.compute importable in CPU-only environments (#7171)
  • Improve cuda.compute documentation (#7061)
  • Update Python package versioning flow (96f98db)

Bug Fixes

  • Fix deferred annotations handling (#7321, #7121)
  • Disable LDL/STL checks to avoid NVRTC 13.1 failures (#7054)
  • Fix documentation build issues (#7122)
  • Fix Python-related docs (#7052)