Releases: NVIDIA/cccl

v3.3.3

20 Apr 18:06
Immutable release. Only release title and notes can be modified.
af8cce4

What's Changed

🔄 Other Changes

  • Bump branch/3.3.x to 3.3.3. by @wmaxey in #8409
  • [Backport branch/3.3.x] [libcu++] Add missing braces suppression to other mempool types by @github-actions[bot] in #8166
  • [Backport branch/3.3.x] Fix order of _CCCL_API and CCCL_DEPRECATED by @github-actions[bot] in #8390
  • [backport 3.3] Fix family arch specific feature detection in <nv/target> (#8027) by @davebayer in #8294
  • [Backport branch/3.3.x] Fix codegen in 128bit atomic CAS by @github-actions[bot] in #8408
  • [Backport branch/3.3.x] [libcu++] Add missing bit_cast in the buffer construction (#8420) by @pciolkosz in #8425

Full Changelog: v3.3.2...v3.3.3

v3.3.2

14 Apr 15:19
8768676

What's Changed

🔄 Other Changes

  • Bump branch/3.3.x to 3.3.2. by @wmaxey in #7992
  • [Backport to 3.3]: Support non-copyable stream types in DeviceTransform (#7915) by @bernhardmgruber in #8011
  • [Backport branch/3.3.x] Support DLPack inclusion for both <dlpack/dlpack.h> and <dlpack.h> by @github-actions[bot] in #7910
  • [Backport branch/3.3.x] Add fallback for _CCCL_BUILTIN_EXPECT by @github-actions[bot] in #8049
  • [Backport 3.3] reformulate __as_type_list to avoid MSVC overload resolution bug (#7991) by @miscco in #8062
  • [Backport 3.3] Avoid deprecation warning with is_always_equal (#7674) by @miscco in #8078
  • [Backport branch/3.3.x] Fix use of EXPAND in token concatenation by @github-actions[bot] in #8077

Full Changelog: v3.3.1...v3.3.2

CCCL Python Libraries v0.6.0

09 Apr 13:27
318bef7

These are the release notes for the cuda-cccl Python package version 0.6.0, dated April 9th, 2026. The previous release was v0.5.1.

cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the install instructions here

API breaking changes

  • cuda.coop refactored to use maker factory functions (#7713)

Features

  • ShuffleIterator — New iterator type added to cuda.compute (#7721)
  • max_segment_size guarantee — Exposed in the public API (#8284)
  • LTO-IR support — Can now directly pass LTO-IR for custom operators (#7625)
  • Numba-optional install — Added a path to install cuda.compute without Numba as a dependency (#7633)

Performance

  • Faster TransformIterator construction (#7660)

Bug Fixes

  • Fix faulty pointer arithmetic in CUB dispatch (#7940)
  • Fix merge sort returning negative temp storage bytes (#7916)
  • Fix histogram build object caching when using privatized smem strategy (#7657)

v3.3.1

14 Apr 15:19
c262ef4

What's Changed

🔄 Other Changes

  • Bump 3.3.0 to 3.3.1. by @wmaxey in #7742
  • [Backport 3.3] #7787 and #7738 by @miscco in #7800
  • [Backport 3.3]: Avoid use of class static variable in device function (#7776) by @miscco in #7825
  • [Backport branch/3.3.x] Forward policy hub from dispatch_streaming_arg_reduce_t to reduce::dispatch by @github-actions[bot] in #7814
  • [Backport branch/3.3.x] cub: change {Lower,Upper}Bound to accept iterator and number of elements. by @github-actions[bot] in #7816
  • [Backport branch/3.3.x] Fix version guard for cudaDevAttrHostNumaMemoryPoolsSupported by @github-actions[bot] in #7842
  • [Backport 3.3] Buffer changes by @miscco in #7841
  • [Backport branch/3.3.x] [libcu++] Change default pool getters to return memory_pool_ref& by @github-actions[bot] in #7858
  • [Backport branch/3.3.x] Avoid compile issue with __iset by @github-actions[bot] in #7879
  • [Backport to 3.3] Require CUDA 12.9 for host numa implementation of pinned memory pool (#7856) by @pciolkosz in #7872
  • [Backport 3.3] Avoid GCC bug with dependent type template (#7857) by @miscco in #7860

Full Changelog: v3.3.0...v3.3.1

v3.3.0

27 Feb 22:39
09094af

What's Changed

📚 Libcudacxx

  • [libcudacxx] Fix a typo in the documentation by @caugonnet in #7330
  • Add a test for <nv/target> to validate old dialect support. by @wmaxey in #7241

🔄 Other Changes

Read more

v3.2.1

12 Feb 01:03
d84981c

What's Changed

🔄 Other Changes

  • Bump branch/3.2.x to 3.2.1. by @wmaxey in #7329
  • [Backport branch/3.2.x] Add accessor methods to shared_resource by @github-actions[bot] in #7322
  • [Backport branch/3.2.x] Fix clang warning about missing braces again by @github-actions[bot] in #7324
  • [Backport branch/3.2.x] part deux: make the abi of __basic_any compatible between c++17 and c++20 by @github-actions[bot] in #7421
  • [backport 3.2] Fix missing c2h symbol when compiling with clang-cuda (#7454) by @davebayer in #7600
  • [Backport branch/3.2.x] Remove recursion from __internal_is_address_from by @github-actions[bot] in #7573
  • [Backport branch/3.2.x] Fix ranges_overlap for nvc++ -cuda by @github-actions[bot] in #7598
  • [Backport branch/3.2.x] Fix cuda::device::current_arch_id by @github-actions[bot] in #7601
  • [Backport branch/3.2.x] Check for _GLIBCXX_USE_CXX11_ABI only when compiling with libstdc++ by @github-actions[bot] in #7630
  • [Backport branch/3.2.x] Fix cuda::barrier missing accounting of results in try_wait by @github-actions[bot] in #7634

Full Changelog: v3.2.0...v3.2.1

CCCL Python Libraries (v0.5.1)

07 Feb 10:24
37dc08c

These are the release notes for the cuda-cccl Python package version 0.5.1, dated February 6th, 2026. The previous release was v0.5.0.

cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the install instructions here

Features

Improvements

  • Restrict to numba-cuda less than 0.27 (#7529)

Bug Fixes

  • Fix caching of functions referencing numpy ufuncs (#7535)

CCCL Python Libraries (v0.5.0)

05 Feb 14:38
1836859

These are the release notes for the cuda-cccl Python package version 0.5.0, dated February 5th, 2026. The previous release was v0.4.5.

cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the install instructions here

⚠️ Breaking change

Object-based API requires passing operator to algorithm __call__ method

This API change affects only users of the object-based API (expert mode).

Previously, constructing an algorithm object required passing the operator as an argument, but invoking it did not:

# step 1: create algorithm object
transformer = cuda.compute.make_unary_transform(d_input, d_output, some_unary_op)

# step 2: invoke algorithm
transformer(d_in1, d_out1, num_items1)  # NOTE: not passing some_unary_op here

The new behaviour requires passing it in both places:

# step 1: create algorithm object
transformer = cuda.compute.make_unary_transform(d_input, d_output, some_unary_op)

# step 2: invoke algorithm
transformer(d_in1, d_out1, some_unary_op, num_items1)  # NOTE: need to pass some_unary_op here

This change is introduced because in many situations (such as in a loop), the operator itself and the globals/closures it references can change between construction and invocation (or between invocations).
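The caching consequence can be sketched in plain Python. This is a toy model of the object-based API shown above, not the cuda.compute implementation: because the operator is passed at call time, the algorithm object always sees the operator's current closure state.

```python
# Toy model (NOT the cuda.compute implementation) of an algorithm object
# that receives the operator at every invocation, so changed closure state
# is picked up between calls.

class UnaryTransform:
    def __init__(self):
        self._cache = {}  # per-operator "compiled" state, keyed on the callable

    def __call__(self, d_in, d_out, op, num_items):
        if op not in self._cache:          # "compile" on first sight
            self._cache[op] = f"compiled:{op.__name__}"
        for i in range(num_items):
            d_out[i] = op(d_in[i])

transform = UnaryTransform()
d_in, d_out = [1, 2, 3], [0, 0, 0]

scale = 10
def times_scale(x):
    return x * scale                       # closes over the global `scale`

transform(d_in, d_out, times_scale, 3)
assert d_out == [10, 20, 30]

scale = 100                                # closure state changed between calls
transform(d_in, d_out, times_scale, 3)
assert d_out == [100, 200, 300]
```

Had the operator been captured only at construction time, the second invocation could have reused a compilation of the stale closure; passing it on every call lets the library key its cache on the operator's current state.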

Features

Improvements

  • Avoid unnecessary recompilation of stateful operators (#7500)
  • Improved cache lookup performance (#7501)

Bug Fixes

  • Fix handling of boolean types in cuda.compute (#7389)

v3.2.0

05 Feb 21:55
477f8bc

The CCCL team is excited to announce the 3.2 release of the CUDA Core Compute Library (CCCL). Highlights include new modern CUDA C++ runtime APIs and new speed-of-light algorithms such as Top-K.

Modern CUDA C++ Runtime

CCCL 3.2 broadly introduces new, idiomatic C++ interfaces for core CUDA runtime and driver functionality.

If you’ve written CUDA C++ for a while, you’ve likely built (or adopted) some form of convenience wrapper around today’s C-style APIs such as cudaMalloc or cudaStreamCreate.

The new APIs added in CCCL 3.2 are meant to provide the productivity and safety benefits of C++ for core CUDA constructs so you can spend less time reinventing wrappers and more time writing kernels and algorithms.

Highlights:

  • New convenient vocabulary types for core CUDA concepts (cuda::stream, cuda::event, cuda::arch_traits)
  • Easier memory management with Memory Resources and cuda::buffer
  • More powerful and convenient kernel launch with cuda::launch

Example (vector add, revisited):

cuda::device_ref device = cuda::devices[0];
cuda::stream stream{device};
auto pool = cuda::device_default_memory_pool(device);

int num_elements = 1000;
auto A = cuda::make_buffer<float>(stream, pool, num_elements, 1.0);
auto B = cuda::make_buffer<float>(stream, pool, num_elements, 2.0);
auto C = cuda::make_buffer<float>(stream, pool, num_elements, cuda::no_init);

constexpr int threads_per_block = 256;
auto config = cuda::distribute<threads_per_block>(num_elements);
auto kernel = [] __device__ (auto config, cuda::std::span<const float> A, 
                                            cuda::std::span<const float> B, 
                                            cuda::std::span<float> C){
    auto tid = cuda::gpu_thread.rank(cuda::grid, config);
    if (tid < A.size())
        C[tid] = A[tid] + B[tid];
};
cuda::launch(stream, config, kernel, config, A, B, C);

(Try this example live on Compiler Explorer!)

A forthcoming blog post will go deeper into the details, the design goals, intended usage patterns, and how these new APIs fit alongside existing CUDA APIs.

New Algorithms

Top-K Selection

CCCL 3.2 introduces cub::DeviceTopK (for example, cub::DeviceTopK::MaxKeys) to select the K largest (or smallest) elements without sorting the entire input. For workloads where K is small, this can deliver up to 5X speedups over a full radix sort, and can reduce memory consumption when you don’t need sorted results.

Top‑K is an active area of ongoing work for CCCL: our roadmap includes planned segmented Top‑K as well as block‑scope and warp‑scope Top‑K variants. See what’s planned and tell us what Top‑K use cases matter most in CCCL GitHub issue #5673.

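The selection semantics can be sketched in plain Python via heapq. This mirrors what cub::DeviceTopK::MaxKeys computes; the actual CUB call signature and any output-ordering guarantees differ:

```python
import heapq

def top_k_max(keys, k):
    # Select the K largest keys without sorting the whole input;
    # a bounded heap does O(n log k) work instead of O(n log n).
    return heapq.nlargest(k, keys)

print(top_k_max([7, 42, 3, 19, 8, 56, 1], 3))  # [56, 42, 19]
```

Only the K selected elements are materialized, which is also why the device algorithm can cut memory consumption relative to a full radix sort.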

Fixed-size Segmented Reduction

CCCL 3.2 provides a new cub::DeviceSegmentedReduce variant that accepts a uniform segment_size, eliminating offset-iterator overhead in the common case where segments are fixed-size. This enables optimizations for both small segment sizes (up to 66x speedup) and large segment sizes (up to 14x).

// New API accepts fixed segment_size instead of per-segment begin/end offsets
cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, input, output,  
                                num_segments, segment_size); 
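To pin down the fixed-size semantics (a plain-Python sketch of the reduction being computed, not the CUB call signature): segment i covers the half-open slice [i * segment_size, (i + 1) * segment_size) of the input.

```python
def segmented_sum(values, num_segments, segment_size):
    # Uniform segments: segment i reduces values[i*segment_size:(i+1)*segment_size],
    # so no per-segment begin/end offset arrays are needed.
    return [sum(values[i * segment_size:(i + 1) * segment_size])
            for i in range(num_segments)]

print(segmented_sum([1, 2, 3, 4, 5, 6], num_segments=3, segment_size=2))  # [3, 7, 11]
```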

Additional New Algorithms in CCCL 3.2

Segmented Scan - cub::DeviceSegmentedScan provides a segmented version of a parallel scan that efficiently computes a scan operation over multiple independent segments.

Binary Search - cub::DeviceFind::LowerBound and cub::DeviceFind::UpperBound perform a parallel search for multiple values in an ordered sequence.

Search - cub::DeviceFind::FindIf searches the unordered input for the first element that satisfies a given condition. Thanks to its early-exit logic, it can be up to 7x faster than searching the entire sequence.
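The semantics of both search families, sketched in plain Python (bisect gives single-query lower/upper-bound behavior; the cub::DeviceFind APIs run such searches in parallel on the GPU):

```python
import bisect

haystack = [1, 3, 3, 5, 9]   # ordered input, as LowerBound/UpperBound require
queries = [3, 4]

lower = [bisect.bisect_left(haystack, q) for q in queries]   # first index with value >= q
upper = [bisect.bisect_right(haystack, q) for q in queries]  # first index with value > q
print(lower, upper)  # [1, 3] [3, 3]

# FindIf semantics: index of the first element satisfying a predicate in an
# unordered input -- finding an early match is what enables the early-exit speedup.
unordered = [9, 2, 7, 4]
first_even = next((i for i, v in enumerate(unordered) if v % 2 == 0), len(unordered))
print(first_even)  # 1
```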

Full Changelog: v3.1.4...v3.2.0

What's Changed

🚀 Thrust / CUB

libcu++

🤝 cuda.coop

  • Implement cuda.coop striped_to_blocked. by @tpn in #4662

🔄 Other Changes

  • Rework our signbit implementation to be potentially constexpr by @miscco in #5259
  • [CUDAX->libcu++] Move ensure_current_device to libcu++ and change the name to ensure_current_context by @pciolkosz in #5285
  • [Version] Update main to v3.2.0 by @github-actions[bot] in #5286
  • Rework our copysign implementation to be potentially constexpr by @miscco in #5287
  • Update NVBench by @bernhardmgruber in #5288
  • [CUDAX] Rename async_buffer::change_stream to set_stream and add a test by @pciolkosz in #5273
  • Extend and refactor transform overloads in CUDA system by @bernhardmgruber in #5238
  • Refactor c2h by @bernhardmgruber in #5205
  • Fix inplace_vector out of bounds access for at() by @Jacobfaib in #5295
  • Fix cudax test breaking main by @davebayer in #5301
  • [STF] Move occupancy calculation utility and support CUfunction by @caugonnet in #5236
  • [CUDAX->libcu++] Move stream and event from cudax to libcu++ by @pciolkosz in #5293
  • Port thrust::transform_input_output_iterator to cuda by @miscco in #5113
  • Implement format.arguments and format.context from standard formatting library by @davebayer in #5217
  • Initial migration of cuco hasher to cudax by @srinivasyadav18 in #4898
  • CUB - Add internal integer utils and tests (Split WarpReduce PR) by @fbusato in #5314
  • Skip zero values in fast_mod_div unit test by @fbusato in #5307
  • Fix cuda::static_for noexcept definition by @davebayer in #5303
  • Add sm90 tunings for RFA F32 by @srinivasyadav18 in #5269
  • Add and use new artifact/workflow functionality for CI scripts. by @alliepiper in #4861
  • Add gitlab devcontainers by @wmaxey in #5325
  • Remove mentions of CUDA experimental that sneaked into libcu++ by @pciolkosz in #5306
  • Add a macro to disable PDL by @bernhardmgruber in #5316
  • Move aligned_size_t, get_device_address and discard_memory to cuda/__memory/ by @davebayer in #5239
  • Adds tests for large number of items to DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #5251
  • [libcu++] Deprecate default stream_ref constructor an...
Read more

CCCL Python Libraries (v0.4.5)

23 Jan 16:12
0d8a35a

These are the release notes for the cuda-cccl Python package version 0.4.5, dated January 23rd, 2026. The previous release was v0.4.4.

cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the install instructions here

Features

  • Add cuda.compute APIs for upper_bound and lower_bound (#7250)
  • Support lambdas as operators in cuda.compute (#7058)

Improvements

  • Consolidate caching logic across cuda.compute algorithms (#7281)
  • Allow multiple uses of the same function in one compilation (#7072)
  • Make cuda.compute importable in CPU-only environments (#7171)
  • Improve cuda.compute documentation (#7061)
  • Update Python package versioning flow (96f98db)

Bug Fixes

  • Fix deferred annotations handling (#7321, #7121)
  • Disable LDL/STL checks to avoid NVRTC 13.1 failures (#7054)
  • Fix documentation build issues (#7122)
  • Fix Python-related docs (#7052)