
[Roadmap]: NCCL Roadmap Q1 2026 #1995

@gab9talavera


NCCL Roadmap (Q1 2026)

This issue tracks planned NCCL development and releases for Q1 2026.
Plans are subject to change as the team iterates and receives feedback.
If you have suggestions for features, please open a feature request or comment below.


Recently Released: NCCL 2.29

Release Highlights

  • Device API Improvements: Adds versioned Device API structs, ncclCommQueryProperties, and host-accessible pointers from symmetric windows for better compatibility and feature discovery.
  • New One-Sided Host APIs: Introduces zero-SM one-sided operations (for example ncclPutSignal / ncclWaitSignal) over NVLink and network using copy engines and a CPU proxy.
  • NCCL4Py: Provides a Pythonic NCCL API with host collectives and P2P, CUDA Python interoperability, and automatic cleanup of NCCL-managed resources.
  • LLVM IR Support: Exposes NCCL Device APIs via LLVM IR bitcode so JITs, DSLs, and other compilers can generate NCCL-enabled kernels.
  • Hybrid LSA+GIN AllGather Kernel: Adds an NVLS multicast + ring symmetric kernel to improve AllGather performance and scalability when symmetric registration and GIN are available.
  • ncclCommGrow API: Enables dynamically growing communicators (together with ncclCommShrink) to support elastic training and recovery of failed ranks.
  • Multi-segment Registration: Allows multiple physical segments to back one virtual address range, enabling expandable segments and more flexible buffer layouts (see the sketch after this list).
  • AllGatherV Scalability: Adds a scalable allgatherv path with new scheduling and kernels for better large-scale performance.
  • Debuggability & Observability: Enhances RAS and Inspector with real-time peer status, Prometheus output, and profiler support for CE-based collectives.
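
As a rough illustration of the multi-segment registration item above, the sketch below backs one reserved virtual address range with two physical allocations using the CUDA VMM driver API and then registers the whole range with ncclCommRegister. All calls shown are existing public APIs; the segment sizes, handle type, and function name are illustrative assumptions, and error checking is omitted.

```c
#include <cuda.h>
#include <nccl.h>

/* Sketch: back one virtual address range with two physical segments,
 * then register the whole range with NCCL. ncclCommRegister has existed
 * since NCCL 2.19; multi-segment backing is the 2.29 item above.
 * `comm` and `dev` are assumed to be an initialized communicator and
 * its CUDA device ordinal; error checking is omitted for brevity. */
void register_multi_segment(ncclComm_t comm, int dev, void** regHandle) {
  CUmemAllocationProp prop = {0};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = dev;
  prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

  size_t gran = 0;
  cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
  size_t segSize = gran;            /* one granule per segment (illustrative) */
  size_t total   = 2 * segSize;

  CUdeviceptr va;
  cuMemAddressReserve(&va, total, gran, 0, 0);   /* one VA range ...          */

  CUmemGenericAllocationHandle seg0, seg1;       /* ... two physical segments */
  cuMemCreate(&seg0, segSize, &prop, 0);
  cuMemCreate(&seg1, segSize, &prop, 0);
  cuMemMap(va,           segSize, 0, seg0, 0);
  cuMemMap(va + segSize, segSize, 0, seg1, 0);

  CUmemAccessDesc access = {0};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(va, total, &access, 1);

  /* Register the whole (multi-segment) range as one NCCL user buffer. */
  ncclCommRegister(comm, (void*)va, total, regHandle);
}
```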

For a complete list of changes, see the detailed NCCL 2.29 release notes.


Q1 Roadmap – February ‘26–April ‘26

  • NCCL Windows Platform Support: Brings multi-GPU communication to Windows environments.
  • MPS Support: Enables efficient sharing and partitioning of GPU resources.
  • Elastic Buffer: Treats a large tensor as a multi‑segment window where a slice resides in GPU memory while the rest lives in host memory.
  • Dynamic offload of GPU memory: Moves communicator buffers from HBM to host memory when idle and restores them on reuse, improving throughput for memory-bound workloads. This is a collaboration with the Amem team (https://github.com/inclusionAI/asystem-amem).
  • GIN support with > 4 contexts: Scales GPU‑initiated networking per communicator so kernels can drive many QPs per NIC from a single communicator for higher throughput.
  • Failover on CX8: Adds opt‑in port failover on dual‑port ConnectX‑8, transparently rerouting NCCL traffic to a healthy port when links fail.
  • CE-based collective for CUDA Graph + send/recv: Enables CE-based collectives and point-to-point operations to be captured in CUDA graphs (see the capture sketch after this list).
  • Profiler support for Device API: Adds visibility into device-initiated collectives so users can analyze and tune kernels that call the Device API.
  • NVLS + PAT: Extends PAT to multi-PPN NVLS scenarios for better performance on small and medium messages with multiple processes per node.
  • Device API timeouts: Introduces timeout handling for device-initiated operations to improve robustness and error reporting for long-running kernels.
  • Symmetric Kernels Tuning: Refines symmetric kernel parameters to reduce latency and improve throughput across message sizes.
  • ReduceScatter kernel using GIN: Adds a GIN-backed ReduceScatter kernel to lower latency on scale-out systems.
  • VA signals in GIN: Enables GIN to use virtual-address–based signals so kernels can signal and poll directly on data buffers.
  • Abort for symmetric kernels: Allows symmetric kernels to honor device abort flags so they can terminate cleanly when peers fail instead of hanging (the existing host-side abort flow is sketched after the disclaimer below).
  • Symmetric Kernels with Lamport support: Reworks symmetric kernels to use Lamport-style schemes, reducing synchronization overhead for low-latency collectives.
  • GIN support for rail-only systems: Enables efficient use of GPU-initiated networking on topologies without cross-rail connections.
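
For the CUDA Graph + send/recv item above, the sketch below captures a pair of point-to-point operations into a CUDA graph using existing host APIs (stream capture, ncclGroupStart/End, ncclSend/ncclRecv). Whether NCCL serves the captured operations through the CE-based path is the new behavior described in the roadmap item, not something the application selects here; the neighbor-exchange pattern and buffer names are illustrative assumptions.

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Sketch: capture a neighbor exchange (send right, receive from left)
 * into a CUDA graph, then instantiate and replay it. All calls below
 * are existing public APIs; error checking is omitted for brevity. */
void capture_p2p_graph(ncclComm_t comm, int rank, int nranks,
                       float* sendbuf, float* recvbuf, size_t count,
                       cudaStream_t stream, cudaGraphExec_t* graphExec) {
  cudaGraph_t graph;
  int right = (rank + 1) % nranks;
  int left  = (rank - 1 + nranks) % nranks;

  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  ncclGroupStart();
  ncclSend(sendbuf, count, ncclFloat, right, comm, stream);
  ncclRecv(recvbuf, count, ncclFloat, left,  comm, stream);
  ncclGroupEnd();
  cudaStreamEndCapture(stream, &graph);

  cudaGraphInstantiate(graphExec, graph, NULL, NULL, 0);
  cudaGraphDestroy(graph);

  /* Replay the captured exchange as many times as needed. */
  cudaGraphLaunch(*graphExec, stream);
}
```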

Disclaimer: Some of these features will be delivered in a follow-up update.
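
Related to the symmetric-kernel abort item in the list above, here is a minimal host-side sketch of the existing abort flow (ncclCommGetAsyncError plus ncclCommAbort) that device-side abort support would hook into. Both APIs already exist; the watchdog structure, polling interval, and function name are illustrative assumptions.

```c
#include <nccl.h>
#include <unistd.h>

/* Sketch: host-side watchdog that aborts a communicator when an
 * asynchronous error is reported. The roadmap item above is about
 * in-flight symmetric kernels observing this abort and exiting
 * instead of hanging. In real code the loop would also exit once
 * pending work completes. */
void poll_and_abort(ncclComm_t comm) {
  ncclResult_t asyncErr = ncclSuccess;
  for (;;) {
    ncclCommGetAsyncError(comm, &asyncErr);
    if (asyncErr != ncclSuccess && asyncErr != ncclInProgress) {
      /* Tear down outstanding operations; kernels that honor the
       * device abort flag should return instead of spinning. */
      ncclCommAbort(comm);
      break;
    }
    usleep(1000);  /* illustrative 1 ms polling interval */
  }
}
```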


Features Under Consideration

  • Python DSL support: Being scoped for integration with Python DSLs such as CuTe DSL and cuTile.
  • SM‑initiated CE collectives: Explores enabling SM‑initiated collectives on copy engines for additional overlap of compute and communication.
  • RAS APIs: Extends RAS beyond the standalone RAS client binary with APIs that can be integrated directly into applications for observability and improved debugging in production.
  • User Buffer Support for PAT: Reduces SM usage and speeds up data transfers by enabling 1 PPN NET user buffer registration for PAT (see the registration sketch after this list).
  • Receiver-side PXN: Improves topology-aware collectives by optimizing receive-side paths to reduce bottlenecks in all-to-all style traffic.
  • Unordered semantics for one-sided API: Extends one-sided on-stream operations with an unordered mode for transports that benefit from relaxed ordering.
  • SET signal operation: Adds a SET-style signal primitive in GIN so kernels can directly write signal values rather than only incrementing them.
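
For the PAT user-buffer item above, the sketch below shows the existing user buffer registration flow (ncclMemAlloc plus ncclCommRegister/ncclCommDeregister) that the proposed 1 PPN NET registration for PAT would build on. These host APIs already exist; the buffer size and helper function names are illustrative.

```c
#include <nccl.h>

/* Sketch: allocate a buffer with ncclMemAlloc and register it with a
 * communicator so collectives can use zero-copy / low-SM paths.
 * Extending registration benefits to PAT at 1 process per node is the
 * item under consideration above. Error checking is omitted. */
void register_user_buffer(ncclComm_t comm, size_t bytes,
                          void** buf, void** handle) {
  ncclMemAlloc(buf, bytes);                     /* NCCL-friendly allocation  */
  ncclCommRegister(comm, *buf, bytes, handle);  /* register with this comm   */
  /* ... run collectives using *buf as the send/recv buffer ... */
}

void release_user_buffer(ncclComm_t comm, void* buf, void* handle) {
  ncclCommDeregister(comm, handle);
  ncclMemFree(buf);
}
```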

Let us know how to improve or prioritize these features for distributed and multi-GPU workloads; contributions, feedback, and discussion are welcome.
