Add a rapids doctor check to verify CUDA toolkit libraries are findable and are version-consistent #139

@jayavenkatesh19

Description

RAPIDS libraries can often install successfully via pip or conda but fail at runtime because the underlying CUDA toolkit is not set up properly. Some scenarios I have personally come across:

  • Shared libraries (like libcudart.so or libnvrtc.so) are not findable at runtime. Pip wheels previously did not provide these (although with cupy 14, things look much better), and any setup with a preinstalled CUDA toolkit could be configured incorrectly.
  • The CUDA toolkit version (whether from pip/conda installations or from CUDA_HOME and /usr/local/cuda symlink resolution) does not match the GPU driver's CUDA version.
  • The scenario above is further exacerbated by libraries hardcoding /usr/local/cuda as a fallback search path, so a stale symlink loads the wrong libraries.
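
To make the stale-symlink scenario concrete, here is a minimal sketch of how the candidate toolkit locations could be collected and resolved. The function name `candidate_toolkit_roots` and its return shape are hypothetical and not part of rapids-cli; only the environment variable names and the /usr/local/cuda path come from the description above.

```python
import os
from pathlib import Path


def candidate_toolkit_roots(environ=None, default_symlink="/usr/local/cuda"):
    """Return (source, configured_path, resolved_path_or_None) tuples.

    Hypothetical helper for illustration only. A resolved path of None
    means the configured location does not exist, for example a dangling
    symlink or an environment variable pointing at a removed install.
    """
    environ = os.environ if environ is None else environ

    # Environment variables are only considered when set and non-empty.
    candidates = [
        (name, environ[name])
        for name in ("CUDA_HOME", "CUDA_PATH")
        if environ.get(name)
    ]
    # The hardcoded fallback path that some libraries search.
    candidates.append(("default symlink", default_symlink))

    results = []
    for source, configured in candidates:
        path = Path(configured)
        # Path.exists() follows symlinks, so a dangling link reports False.
        resolved = str(path.resolve()) if path.exists() else None
        results.append((source, configured, resolved))
    return results


if __name__ == "__main__":
    for source, configured, resolved in candidate_toolkit_roots():
        print(f"{source}: {configured} -> {resolved or 'missing'}")
```

Comparing the resolved paths shows at a glance whether the environment variables and the symlink point at the same installation.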

A check could be added to rapids-cli that verifies:

  • discoverability of shared libraries via cuda-pathfinder
  • version consistency among the libraries found by cuda-pathfinder, the GPU driver, the /usr/local/cuda symlink, and the CUDA_HOME/CUDA_PATH environment variables (if present). A mismatch in major versions is an automatic error, but I am curious what the recommended approach is for a minor-version mismatch. Should that be a warning or an error?
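
As a rough sketch of the version-consistency part, the comparison could look like the following. It assumes the version strings have already been collected from each source (how each one is obtained is left out), and it treats a minor-version mismatch as a warning purely as a placeholder, since that is the open question above. The names `parse_major_minor` and `check_version_consistency` are hypothetical.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '12.4' or '12.4.1'."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized CUDA version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_version_consistency(versions):
    """Compare every pair of sources and return (severity, message) tuples.

    `versions` maps a source label (e.g. 'driver') to a version string.
    Major mismatch -> 'error'. Minor mismatch -> 'warning', which is an
    assumption made for this sketch, not a settled policy.
    """
    parsed = {source: parse_major_minor(v) for source, v in versions.items()}
    sources = sorted(parsed)
    findings = []
    for i, first in enumerate(sources):
        for second in sources[i + 1:]:
            a, b = parsed[first], parsed[second]
            if a[0] != b[0]:
                severity = "error"
            elif a[1] != b[1]:
                severity = "warning"
            else:
                continue  # Same major and minor: consistent.
            findings.append((
                severity,
                f"{first} reports CUDA {a[0]}.{a[1]} but "
                f"{second} reports CUDA {b[0]}.{b[1]}",
            ))
    return findings


if __name__ == "__main__":
    example = {"driver": "12.4", "pathfinder": "12.2.1", "CUDA_HOME": "11.8"}
    for severity, message in check_version_consistency(example):
        print(f"[{severity}] {message}")
```

Keeping the severity decision in one place means the policy for minor mismatches can be changed later without touching how the versions are gathered.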

I think having this check fills in a very important gap in existing rapids doctor checks, and builds upon information which can be gathered by rapids debug.
