Add a `rapids doctor` check to verify CUDA toolkit libraries are findable and are version-consistent

RAPIDS libraries can often install successfully via pip or conda but fail at runtime because the underlying CUDA toolkit is not set up properly. Some scenarios I have personally come across:

- Shared libraries (like `libcudart.so` or `libnvrtc.so`) are not findable at runtime. Pip wheels previously did not provide these (although with cupy 14, things look much better), and any setup with a preinstalled CUDA toolkit could have an incorrect configuration.
- The CUDA toolkit version (either from pip/conda installations or from `CUDA_HOME` and `/usr/local/cuda` symlink resolution) does not match the GPU driver's CUDA version.
- The scenario above is further exacerbated by libraries hardcoding `/usr/local/cuda` as a fallback search path, so a stale symlink causes the wrong libraries to be loaded.
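
To make the symlink scenario concrete, here is a rough sketch of how the toolkit roots could be inspected. It is not existing `rapids-cli` code: the helper names `toolkit_candidates` and `describe_root` are hypothetical, and it assumes the toolkit root contains a `version.json` file with a `cuda.version` entry, which may not hold for every installation.

```python
import json
import os
from pathlib import Path


def toolkit_candidates(env=None, default_root="/usr/local/cuda"):
    """Collect toolkit roots from CUDA_HOME/CUDA_PATH plus the default path.

    Hypothetical helper; the keys are the source each root came from.
    """
    env = os.environ if env is None else env
    candidates = {}
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if env.get(var):
            candidates[var] = Path(env[var])
    candidates[default_root] = Path(default_root)
    return candidates


def describe_root(root):
    """Report whether a toolkit root exists, where it resolves, and its version."""
    root = Path(root)
    info = {"path": str(root), "exists": root.exists(), "resolved": None, "version": None}
    if not info["exists"]:
        # Path.exists() follows symlinks, so a dangling symlink lands here.
        return info
    info["resolved"] = str(root.resolve())
    version_file = root / "version.json"  # assumption: present in the toolkit root
    if version_file.is_file():
        try:
            data = json.loads(version_file.read_text())
            info["version"] = data.get("cuda", {}).get("version")
        except (OSError, ValueError, AttributeError):
            pass  # unreadable or unexpected file layout: leave version as None
    return info


if __name__ == "__main__":
    for source, root in toolkit_candidates().items():
        print(source, describe_root(root))
```

A root that reports `exists: False`, or one whose resolved path or version differs from the other sources, is the kind of case the check would flag.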

A check can be added to `rapids-cli` that verifies:

- discoverability of shared libraries via `cuda-pathfinder`
- version consistency between the libraries found by `cuda-pathfinder`, the GPU driver, the `/usr/local/cuda` symlink, and the `CUDA_HOME`/`CUDA_PATH` environment variables (if present). A major-version mismatch is an automatic error, but I am curious what the recommended approach is for a minor-version mismatch. Should it be a warning or an error?
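
The comparison step could look something like the sketch below. It is only an illustration: the function names are hypothetical, it assumes each source has already been reduced to a version string or to the integer reported by the driver (encoded as `1000 * major + 10 * minor`, e.g. `12040` for 12.4), and the severity of a minor mismatch is left as a parameter because that question is still open.

```python
def parse_version(text):
    """Turn a version string such as '12.4.1' into a (major, minor) tuple."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def decode_driver_version(value):
    """Decode the integer CUDA version reported by the driver, e.g. 12040 -> (12, 4)."""
    return value // 1000, (value % 1000) // 10


def check_consistency(versions, minor_mismatch="warning"):
    """Compare (major, minor) tuples keyed by the source they came from.

    Returns a (level, message) pair where level is 'ok', 'warning', or 'error'.
    minor_mismatch selects the level for a minor-only mismatch (undecided upstream).
    """
    if minor_mismatch not in ("warning", "error"):
        raise ValueError("minor_mismatch must be 'warning' or 'error'")
    summary = ", ".join(
        f"{source}={major}.{minor}" for source, (major, minor) in sorted(versions.items())
    )
    if len({major for major, _ in versions.values()}) > 1:
        return "error", f"CUDA major versions differ: {summary}"
    if len({minor for _, minor in versions.values()}) > 1:
        return minor_mismatch, f"CUDA minor versions differ: {summary}"
    return "ok", f"CUDA versions are consistent: {summary}"


if __name__ == "__main__":
    found = {
        "driver": decode_driver_version(12040),
        "/usr/local/cuda": parse_version("12.2.0"),
        "CUDA_HOME": parse_version("12.4.1"),
    }
    level, message = check_consistency(found)
    print(level, message)
```

Keeping the minor-mismatch level configurable means the check can ship as a warning first and be tightened later without changing the comparison logic.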

I think having this check fills in a very important gap in existing `rapids doctor` checks, and builds upon information which can be gathered by `rapids debug`. 

Add a `rapids doctor` check to verify CUDA toolkit libraries are findable and are version-consistent #139
