Describe the bug
I just tried to put together some sample code for #6626 but ran into a warning I have seen many times before. The problem appears when a transform pushes data to the GPU and the data is then handed over from the DataLoader worker process to the main process.
This is not a hard bug, but it is very annoying since the warning is spammed a lot.
A temporary workaround I found is to add "persistent_workers=True" to the DataLoader; the warning is then only shown at the end of the program, and sometimes not at all.
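For reference, the workaround only touches the DataLoader construction. Below is a minimal sketch with a hypothetical small CPU-only dataset standing in for the ArrayDataset used in the report, so it runs without a GPU:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical small CPU dataset standing in for the ArrayDataset in the report.
xs = torch.rand(4, 1, 8, 8)
ys = torch.rand(4, 1, 8, 8)
dataset = TensorDataset(xs, ys)

# persistent_workers=True keeps the worker processes alive between epochs
# instead of tearing them down (and any per-worker CUDA state) after each one.
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=1,
    persistent_workers=True,
)

# 4 samples at batch_size=2 yield two batches of shape (2, 1, 8, 8).
shapes = [tuple(x.shape) for x, y in loader]
```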
Warning message:
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
To Reproduce
Run this code, minimal sample:
import torch
from torch import optim, nn
from monai.data import DataLoader, ArrayDataset
from monai.engines import SupervisedTrainer
from monai.inferers import SlidingWindowInferer
from monai.networks.nets.dynunet import DynUNet
import monai.transforms as mt

NETWORK_INPUT_SHAPE = (1, 128, 128, 256)
NUM_IMAGES = 50

def get_xy():
    xs = [256 * torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    ys = [torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    return xs, ys

# The transform moves each sample to the GPU inside the DataLoader worker.
transform = mt.Compose([
    mt.ToDevice(device="cuda"),
])

def get_data_loader():
    x, y = get_xy()
    dataset = ArrayDataset(x, seg=y, img_transform=transform, seg_transform=transform)
    loader = DataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context="spawn")
    return loader

def get_model():
    return DynUNet(
        spatial_dims=3,
        in_channels=1,
        out_channels=1,
        kernel_size=[3, 3, 3, 3, 3, 3],
        strides=[1, 2, 2, 2, 2, [2, 2, 1]],
        upsample_kernel_size=[2, 2, 2, 2, [2, 2, 1]],
        norm_name="instance",
        deep_supervision=False,
        res_block=True,
    ).to(device=device)

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_loader = get_data_loader()
    model = get_model()
    MAX_EPOCHS = 2
    optimizer = optim.Adam(model.parameters())
    inferer = SlidingWindowInferer(roi_size=(64, 64, 64), sw_batch_size=10, mode="gaussian")
    trainer = SupervisedTrainer(
        device=device,
        max_epochs=MAX_EPOCHS,
        amp=True,
        train_data_loader=train_loader,
        network=model,
        optimizer=optimizer,
        inferer=inferer,
        loss_function=nn.CrossEntropyLoss(),
        prepare_batch=lambda batchdata, device, non_blocking: (
            batchdata[0].to(device),
            batchdata[1].squeeze(1).to(device, dtype=torch.long),
        ),
    )
    trainer.run()
Expected behavior
No CUDA warnings.
Environment
Verified in several different environments.
================================
Printing MONAI config...
================================
MONAI version: 1.1.0
Numpy version: 1.23.5
Pytorch version: 1.13.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: a2ec3752f54bfc3b40e7952234fbeb5452ed63e3
MONAI __file__: /home/matteo/anaconda3/envs/monai/lib/python3.9/site-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.10
Nibabel version: 5.0.1
scikit-image version: 0.20.0
Pillow version: 9.5.0
Tensorboard version: 2.12.1
gdown version: 4.7.1
TorchVision version: 0.14.0+cu117
tqdm version: 4.64.1
lmdb version: 1.4.0
psutil version: 5.9.4
pandas version: 1.5.3
einops version: 0.6.0
transformers version: 4.21.3
mlflow version: 2.2.2
pynrrd version: 1.0.0
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 22.04.2 LTS
Platform: Linux-5.19.0-45-generic-x86_64-with-glibc2.35
Processor: x86_64
Machine: x86_64
Python version: 3.9.16
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 12
Num logical CPUs: 24
Num usable CPUs: 24
CPU usage (%): [4.1, 3.6, 4.2, 3.6, 3.6, 3.7, 3.6, 3.1, 3.6, 4.1, 5.2, 99.5, 4.1, 3.6, 4.6, 3.6, 3.6, 3.6, 3.6, 3.6, 3.6, 5.2, 3.6, 4.2]
CPU freq. (MHz): 3687
Load avg. in last 1, 5, 15 mins (%): [6.3, 7.6, 7.5]
Disk usage (%): 24.8
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 31.2
Available memory (GB): 26.9
Used memory (GB): 3.9
================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 11.7
cuDNN enabled: True
cuDNN version: 8500
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: NVIDIA GeForce RTX 3090 Ti
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 84
GPU 0 Total memory (GB): 22.2
GPU 0 CUDA capability (maj.min): 8.6
Additional context
Adding an evaluator makes the warnings worse, and an additional warning now appears:
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
The code for that:
import torch
from torch import optim, nn
from monai.data import DataLoader, ArrayDataset
from monai.engines import SupervisedEvaluator, SupervisedTrainer
from monai.handlers import LrScheduleHandler, StatsHandler, ValidationHandler
from monai.inferers import SlidingWindowInferer
from monai.networks.nets.dynunet import DynUNet
import monai.transforms as mt

NETWORK_INPUT_SHAPE = (1, 128, 128, 256)
NUM_IMAGES = 50

def get_xy():
    xs = [256 * torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    ys = [torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    return xs, ys

transform = mt.Compose([
    mt.ToDevice(device="cuda"),
])

def get_data_loader():
    x, y = get_xy()
    dataset = ArrayDataset(x, seg=y, img_transform=transform, seg_transform=transform)
    loader = DataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context="spawn")
    return loader

def get_model():
    return DynUNet(
        spatial_dims=3,
        # 1 channel for the image; the label carries one signal per voxel at image size
        in_channels=1,
        out_channels=1,
        kernel_size=[3, 3, 3, 3, 3, 3],
        strides=[1, 2, 2, 2, 2, [2, 2, 1]],
        upsample_kernel_size=[2, 2, 2, 2, [2, 2, 1]],
        norm_name="instance",
        deep_supervision=False,
        res_block=True,
    ).to(device=device)

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_loader = get_data_loader()
    model = get_model()
    MAX_EPOCHS = 2
    optimizer = optim.Adam(model.parameters())
    inferer = SlidingWindowInferer(roi_size=(64, 64, 64), sw_batch_size=10, mode="gaussian")
    val_inferer = SlidingWindowInferer(roi_size=(64, 64, 64), sw_batch_size=10, mode="gaussian")
    val_handlers = [
        StatsHandler(output_transform=lambda x: None),
    ]
    evaluator = SupervisedEvaluator(
        device=device,
        amp=True,
        val_data_loader=train_loader,
        network=model,
        inferer=val_inferer,
        prepare_batch=lambda batchdata, device, non_blocking: (
            batchdata[0].to(device),
            batchdata[1].squeeze(1).to(device, dtype=torch.long),
        ),
        val_handlers=val_handlers,
    )
    lr_scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=MAX_EPOCHS, power=2)
    train_handlers = [
        ValidationHandler(validator=evaluator, interval=1, epoch_level=True),
        LrScheduleHandler(lr_scheduler=lr_scheduler, print_lr=True),
    ]
    trainer = SupervisedTrainer(
        device=device,
        max_epochs=MAX_EPOCHS,
        amp=True,
        train_data_loader=train_loader,
        network=model,
        optimizer=optimizer,
        inferer=inferer,
        loss_function=nn.CrossEntropyLoss(),
        prepare_batch=lambda batchdata, device, non_blocking: (
            batchdata[0].to(device),
            batchdata[1].squeeze(1).to(device, dtype=torch.long),
        ),
        train_handlers=train_handlers,
    )
    trainer.run()