NVBit loops infinitely when instrumenting a cuDNN function. #146

@jwnhy

I am trying to use NVBit to instrument the following TensorFlow program.

import tensorflow as tf
from keras import layers
import os
os.environ["TF_DISABLE_RZ_CHECK"] = "1"
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
tf.keras.backend.set_image_data_format('channels_first')
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
tf.config.run_functions_eagerly(True)

tensor = tf.zeros([1, 2, 859043])
model = layers.Conv1D(filters=2, kernel_size=524287, strides=1, groups=2)
model(tensor)

print("DONE")

It gets stuck after launching the following kernel, which is a cuDNN kernel.

MEMTRACE: CTX 0x00000000050f8db0 - LAUNCH - Kernel pc 0x00007ff9a038f900 - Kernel name sm80_xmma_fprop_implicit_gemm_indexed_tf32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize64x64x64_stage4_warpsize1x4x1_g16_tensor16x8x8_execute_kernel__5x_cudnn - grid launch id 12 - grid size 1,5231,1 - block size 128,1,1 - nregs 166 - shmem 132096 - cuda stream id 1276264096
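
For a sense of scale, here is a rough check of the workload size. It assumes the Keras default `padding="valid"` for `Conv1D`, and it guesses from the `tilesize64x64x64` part of the kernel name that each CTA covers 64 output positions. The second assumption is only an inference from the name and is not confirmed by any cuDNN documentation.

```python
import math


def conv1d_valid_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' (no) padding."""
    return (input_length - kernel_size) // stride + 1


# Values taken from the reproducer above.
out_len = conv1d_valid_output_length(859043, 524287, stride=1)

# Assumption: one CTA per 64 output positions (guessed from "tilesize64x64x64").
ctas = math.ceil(out_len / 64)

print(out_len, ctas)  # 334757 5231
```

Under these assumptions the CTA count is consistent with the reported grid size of 1,5231,1, and every output position depends on 524287 filter taps, so each CTA has a very long inner loop to run.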

Also, after viewing the memory access pattern produced by NVBit, I found that a single address is accessed multiple times by the same CTA/warp. Thus, I suspect that NVBit introduces some sort of infinite loop.
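
A minimal sketch of how such repeated accesses could be counted from a saved trace. The line layout matched by `LINE_RE` is an assumption about the tool's per-access output (`... - CTA x,y,z - warp N - OPCODE - addr addr ...`), and `count_repeated_accesses` is a hypothetical helper, not part of NVBit. Adjust the pattern to whatever the trace actually prints.

```python
import re
from collections import Counter

# Assumed (hypothetical) per-access line layout:
#   "... - CTA 0,12,0 - warp 3 - LDG.E - 0x00007f00aa000000 0x00007f00aa000004"
LINE_RE = re.compile(r"CTA (\d+),(\d+),(\d+) - warp (\d+) - \S+ - (.*)")


def count_repeated_accesses(lines):
    """Return {(cta, warp, address): count} for addresses seen more than once."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # e.g. kernel LAUNCH lines, which carry no CTA field
        cta = tuple(int(match.group(i)) for i in (1, 2, 3))
        warp = int(match.group(4))
        # Count each distinct address once per instruction, so lanes that
        # read the same address in one access do not inflate the count.
        addresses = {int(token, 16) for token in match.group(5).split()}
        addresses.discard(0)  # assumption: inactive lanes are recorded as 0
        for address in addresses:
            counts[(cta, warp, address)] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

A count above 1 only shows that the warp touched the address again. It does not by itself distinguish a loop that never terminates from one that is merely long, so comparing counts between two snapshots of the trace taken some time apart may be more telling.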

The program runs fine without NVBit or compute-sanitizer and finishes within a minute.

But it fails when instrumented with either NVBit or compute-sanitizer.

Since both cuDNN and compute-sanitizer belong to NVIDIA, I thought perhaps you could help find the root cause.
