Description
I am trying to use NVBit to instrument the following TensorFlow program.
import tensorflow as tf
from keras import layers
import os
os.environ["TF_DISABLE_RZ_CHECK"] = "1"
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
tf.keras.backend.set_image_data_format('channels_first')
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
tf.config.run_functions_eagerly(True)
tensor = tf.zeros([1, 2, 859043])
model = layers.Conv1D(filters=2, kernel_size=524287, strides=1, groups=2)
model(tensor)
print("DONE")

It hangs after launching the following kernel, which is a cuDNN kernel:
MEMTRACE: CTX 0x00000000050f8db0 - LAUNCH - Kernel pc 0x00007ff9a038f900 - Kernel name sm80_xmma_fprop_implicit_gemm_indexed_tf32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize64x64x64_stage4_warpsize1x4x1_g16_tensor16x8x8_execute_kernel__5x_cudnn - grid launch id 12 - grid size 1,5231,1 - block size 128,1,1 - nregs 166 - shmem 132096 - cuda stream id 1276264096
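To make the launch geometry in that line easier to read, here is a minimal sketch that pulls the numbers out of it. The function name `parse_launch` is hypothetical, and the regular expressions assume the field layout seen in this one log line (`grid size x,y,z`, `block size x,y,z`, `nregs N`, `shmem N`); other NVBit tools or versions may print it differently.

```python
import re

# The LAUNCH line from the log above, copied verbatim.
LAUNCH_LINE = (
    "MEMTRACE: CTX 0x00000000050f8db0 - LAUNCH - Kernel pc 0x00007ff9a038f900 - "
    "Kernel name sm80_xmma_fprop_implicit_gemm_indexed_tf32f32_tf32f32_f32_"
    "nhwckrsc_nhwc_tilesize64x64x64_stage4_warpsize1x4x1_g16_tensor16x8x8_"
    "execute_kernel__5x_cudnn - grid launch id 12 - grid size 1,5231,1 - "
    "block size 128,1,1 - nregs 166 - shmem 132096 - cuda stream id 1276264096"
)

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def parse_launch(line):
    """Extract launch geometry from one LAUNCH line (assumed field layout)."""
    grid = re.search(r"grid size (\d+),(\d+),(\d+)", line)
    block = re.search(r"block size (\d+),(\d+),(\d+)", line)
    nregs = re.search(r"nregs (\d+)", line)
    shmem = re.search(r"shmem (\d+)", line)
    if not (grid and block and nregs and shmem):
        raise ValueError("line does not match the assumed LAUNCH format")

    gx, gy, gz = (int(v) for v in grid.groups())
    bx, by, bz = (int(v) for v in block.groups())
    num_blocks = gx * gy * gz
    threads_per_block = bx * by * bz
    return {
        "num_blocks": num_blocks,
        "threads_per_block": threads_per_block,
        # Round up so partially filled warps are still counted.
        "warps_per_block": (threads_per_block + WARP_SIZE - 1) // WARP_SIZE,
        "total_threads": num_blocks * threads_per_block,
        "nregs": int(nregs.group(1)),
        "shmem_bytes": int(shmem.group(1)),
    }


info = parse_launch(LAUNCH_LINE)
print(info)
```

For this launch the sketch reports 5231 CTAs of 128 threads (4 warps) each, using 132096 bytes (129 KiB) of shared memory per CTA.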
Also, after inspecting the memory access pattern produced by NVBit, I found that a single address is accessed multiple times by the same CTA/warp. I therefore suspect that NVBit introduces some sort of infinite loop.
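A check of this kind can be sketched as a small script over the trace output. The record format it parses (`... - CTA x,y,z - warp N - OPCODE - 0x... 0x...`) is an assumption about what the tracing tool prints per memory instruction, and `count_repeated_accesses` is a hypothetical helper, so the regular expression may need adjusting to the actual log.

```python
import re
from collections import Counter

# Assumed per-access record layout:
#   "... - CTA x,y,z - warp N - OPCODE - 0xADDR 0xADDR ..."
ACCESS_RE = re.compile(
    r"CTA (\d+),(\d+),(\d+) - warp (\d+) - (\S+) - ((?:0x[0-9a-fA-F]+\s*)+)"
)


def count_repeated_accesses(lines):
    """Return {(cta, warp, address): count} for addresses seen more than once.

    Each record contributes at most one hit per address, so several lanes of
    one instruction reading the same address are not counted as repeats.
    """
    counts = Counter()
    for line in lines:
        match = ACCESS_RE.search(line)
        if match is None:
            continue  # e.g. LAUNCH lines or unrelated program output
        cta = tuple(int(v) for v in match.group(1, 2, 3))
        warp = int(match.group(4))
        addresses = {int(a, 16) for a in match.group(6).split()}
        addresses.discard(0)  # assumption: inactive lanes are reported as 0
        for address in addresses:
            counts[(cta, warp, address)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample records, only to show the expected input shape.
sample = [
    "MEMTRACE: CTX 0x1 - grid_launch_id 12 - CTA 0,1,0 - warp 2 - LDG.E - "
    "0x00007f0000001000 0x0000000000000000",
    "MEMTRACE: CTX 0x1 - grid_launch_id 12 - CTA 0,1,0 - warp 2 - LDG.E - "
    "0x00007f0000001000 0x00007f0000001000",
    "MEMTRACE: CTX 0x1 - grid_launch_id 12 - CTA 0,1,0 - warp 3 - LDG.E - "
    "0x00007f0000001000",
]
repeats = count_repeated_accesses(sample)
for (cta, warp, address), n in sorted(repeats.items()):
    print(f"CTA {cta} warp {warp} address {address:#x}: {n} accesses")
```

A high count on its own does not prove an infinite loop, since ordinary loops also re-read the same addresses; a count that keeps growing for as long as the kernel runs would be a stronger hint.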
The program runs fine without NVBit/compute-sanitizer and finishes within a minute, but it fails when instrumented with either NVBit or compute-sanitizer.
Since both cuDNN and compute-sanitizer belong to NVIDIA, I thought perhaps you could help find the root cause.