From the CUDA side, nearly all of the DL frameworks and array libraries that support dlpack use CUDA streams and stream order both their computations and memory allocations. In its current form, dlpack doesn't specify any synchronization semantics as well as doesn't have a way to specify information to allow for a producer-consumer pair to exchange the necessary information to continue to stream order computations.
I imagine there's a similar problem in other contexts as well (OpenCL, ROCm, etc.) where maybe it's possible to generalize an approach.
cc @oleksandr-pavlyk
From the CUDA side, nearly all of the DL frameworks and array libraries that support dlpack use CUDA streams and stream order both their computations and memory allocations. In its current form, dlpack doesn't specify any synchronization semantics as well as doesn't have a way to specify information to allow for a producer-consumer pair to exchange the necessary information to continue to stream order computations.
I imagine there's a similar problem in other contexts as well (OpenCL, ROCm, etc.) where maybe it's possible to generalize an approach.
cc @oleksandr-pavlyk