[Runtime] Flush L2 cache in time eval#15305
tqchen merged 1 commit into apache:main from spectrometerHBH:flush
|
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment. Generated by tvm-bot |
|
Also CC: @yzh119 |
|
I suppose we already have a and you can reuse the |
I don't want it to be CUDA-only |
|
Yes, the major concern is that L2 cache size is device-specific, and later architectures may have an L2 cache larger than 256 MB |
|
To make it more general, how about we instead introduce an `l2_cache_flush_bytes` parameter, which defaults to 0, and use it to indicate the size of the arrays to allocate? This way it would generalize across GPUs as long as we set this argument correctly |
|
The implementation per se is not specific to L2 either. We could call it `cache_flush_bytes` |
|
cache_flush_bytes sounds good to me |
|
@tvm-bot rerun |
This PR introduces an optional cache flush to `time_evaluator`. It is implemented by allocating two large empty NDArrays on the device so that the L2 cache is flushed. This gives a more accurate measurement of a runtime function's performance.
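The idea can be illustrated with a minimal host-side sketch. This is not the TVM implementation: `time_eval` is a hypothetical helper, plain `bytearray` buffers stand in for device NDArrays, and copying one buffer into the other before each run is an assumption about how the two arrays would be used to evict cached data. Only the `cache_flush_bytes` name and its default of 0 come from the discussion above.

```python
import time


def time_eval(func, number=10, cache_flush_bytes=0):
    """Return the mean wall-clock seconds per call of `func`.

    When cache_flush_bytes > 0, two scratch buffers of that size are
    allocated once and one is copied into the other before every timed
    call, so data left in the cache by the previous call is likely evicted.
    (Host-memory stand-in for device arrays; hypothetical helper.)
    """
    if number <= 0:
        raise ValueError("number must be positive")

    src = dst = None
    if cache_flush_bytes > 0:
        src = bytearray(cache_flush_bytes)
        dst = bytearray(cache_flush_bytes)

    total = 0.0
    for _ in range(number):
        if src is not None:
            # Untimed: stream through both buffers to displace cached data.
            dst[:] = src
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / number


if __name__ == "__main__":
    data = list(range(100_000))
    warm = time_eval(lambda: sum(data), number=20)
    cold = time_eval(lambda: sum(data), number=20, cache_flush_bytes=64 << 20)
    print(f"no flush: {warm:.6f}s  with flush: {cold:.6f}s")
```

Keeping the flush outside the timed region matters: the copy itself should not be counted as part of the function's run time. With the default of 0 nothing is allocated and the loop behaves as it did before.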
Follow-up to #15305: this PR adds an API to query the device L2 cache size in bytes. Currently, the supported devices include CUDA, OpenCL, and ROCm. Note that OpenCL's API does not return the accurate device L2 cache size. I cannot find a Vulkan API that returns the L2 texture cache size, but the `vkCmdPipelineBarrier` call will flush the L2 texture cache automatically (https://zeux.io/2020/02/27/writing-an-efficient-vulkan-renderer/), so we return 0 by default.
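One way a caller might combine the queried size with `cache_flush_bytes` is sketched below. Everything here is illustrative: `choose_cache_flush_bytes` is a hypothetical helper, the attribute name `l2_cache_size_bytes` is an assumption about how the query is exposed on a device object, and the fallback policy is not something the PR prescribes. `SimpleNamespace` objects stand in for real devices.

```python
from types import SimpleNamespace


def choose_cache_flush_bytes(dev, fallback_bytes=0):
    """Pick a flush size from a device-like object.

    Reads the (assumed) `l2_cache_size_bytes` attribute. A value of 0 means
    the backend could not report a size (the PR returns 0 for Vulkan), in
    which case the caller-supplied fallback is used instead.
    """
    size = getattr(dev, "l2_cache_size_bytes", 0) or 0
    return size if size > 0 else fallback_bytes


if __name__ == "__main__":
    # Stand-ins for real devices; the sizes are made up for illustration.
    reports_size = SimpleNamespace(l2_cache_size_bytes=40 * 1024 * 1024)
    reports_zero = SimpleNamespace(l2_cache_size_bytes=0)
    print(choose_cache_flush_bytes(reports_size))
    print(choose_cache_flush_bytes(reports_zero, fallback_bytes=256 << 20))
```

Because OpenCL's reported value may be inaccurate, a caller might still prefer an explicit size over the queried one on that backend.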