Conversation
comaniac left a comment:
LGTM. Thanks for the insightful investigation :)
This could also illustrate the scenario of developing Collage, which deals with backend placement (https://arxiv.org/pdf/2111.00655.pdf).
I'm having a conversation about strided dgrad performance compared to cuDNN. Will give more updates before merging.
HUGE UPDATE: Thanks to a tip from @manishucsd and @hwu36, it turns out that upgrading CUDA from 11.3 to 11.6 alone gives a 2x speedup on cutlass strided dgrad (unreal). Moreover, there was a critical bug in one of the parameters. Here are the updated results after these two fixes: now cutlass is winning in ALL but one case at batch size 256, and even there the difference is only 0.96 vs 0.94. Note that activation fusion is not enabled for dgrad yet, so I expect cutlass perf to be much better in practice for DL training use cases.
Real-world training would require fp32 accumulation. In that case, the kernel becomes more compute-bound, and the better kernel has an even bigger advantage.
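To make the accumulation point concrete, here is a toy NumPy sketch (mine, not anything from this PR) showing how a long fp16 reduction loses precision when the running sum is also kept in fp16, versus accumulating in fp32:

```python
import numpy as np

# Toy illustration of fp16 inputs with fp16 vs fp32 accumulation.
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 256)).astype("float16")
b = rng.standard_normal((256, 64)).astype("float16")

ref = a.astype("float64") @ b.astype("float64")  # high-precision reference

acc16 = np.zeros((64, 64), dtype="float16")      # running sum kept in fp16
for k in range(a.shape[1]):
    acc16 += np.outer(a[:, k], b[k, :])          # rank-1 update, accumulated in fp16

acc32 = a.astype("float32") @ b.astype("float32")  # fp32 accumulation

print("fp16-accum max error:", np.abs(acc16 - ref).max())
print("fp32-accum max error:", np.abs(acc32 - ref).max())
```

The fp16-accumulated result drifts noticeably further from the reference as the reduction axis grows, which is why training kernels accumulate in fp32 even with fp16 inputs.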
Merging. I'll follow up with wgrad + parallel split-k support.
* add conv2d transpose nhwc cudnn test
* support conv2d transpose nhwc direct offload to cudnn
* add cutlass dgrad support
* remove unused arg
* allow target none
* fix beta initialization condition
* disable dynamic dense fp16 test since it fails on cuda 11.6
Adds dgrad support. Wgrad is more complicated and I'm having weird accuracy issues, so it will come later.
UPDATE: See the latest result in #10110 (comment)
@comaniac @Laurawly @junrushao1994 @vinx13 @YuchenJin @hwu36 @manishucsd
Old results below, not relevant anymore
Linked below is a benchmark result against cuDNN on resnet50 workloads, with batch sizes 8 and 256. All numbers are in milliseconds, generated on an RTX 3070 by this script.
It's interesting to note that at batch size 8 cutlass is mostly faster, while at batch size 256 cuDNN is faster. Looking at the nvprof dump, it turns out that even if the e2e time, as reported by TVM's `time_evaluator`, shows cutlass being faster, cuDNN can be winning in kernel-only time. For example, the first row of the batch 8 case shows cutlass vs cudnn = 54 vs 109 usec, but the nvprof dump shows that more than half of cuDNN's e2e time is spent on overhead, either inside TVM during the cuDNN call or within cuDNN itself. Apparently, cutlass has much smaller overhead.
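For context on what the e2e numbers include, here is a minimal, self-contained sketch of how `time_evaluator` is typically used (the actual benchmark script is not reproduced here; the kernel and shapes are illustrative only). It times the whole packed-function call, which is why it captures host-side overhead that nvprof's kernel-only view does not:

```python
import numpy as np
import tvm
from tvm import te

# Build a trivial CUDA kernel just to demonstrate the timing flow.
n = 1 << 20
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")
s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=128)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))

mod = tvm.build(s, [A, B], target="cuda")
dev = tvm.cuda(0)
a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
b = tvm.nd.array(np.zeros(n, dtype="float32"), dev)

# time_evaluator runs the compiled function `number` times per repeat and
# reports the mean, including the host-side call path, not just GPU time.
timer = mod.time_evaluator(mod.entry_name, dev, number=100, repeat=3)
print("mean e2e time (ms):", timer(a, b).mean * 1e3)
```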
cuDNN is mostly faster in the batch 256 case. This could be because the relative overhead is smaller at this scale. In particular, the difference is large for stride = 2 cases. For example, the 5th row shows cutlass vs cudnn = 4.18 vs 1.71 msec, and the nvprof dump suggests cuDNN's strided dgrad being significantly better than cutlass's (?) @manishucsd
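As rough intuition for why stride = 2 dgrad is a harder kernel, here is a NumPy sketch of the standard zero-insertion view of strided dgrad (not cutlass's actual algorithm): the data gradient of a stride-2 conv is equivalent to a convolution over the output gradient with zeros inserted between its elements, so the access pattern is much less regular than in the stride-1 case.

```python
import numpy as np

# Output gradient of a stride-2 conv, 3x3 for illustration.
dy = np.arange(1, 10, dtype=np.float32).reshape(3, 3)

# Zero-insertion ("dilation") of dy: stride s inserts s-1 zeros between
# elements. dgrad then reduces to a stride-1 conv over this sparse tensor.
dy_dilated = np.zeros((5, 5), dtype=np.float32)
dy_dilated[::2, ::2] = dy
print(dy_dilated)
```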
However, even at the larger batch size, cutlass always wins on workloads with filter size 3. For example, here is the nvprof dump for the third row.