[Relax] Integrate cuDNN attention#17157
Merged
vinx13 merged 9 commits into apache:main on Jul 22, 2024
Conversation
yongwww approved these changes (Jul 15, 2024)
sunggg (Contributor) reviewed (Jul 15, 2024) and left a comment:
Thank you @vinx13 for the new addition!
Overall, looks good to me.
Would you describe the high-level strategy for attention somewhere? (e.g., when to offload to cuDNN, CUTLASS, TIR, etc.)
If this PR is about landing the machinery rather than making such offloading decisions, I would appreciate it if you could provide some recommendations.
vinx13 (Member, Author) replied:
The new attention kernel can be applied via the cuDNN BYOC. The decision of which BYOC backend (cuDNN or CUTLASS) to use is left to the user. cuDNN is likely to perform better on H100, as it includes H100-specific optimizations.
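As a rough sketch of what offloading attention to the cuDNN BYOC might look like on the user side (this is an untested illustration, not part of the PR; it assumes a CUDA-enabled TVM build with cuDNN support, and the helper name `partition_for_cudnn` and its module path are assumptions that may differ by TVM version):

```python
# Hypothetical sketch: offloading Relax attention ops to cuDNN via BYOC.
# Assumes a TVM build with cuDNN enabled; exact names may differ by version.
import tvm
from tvm import relax
from tvm.relax.backend.contrib.cudnn import partition_for_cudnn  # assumed helper

def offload_attention_to_cudnn(mod: tvm.IRModule) -> tvm.IRModule:
    # Partition out subgraphs that match cuDNN-supported patterns
    # (e.g., fused attention) into separate BYOC functions.
    mod = partition_for_cudnn(mod)
    # Lower the partitioned functions to cuDNN runtime calls.
    mod = relax.transform.RunCodegen()(mod)
    return mod
```

Switching the partitioning helper (e.g., to the CUTLASS equivalent) is how a user would choose a different BYOC backend, per the comment above.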
masahi approved these changes (Jul 16, 2024)
masahi (Member) left a comment:
I remember that cuDNN attention supports fp8; it would be interesting to support that too.
This integrates cuDNN attention kernels into BYOC.
A dependency on cudnn_frontend is added.
The cuDNN attention kernel supports fused QKV in the BS3NH and SBN3H layouts.
cc @sunggg @masahi @yongwww @tqchen