x64: Lower shuffle and swizzle in ISLE#4772
Conversation
Subscribe to Label ActionDetailsThis issue or pull request has been labeled: "cranelift", "cranelift:area:machinst", "cranelift:area:x64", "isle"Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
28fa27d to
d52bd6d
Compare
| ; movdqa %xmm0, %xmm9 | ||
| ; load_const VCodeConstant(0), %xmm0 | ||
| ; vpermi2b %xmm1, %xmm0, %xmm9 | ||
| ; movq %rbp, %rsp | ||
| ; popq %rbp | ||
| ; ret |
There was a problem hiding this comment.
Here's the case where the mask wasn't necessary, thus no andps instruction was generated.
| ; movdqa %xmm0, %xmm12 | ||
| ; load_const VCodeConstant(1), %xmm0 | ||
| ; load_const VCodeConstant(0), %xmm7 | ||
| ; vpermi2b %xmm1, %xmm7, %xmm12 | ||
| ; andps %xmm0, %xmm7, %xmm0 | ||
| ; movq %rbp, %rsp | ||
| ; popq %rbp | ||
| ; ret |
There was a problem hiding this comment.
Here's a case where the permutation contained out-of-bounds values, so the andps on line 38 is necessary.
| v2 = shuffle v0, v1, [3 0 32 255 4 6 12 11 23 13 24 4 2 97 17 5] | ||
| return v2 | ||
| } | ||
| ; run: %shuffle_zeros([1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16], [17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32]) == [4 1 0 0 5 7 13 12 24 14 25 5 3 0 18 6] |
There was a problem hiding this comment.
This test already passed for the non-avx512 lowering, and this PR allows it to pass when avx512 is available as well.
There was a problem hiding this comment.
Excellent -- it sounds like you've tested on avx512 hardware?
(As an aside, it might be nice to think about ways to use qemu to get us AVX512 testing in CI, if the GitHub Actions host doesn't natively have it. @abrown any thoughts on that?)
There was a problem hiding this comment.
I have -- my laptop has those extensions :D
| (_ Unit (emit (gen_move $I8X16 dst src3))) | ||
| (_ Unit (emit (MInst.XmmRmREvex (Avx512Opcode.Vpermi2b) | ||
| src1 | ||
| src2 | ||
| dst)))) |
There was a problem hiding this comment.
I'm a little unhappy about this, but we don't have an encoding for xmm instructions that have three arguments currently.
There was a problem hiding this comment.
Yeah, we can definitely add such a thing later, and should I think; we'll get to this as part of our "no more mod operands" cleanup on regalloc operands, if not before.
cfallin
left a comment
There was a problem hiding this comment.
Thanks! This all looks great overall.
| (_ Unit (emit (gen_move $I8X16 dst src3))) | ||
| (_ Unit (emit (MInst.XmmRmREvex (Avx512Opcode.Vpermi2b) | ||
| src1 | ||
| src2 | ||
| dst)))) |
There was a problem hiding this comment.
Yeah, we can definitely add such a thing later, and should I think; we'll get to this as part of our "no more mod operands" cleanup on regalloc operands, if not before.
| ;; Produce a permutation suitable for use with `vpermi2b`, for permuting two | ||
| ;; I8X16 vectors simultaneously. NOTE: this will not avoid out-of-bounds values, | ||
| ;; and the internal lane value masking of vpermi2b will come into play. If you | ||
| ;; need the out-of-bounds behavior of shuffle, you'll need to also mask the |
There was a problem hiding this comment.
could we say what that behavior is and/or mention "CLIF-level shuffle" here? otherwise it's a bit unclear if one doesn't already have the context I think.
| ;; However, if the shuffle mask contains no out-of-bounds values, we can use | ||
| ;; `vpermi2b` without any masking. | ||
| (rule (lower (has_type (and (avx512vl_enabled) (avx512vbmi_enabled)) | ||
| (shuffle a b (vec_mask_from_immediate mask)))) |
There was a problem hiding this comment.
Here we're relying on implicit firing-order heuristics (the above rule before this one, specifically perm_from_mask... extractor before mask variable binding); I think tests below should ensure this works properly but just wanted to call it out to be sure.
There was a problem hiding this comment.
Yep, we should have enough test coverage to catch problems with these two: there are precise-output tests for both rules.
| (x64_pshufb a (x64_xmm_load_const $I8X16 (shuffle_0_15_mask mask))) | ||
| (x64_pshufb b (x64_xmm_load_const $I8X16 (shuffle_16_31_mask mask))))) | ||
|
|
||
| ;; Rules for `shuffle` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; |
There was a problem hiding this comment.
Thanks for catching that!
| v2 = shuffle v0, v1, [3 0 32 255 4 6 12 11 23 13 24 4 2 97 17 5] | ||
| return v2 | ||
| } | ||
| ; run: %shuffle_zeros([1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16], [17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32]) == [4 1 0 0 5 7 13 12 24 14 25 5 3 0 18 6] |
There was a problem hiding this comment.
Excellent -- it sounds like you've tested on avx512 hardware?
(As an aside, it might be nice to think about ways to use qemu to get us AVX512 testing in CI, if the GitHub Actions host doesn't natively have it. @abrown any thoughts on that?)
Lower
shuffleandswizzlein ISLE.This PR surfaced a bug with the lowering of
shufflewhen avx512vl and avx512vbmi are enabled: we usevpermi2bas the implementation, but panic if the immediate shuffle mask contains any out-of-bounds values. The behavior when the avx512 extensions are not present is that out-of-bounds values are turned into0in the result.I've resolved this by detecting when the shuffle immediate has out-of-bounds indices in the avx512-enabled lowering, and generating an additional mask to zero out the lanes where those indices occur. This brings the avx512 case into line with the semantics of the
shuffleop:wasmtime/cranelift/codegen/meta/src/shared/instructions.rs
Lines 1495 to 1498 in 94bcbe8