Add i32.popcnt and i64.popcnt to winch #6531
Conversation
saulecabrera
left a comment
Thanks for putting this together! The current changes look good to me. I have one thought/comment though: are you planning on providing a fallback implementation for `popcnt` if the `has_popcnt` flag is not enabled, similar to what Cranelift provides? If it's too burdensome to do as part of this PR, a TODO is also totally fine I think, but we might want to consider updating the fuzzing configuration so that it always enables the `has_popcnt` flag to avoid failures when fuzzing winch. Even though we only enable the fuzzer locally, it's still useful for verifying our changes.
You might also want to add `<I32|I64>Popcnt` here: https://github.com/bytecodealliance/wasmtime/blob/main/fuzz/fuzz_targets/differential.rs#L335
Yeah I can take a crack at that! I'll also add it to the fuzz targets.

@saulecabrera I added two fallback implementations.
saulecabrera
left a comment
I think that it'd be nice to save a couple of instructions if possible, so I'm leaning towards the second fallback. This is also how SpiderMonkey handles this case: https://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/MacroAssembler-x86-shared-inl.h#110
I left a couple of extra comments, but feel free to ignore if those are things that you were already thinking about!
```rust
});

let fives = regs::scratch();
self.load_constant(&0x5555555555555555, fives, size);
```
If I'm not wrong, in the case of a fallback we'd need an extra temporary register, because the scratch register is clobbered by all the calls to `load_constant`, right?
If that's the case, perhaps we can pass a mutable reference to the `CodeGenContext` to `MacroAssembler::popcnt` and do the dispatching at that level; if we need to emit the fallback, we can request an extra temporary register via `CodeGenContext::any_gpr`. Similar to the implementation of `clz`: https://github.com/bytecodealliance/wasmtime/blob/main/winch/codegen/src/isa/x64/masm.rs#L343
> scratch is clobbered by all the calls to `load_constant`, right?
Oh this is good to know! I didn't know that. I am still learning the relationships between the different reg types and how their different abstraction layers are used, so if it seems like I'm doing something weird, please let me know :)
I'll take the approach in your second comment!
```rust
    }
}

fn popcnt_fallback2(&mut self, size: OperandSize, reg: Reg) {
```
Could we replace the emit calls in this function with the emit functions in the assembler? That has the added benefit that it already handles constant loading based on the operand size.
I think so. So, for instance, I could replace this:

```rust
self.emit(Inst::AluRmiR {
    size: size.into(),
    op: AluRmiROpcode::Sub,
    src1: reg.into(),
    src2: masked1.into(),
    dst: reg.into(),
});
```
with something like this:

```rust
self.sub_rr(reg, masked, size);
```
right? I noticed some of the instructions don't have corresponding functions on the assembler (`shiftr`, `and`); should I add functions for those?
Yeah exactly.
For shifts we have `shift_ir` and `shift_rr` in the assembler, and for `and` we have `and_rr` and `and_ir`; I think that should cover all the cases for the lowering here. But if it doesn't, feel free to make any additions to the assembler!
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>

Commits:
- Move popcnt fallback up into the macroassembler
- Share code between 32-bit and 64-bit popcnt
- Add Popcnt to winch differential fuzzing
I moved the fallback logic up into the MacroAssembler. I was running into some tricky ownership issues when trying to pass the …
jameysharp
left a comment
Seems right to me! I agree that tests are a good idea before merging but assuming those go okay then I think this is good.
```rust
self.asm.sub(dst.into(), tmp.into(), size);

// x = (x & m2) + ((x >> 2) & m2);
self.asm.mov(tmp.into(), dst.into(), size);
```
Can this and the other `mov` below be written like this? I'm not sure it makes any difference, but it seems a little more clear to me if we can avoid using `.into()` so much. I'm also curious if there are other methods named perhaps `sub_rr` or `and_ir` for the other instructions in this sequence.
```diff
- self.asm.mov(tmp.into(), dst.into(), size);
+ self.asm.mov_rr(tmp, dst, size);
```
> I'm also curious if there are other methods named perhaps `sub_rr` or `and_ir` for the other instructions in this sequence.

There are! I'll change those.
The scratch register was getting clobbered by the calls to `and`, so instead this passes a `CodeGenContext` to the masm's `popcnt` and lets it handle its own registers.
@saulecabrera I moved the register management bit to the MacroAssembler and added a couple of filetests for the fallback. I manually tested the fallback by forcing that branch in the code (just added an …). It seems to behave as intended! Let me know if there's anything else that needs cleaning up, otherwise I think this might be good to go!
saulecabrera
left a comment
Left one minor comment, but overall this looks great to me, thanks!
```rust
src: Gpr::new(src.into()).unwrap().into(),
dst: Writable::from_reg(Gpr::new(src.into()).unwrap()),
```
We currently have `impl From<Reg> for Gpr` and `impl From<Reg> for WritableGpr`, so you could use those here if you wanted to reduce the boilerplate.
```rust
    self.asm.popcnt(src, size);
    context.stack.push(Val::reg(src));
} else {
    let tmp = context.any_gpr(self);
```
Would you add a comment that the fallback is based on `MacroAssembler::popcnt32` in https://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/MacroAssembler-x86-shared-inl.h?
```rust
let (masks, shift_amt) = match size {
    OperandSize::S64 => (
        [
            0x5555555555555555, // m1
            0x3333333333333333, // m2
            0x0f0f0f0f0f0f0f0f, // m4
            0x0101010101010101, // h01
        ],
        56u8,
    ),
    // 32-bit popcount is the same, except the masks are half as
    // wide and we shift by 24 at the end rather than 56
    OperandSize::S32 => (
        [0x55555555i64, 0x33333333i64, 0x0f0f0f0fi64, 0x01010101i64],
        24u8,
    ),
};
```
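For reference, the bit-twiddling sequence these masks drive can be sketched in plain Rust. This is a standalone illustration of the SWAR popcount algorithm, not the Winch lowering itself; `popcnt64` and `popcnt32` are illustrative names:

```rust
// SWAR popcount mirroring the masks above (m1, m2, m4, h01).
fn popcnt64(mut x: u64) -> u32 {
    x -= (x >> 1) & 0x5555555555555555; // 2-bit partial sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333); // 4-bit sums
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f; // per-byte sums
    (x.wrapping_mul(0x0101010101010101) >> 56) as u32 // total lands in top byte
}

// Same idea with half-width masks and a final shift of 24 rather than 56.
fn popcnt32(mut x: u32) -> u32 {
    x -= (x >> 1) & 0x55555555;
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0f0f0f0f;
    x.wrapping_mul(0x01010101) >> 24
}

fn main() {
    assert_eq!(popcnt64(0), 0);
    assert_eq!(popcnt64(u64::MAX), 64);
    assert_eq!(popcnt64(0b1011), 3);
    assert_eq!(popcnt32(u32::MAX), 32);
    assert_eq!(popcnt32(0x80000001), 2);
    println!("ok");
}
```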
This is perfectly reasonable as-is but it keeps bothering me that the constant masks are duplicated in this way. One alternative might be:
```rust
let masks = [
    0x5555555555555555, // m1
    0x3333333333333333, // m2
    0x0f0f0f0f0f0f0f0f, // m4
    0x0101010101010101, // h01
];
let (mask, shift_amt) = match size {
    OperandSize::S64 => (u64::MAX as i64, 56u8),
    OperandSize::S32 => (u32::MAX as i64, 24u8),
};
```
Then use e.g. `masks[0] & mask`. (Maybe rename `mask`, though.)
Another approach is to generate the constants using bit-twiddling tricks. I don't think this is a good idea, but I spent enough time figuring out how it would work that I'm going to write it down anyway. You can divide `u64::MAX` or `u32::MAX` by certain constants to get these repeating patterns: specifically, `[0x3, 0x5, 0x11, 0xff]` to produce `[0x55..., 0x33..., 0x0f..., 0x01...]` respectively.
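As a quick sanity check of that divisor trick (a standalone snippet, not part of the PR):

```rust
// Dividing an all-ones value by these small constants reproduces the
// repeating popcount masks.
fn main() {
    assert_eq!(u64::MAX / 0x3, 0x5555555555555555); // m1
    assert_eq!(u64::MAX / 0x5, 0x3333333333333333); // m2
    assert_eq!(u64::MAX / 0x11, 0x0f0f0f0f0f0f0f0f); // m4
    assert_eq!(u64::MAX / 0xff, 0x0101010101010101); // h01
    // Same trick at 32 bits:
    assert_eq!(u32::MAX / 0x3, 0x55555555);
    assert_eq!(u32::MAX / 0xff, 0x01010101);
    println!("ok");
}
```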
I thiiiink I'm inclined to keep it as is, though I don't feel too strongly about that. I initially had the 32-bit and 64-bit code as completely separate branches (this is what SpiderMonkey and Cranelift do) and wasn't sure if I should even combine them in the first place. I kind of like the different constants being explicitly in the code, but I also see that using one to generate the other is tempting. Happy to go either way here.
Yup, I don't feel strongly about it either, so I'm good with leaving it as is. 👍
```rust
// x -= (x >> 1) & m1;
self.asm.shift_ir(1u8, dst, ShiftKind::ShrU, size);
self.asm.and(RegImm::imm(masks[0]).into(), dst.into(), size);
```
I was pretty confused about why you didn't use `and_ir`, until I dug into the implementation enough to understand that x86 only supports 32-bit immediates for `and`, so doing it this way lets the underlying assembler decide whether to use `load_constant` into the scratch register or to emit the immediate operand inline.
I don't know what to suggest, but maybe there's a comment that could go here. Or maybe we should just assume that people reading this are more familiar with Winch idioms than I am. 😆
Also, it's a little unfortunate that in the 64-bit case, `masks[1]` gets loaded into the scratch register twice. It would be nice to avoid that, but it would clutter the implementation here and maybe that's not worth doing.
> Or maybe we should just assume that people reading this are more familiar with Winch idioms than I am. 😆
The scratch register has caused confusion, I agree. This is something that I'd like to improve, so I'm considering adding guards, which will hopefully make its usage traceable and more explicit.
I just haven't gotten to it 😄
Yeah, I did have the thought about `0x333...` getting loaded in twice, so I had briefly replaced the second `RegImm::imm(masks[1]).into()` with `regs::scratch().into()`, but that wouldn't work in the 32-bit case, and trying to work around that seemed like added complexity for not a lot of gain.
I don't think it would be terrible to unconditionally call `load_constant` here to put the `0x333...` mask in the scratch register and use `and_rr` explicitly. It's one extra instruction in the 32-bit case, but it's only a move-immediate, and the generated code is shorter due to only emitting the constant once.