x64: port select to ISLE#3682
Conversation
|
This is a complex lowering due to not only the tree-matching of the CLIF IR but also the sequences required by x64. Because of this the PR will need to go through a few more changes:
I'm putting this PR up as a draft to see if I can get some help with the above and early feedback on any suggested refactoring to the rules. |
Subscribe to Label ActionDetailsThis issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "isle"Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
| size: *size, | ||
| consequent: consequent.clone(), | ||
| alternative: alternative.clone(), | ||
| dst: dst.clone(), |
There was a problem hiding this comment.
This overwrites dst from above, I think -- it seems we want the first cmove to write a tmp, and alternative in the second to be that tmp? So then we have the equivalent of cmove(cc1, a, cmove(cc2, b, alternative)).
fitzgen
left a comment
There was a problem hiding this comment.
This is looking pretty good to me!
I think with some additional run tests we can start landing this incremental work and then finish up with the rest of the select lowerings in follow up PRs (if you want; if you prefer doing everything in one big PR that's no problem too).
|
@cfallin, @fitzgen: I am finding some very strange behavior with ISLE and I'm not sure what is going on. Take, e.g., the following runtest: If we run that ( The reason for this is that In the code on this branch, both lowerings use the I then try to determine if in fact the "less than or equal" rule is firing in both cases. Using a well-placed But why is the rule not firing in the first case? After debugging through Any help? |
You can use (rule (lower (select val @ (value_type val_ty) ...))
...) |
|
@fitzgen, @cfallin: this should be ready for review. The I've also opened #3743 and #3744 as a result of this work. Work that is still TODO but could probably go in separate PRs is to translate the other side of the |
fitzgen
left a comment
There was a problem hiding this comment.
How confident are you in these lowerings? We talked before about getting the clif interpreter and fuzzer working for f32/f64 constants, fcmp, and select. Should we do that before landing these changes?
| (let ((cons ValueRegs (put_in_regs consequent)) | ||
| (alt ValueRegs (put_in_regs alternative)) |
There was a problem hiding this comment.
I think that cmove_from_values should take these ValueRegs instead of Values. This lets us reuse it in places where we've already put the values into registers.
(decl cmove_from_values (Type CC ValueRegs ValueRegs) ConsumesFlags)
(rule (cmove_from_vlaue $I128 cc consequent alternative)
...)There was a problem hiding this comment.
The downside to that is that it bloats the rules in lower.isle; can we do this refactoring in a future PR or issue, e.g., the next time cmove is used?
| ;; TODO there must be some way to say that both of these | ||
| ;; instructions consume the flags together. | ||
| (_ Unit (emit (MInst.Cmove size cc (RegMem.Reg (value_regs_get cons 0)) (value_regs_get alt 0) dst1)))) |
There was a problem hiding this comment.
Doing the emit here is actually unsafe because nothing is guaranteeing that
- we just emitted the flags producer before this
- nothing else is emitted after the flags producer and before this flags consumer that clobbers the flags
What we want to do is return two ConsumesFlags here that we can then pass into with_flags_2 along with a FlagsProducer.
ISLE doesn't have tuple/fixed-size-arrays support, so we will have to define an extern type for this.
But! The other rule for cmove_from_values for when it is a single-register value emits only one ConsumesFlags, so I think we actually want an extern type that is either one or two ConsumesFlags. I think we can just use a SmallVec<[ConsumesFlags; 2]>.
There was a problem hiding this comment.
Note that we will want a with_flags_* variant that takes the SmallVec<[ConsumesFlags; 2]> as well, and it will probably have to be implemented in external Rust code so it can loop over each ConsumesFlags in the small vec.
There was a problem hiding this comment.
This sounds like a lot of new infrastructure to build and this PR may already have enough of that. The current lowering rules perform the same unsafe emission as this PR. Can't this be done afterwards?
There was a problem hiding this comment.
The current lowering rules for select are written in Rust, so it isn't really comparable. It would be one thing if this emit was near the emit for the flags-producer instruction and the emit for the other flags-consuming instruction, but it isn't. It isn't even clear to me, reading this code, that the instructions will always be emitted in the correct order!
It seems to me that the order of evaluation will be something like
- construct
FlagsProducer(do not emit it) - call this constructor
- emit first cmove
- construct FlagsConsumer with the second cmove
- call a
with_flagsvariant with the constructedFlagsProducerandFlagsConsumer- emit the producer
- emit the consumer
If that is actually how things end up evaluating, then we would get cmove; test/cmp; cmove instead of test/cmp; cmove; cmove.
It looks like we aren't testing select with i128s yet (since you have only ported floating point selects so far). But this is exactly the kind bug I want to avoid with a dedicated with_flags_* variant that takes SmallVec<[ConsumesFlags; 2]>, as described above.
I really think we should fix this issue before landing this PR, or change this PR to not define and use cmove_from_values and instead use plain cmove (which will work for all the single-register values for select that are are currently supported in ISLE).
There was a problem hiding this comment.
The extensions to with_flags, ProducesFlags, and ConsumesFlags look good!
I would vote "no." ISLE basically creates a conflict for every rebased commit requiring an ISLE rebuild and Git twiddling to ensure things are right before moving on. For those of use who try to keep a clean history with atomic commits, this additional burden makes keeping a long-running PR like this pretty painful. As you can see, I'm now just committing review patches in hopes that I can do one final rebase before merging. Also, the CLIF interpreter has not been extended for any of the other ISLE lowerings so doing it here is extra work that no other PRs have had to do. I don't mind doing this kind of thing but the "build up the infrastructure before merging" tasks are starting to pile up: bringing |
|
@abrown @fitzgen from my own perspective at least, I think it's fair to merge this at its current state and take some followup PRs/tasks -- I agree with @abrown that this pretty much maintains our status-quo in the sense that it passes existing tests. Deferring to @fitzgen on the detailed review discussions above of course... |
bcc3249 to
dff48f9
Compare
|
Not sure why aarch64 and s390x CLIF tests are failing--they shouldn't be running! |
cfallin
left a comment
There was a problem hiding this comment.
r+ from me at least, having been through this code while resolving some issues with @abrown last week, I think the basic approach is sound and this is more than enough improvement to merge at its current state, even if we still have further to go to complete the transition on this opcode. Also, as you mentioned above, the rebasing pain increases the longer we keep this pending, and there are other changes (e.g. implicit conversions) I want to land that should come after this.
It looks like there are some filetest failures, but I didn't look in-depth at what's causing them; perhaps some holes in the pattern coverage or some code that can't quite be deleted from the handwritten side yet?
It looks like @fitzgen is happy with the refactorings we did together but I'll defer to him to make sure he's happy with the final state of this.
Thanks again for all the patience here!
fitzgen
left a comment
There was a problem hiding this comment.
Yep, let's merge this and incrementally continue improving it in follow ups.
This change includes quite a few interlocking parts, required mainly by the current x64 conventions in ISLE: - it adds a way to emit a `cmove` with multiple OR-ing conditions; because x64 ISLE cannot currently safely emit a comparison followed by several jumps, this adds `MachInst::CmoveOr` and `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro instructions hide the multi-instruction sequence in `lower.isle` - to properly keep track of what instructions consume and produce flags, @cfallin added a way to pass around variants of `ConsumesFlags` and `ProducesFlags`--these changes affect all backends - then, to lower the `fcmp + select` CLIF, this change adds several `cmove*_from_values` helpers that perform all of the awkward conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one upside is that now these lowerings have much-improved documentation explaining why the various `FloatCC` and `CC` choices are made the the way they are. Co-authored-by: Chris Fallin <chris@cfallin.org>
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in bytecodealliance#2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in bytecodealliance#3682. The existing fcmp-of-load test in `cmp-mem-bug` (from bytecodealliance#2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in #2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in #3682. The existing fcmp-of-load test in `cmp-mem-bug` (from #2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.
* x64: port `select` using an FP comparison to ISLE This change includes quite a few interlocking parts, required mainly by the current x64 conventions in ISLE: - it adds a way to emit a `cmove` with multiple OR-ing conditions; because x64 ISLE cannot currently safely emit a comparison followed by several jumps, this adds `MachInst::CmoveOr` and `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro instructions hide the multi-instruction sequence in `lower.isle` - to properly keep track of what instructions consume and produce flags, @cfallin added a way to pass around variants of `ConsumesFlags` and `ProducesFlags`--these changes affect all backends - then, to lower the `fcmp + select` CLIF, this change adds several `cmove*_from_values` helpers that perform all of the awkward conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one upside is that now these lowerings have much-improved documentation explaining why the various `FloatCC` and `CC` choices are made the the way they are. Co-authored-by: Chris Fallin <chris@cfallin.org>
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in bytecodealliance#2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in bytecodealliance#3682. The existing fcmp-of-load test in `cmp-mem-bug` (from bytecodealliance#2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in #2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in #3682. The existing fcmp-of-load test in `cmp-mem-bug` (from #2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.
This change ports CLIF's
selectinstruction to ISLE for x64.