x64: port select to ISLE by abrown · Pull Request #3682 · bytecodealliance/wasmtime

abrown · 2022-01-11T22:03:28Z

This change ports CLIF's select instruction to ISLE for x64.

abrown · 2022-01-11T22:11:22Z

This is a complex lowering due to not only the tree-matching of the CLIF IR but also the sequences required by x64. Because of this the PR will need to go through a few more changes:

add a way to emit a sequence like the following for FloatCC.Equal and FloatCC.NotEqual: UNORD* + MOV + JNP + MOV + JZ + MOV--because this sequence must reuse the same flags (and avoid modifications to them, @cfallin suggested implementing a new Inst::XmmCmoveOr (but that also implies building a new Inst::CmoveOr as well)
add a way to emit conditional moves for multiple register values
implement the integer comparison side of these instructions (both the tree-matching version and the non-tree-matching versions)

I'm putting this PR up as a draft to see if I can get some help with the above and early feedback on any suggested refactoring to the rules.

github-actions · 2022-01-11T22:31:59Z

Subscribe to Label Action

cc @cfallin, @fitzgen

Details

This issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle
fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

cfallin · 2022-01-12T18:51:24Z

+                size: *size,
+                consequent: consequent.clone(),
+                alternative: alternative.clone(),
+                dst: dst.clone(),


This overwrites dst from above, I think -- it seems we want the first cmove to write a tmp, and alternative in the second to be that tmp? So then we have the equivalent of cmove(cc1, a, cmove(cc2, b, alternative)).

fitzgen

This is looking pretty good to me!

I think with some additional run tests we can start landing this incremental work and then finish up with the rest of the select lowerings in follow up PRs (if you want; if you prefer doing everything in one big PR that's no problem too).

abrown · 2022-01-15T00:56:25Z

@cfallin, @fitzgen: I am finding some very strange behavior with ISLE and I'm not sure what is going on. Take, e.g., the following runtest:

test run
target x86_64

function %select_le_i32(f32, f32) -> i32 {
block0(v0: f32, v1: f32):
    v2 = fcmp le v0, v1
    v3 = iconst.i32 1
    v4 = iconst.i32 0
    v5 = select v2, v3, v4
    return v5
}
; run: %select_le_i32(0x42.42, 0.0) == 0
; run: %select_le_i32(0.0, 0.0) == 1
; run: %select_le_i32(0x0.0, 0x42.42) == 1
; run: %select_le_i32(0x0.0, NaN) == 0

function %select_le_f32(f32, f32) -> f32 {
block0(v0: f32, v1: f32):
    v2 = fcmp le v0, v1
    v3 = f32const 0x1.0
    v4 = f32const 0x0.0
    v5 = select v2, v3, v4
    return v5
}
; run: %select_le_f32(0x42.42, 0.0) == 0x0.0
; run: %select_le_f32(0.0, 0.0) == 0x1.0
; run: %select_le_f32(0x0.0, 0x42.42) == 0x1.0
; run: %select_le_f32(0x0.0, NaN) == 0x0.0

If we run that (RUST_LOG=cranelift_codegen::machinst::compile=trace cargo run -p cranelift-tools -- test -dv scratch.clif), the second set of tests fail:

FAIL scratch.clif: run

Caused by:
    Failed test: run: %select_le_f32(0x1.090800p6, 0.0) == 0.0, actual: 0x1.000000p

The reason for this is that %xmm0 and %xmm1 are magically "flipped" when passed to the comparison instruction, ucomiss. The first function, select_le_i32, has the correct order (ucomiss %xmm0, %xmm1) whereas the second does not (ucomiss %xmm1, %xmm0). This is visible in the log statements for each (in order):

 TRACE cranelift_codegen::machinst::compile > vcode after regalloc: final version:
VCode_ShowWithRRU {{
  Entry block: 0
Block 0:
  (original IR block: block0)
  (instruction range: 0 .. 10)
  Inst 0:   pushq   %rbp
  Inst 1:   movq    %rsp, %rbp
  Inst 2:   xorl    %esi, %esi
  Inst 3:   movl    $1, %edi
  Inst 4:   ucomiss %xmm0, %xmm1
  Inst 5:   cmovnbl %edi, %esi
  Inst 6:   movq    %rsi, %rax
  Inst 7:   movq    %rbp, %rsp
  Inst 8:   popq    %rbp
  Inst 9:   ret
}}

 TRACE cranelift_codegen::machinst::compile > vcode after regalloc: final version:
VCode_ShowWithRRU {{
  Entry block: 0
Block 0:
  (original IR block: block0)
  (instruction range: 0 .. 11)
  Inst 0:   pushq   %rbp
  Inst 1:   movq    %rsp, %rbp
  Inst 2:   xorps   %xmm2, %xmm2
  Inst 3:   movl    $1065353216, %esi
  Inst 4:   movd    %esi, %xmm3
  Inst 5:   ucomiss %xmm1, %xmm0
  Inst 6:   jb $next; movss %xmm3, %xmm2; $next: 
  Inst 7:   movaps  %xmm2, %xmm0
  Inst 8:   movq    %rbp, %rsp
  Inst 9:   popq    %rbp
  Inst 10:   ret
}}

In the code on this branch, both lowerings use the fpcmp helper for the comparison and the cmove helper to emit either a true CMOV in the first case or a jump version in the second. I would expect both to have the same ucomiss ordering but they do not.

I then try to determine if in fact the "less than or equal" rule is firing in both cases. Using a well-placed panic! I now see that it is not firing in the first case but is in the second (this implies that I will have to rather unintuitively flip the operands in the rule to make the lowering come out right).

But why is the rule not firing in the first case? After debugging through generated_code.rs for a while, it becomes clear that it is bailing out in the fpcmp helper. Ok, I guess the type being passed in to the helper should be the comparison type, not the select type. [...spends time trying to figure out how to grab that type]. I guess first that has_type wraps def_inst--nope. Does it wrap the fcmp itself? Nope. How about wrapping one of fcmp's operands? That fails with an Unknown variable 'a'.

Any help?

fitzgen · 2022-01-18T17:50:53Z

Ok, I guess the type being passed in to the helper should be the comparison type, not the select type. [...spends time trying to figure out how to grab that type].

You can use value_type to extract the type of a value, e.g. an operand to the instruction you are lowering:

(rule (lower (select val @ (value_type val_ty) ...))
      ...)

abrown · 2022-01-29T01:46:57Z

@fitzgen, @cfallin: this should be ready for review. The lower.isle code is a lot more readable now but I suspect there are better ways to do things; concretely, to somehow avoid the "use values instead of regs" pattern and the hard-coding in the i128 versions of cmove.

I've also opened #3743 and #3744 as a result of this work. Work that is still TODO but could probably go in separate PRs is to translate the other side of the select, the integer side, to ISLE and upgrade the CLIF fuzzer to generate these kinds of instructions (it only emits simple numerics now).

fitzgen

How confident are you in these lowerings? We talked before about getting the clif interpreter and fuzzer working for f32/f64 constants, fcmp, and select. Should we do that before landing these changes?

fitzgen · 2022-01-31T18:41:07Z

+      (let ((cons ValueRegs (put_in_regs consequent))
+            (alt ValueRegs (put_in_regs alternative))


I think that cmove_from_values should take these ValueRegs instead of Values. This lets us reuse it in places where we've already put the values into registers.

(decl cmove_from_values (Type CC ValueRegs ValueRegs) ConsumesFlags) (rule (cmove_from_vlaue $I128 cc consequent alternative) ...)

The downside to that is that it bloats the rules in lower.isle; can we do this refactoring in a future PR or issue, e.g., the next time cmove is used?

Sure, that works for me.

fitzgen · 2022-01-31T18:46:01Z

+            ;; TODO there must be some way to say that both of these
+            ;; instructions consume the flags together.
+            (_ Unit (emit (MInst.Cmove size cc (RegMem.Reg (value_regs_get cons 0)) (value_regs_get alt 0) dst1))))


Doing the emit here is actually unsafe because nothing is guaranteeing that

we just emitted the flags producer before this

nothing else is emitted after the flags producer and before this flags consumer that clobbers the flags

What we want to do is return two ConsumesFlags here that we can then pass into with_flags_2 along with a FlagsProducer.

ISLE doesn't have tuple/fixed-size-arrays support, so we will have to define an extern type for this.

But! The other rule for cmove_from_values for when it is a single-register value emits only one ConsumesFlags, so I think we actually want an extern type that is either one or two ConsumesFlags. I think we can just use a SmallVec<[ConsumesFlags; 2]>.

Note that we will want a with_flags_* variant that takes the SmallVec<[ConsumesFlags; 2]> as well, and it will probably have to be implemented in external Rust code so it can loop over each ConsumesFlags in the small vec.

This sounds like a lot of new infrastructure to build and this PR may already have enough of that. The current lowering rules perform the same unsafe emission as this PR. Can't this be done afterwards?

The current lowering rules for select are written in Rust, so it isn't really comparable. It would be one thing if this emit was near the emit for the flags-producer instruction and the emit for the other flags-consuming instruction, but it isn't. It isn't even clear to me, reading this code, that the instructions will always be emitted in the correct order!

It seems to me that the order of evaluation will be something like

construct FlagsProducer (do not emit it)

call this constructor

emit first cmove

construct FlagsConsumer with the second cmove

call a with_flags variant with the constructed FlagsProducer and FlagsConsumer

emit the producer

emit the consumer

If that is actually how things end up evaluating, then we would get cmove; test/cmp; cmove instead of test/cmp; cmove; cmove.

It looks like we aren't testing select with i128s yet (since you have only ported floating point selects so far). But this is exactly the kind bug I want to avoid with a dedicated with_flags_* variant that takes SmallVec<[ConsumesFlags; 2]>, as described above.

I really think we should fix this issue before landing this PR, or change this PR to not define and use cmove_from_values and instead use plain cmove (which will work for all the single-register values for select that are are currently supported in ISLE).

Ok, @cfallin and I (but mostly @cfallin), worked on this a bit. The unpolished results are in 667935f--@fitzgen, thoughts?

The extensions to with_flags, ProducesFlags, and ConsumesFlags look good!

abrown · 2022-02-01T21:35:58Z

How confident are you in these lowerings? We talked before about getting the clif interpreter and fuzzer working for f32/f64 constants, fcmp, and select. Should we do that before landing these changes?

I would vote "no." ISLE basically creates a conflict for every rebased commit requiring an ISLE rebuild and Git twiddling to ensure things are right before moving on. For those of use who try to keep a clean history with atomic commits, this additional burden makes keeping a long-running PR like this pretty painful. As you can see, I'm now just committing review patches in hopes that I can do one final rebase before merging.

Also, the CLIF interpreter has not been extended for any of the other ISLE lowerings so doing it here is extra work that no other PRs have had to do. I don't mind doing this kind of thing but the "build up the infrastructure before merging" tasks are starting to pile up: bringing FloatCC to ISLE (which turned into bringing all enums to ISLE), create a new machinst variant for CmoveOr, create ISLE helpers for various things (cmove, FP comparison, register class checks), create CLIF run tests for select, find and report bugs with extractors, and (just recently added) figure out a way to express that two instructions consume the same flags. So I was hoping to not have to build up a fuzz target to get this merged.

cfallin · 2022-02-01T21:41:50Z

@abrown @fitzgen from my own perspective at least, I think it's fair to merge this at its current state and take some followup PRs/tasks -- I agree with @abrown that this pretty much maintains our status-quo in the sense that it passes existing tests. Deferring to @fitzgen on the detailed review discussions above of course...

abrown · 2022-02-18T21:38:56Z

Not sure why aarch64 and s390x CLIF tests are failing--they shouldn't be running!

cfallin

r+ from me at least, having been through this code while resolving some issues with @abrown last week, I think the basic approach is sound and this is more than enough improvement to merge at its current state, even if we still have further to go to complete the transition on this opcode. Also, as you mentioned above, the rebasing pain increases the longer we keep this pending, and there are other changes (e.g. implicit conversions) I want to land that should come after this.

It looks like there are some filetest failures, but I didn't look in-depth at what's causing them; perhaps some holes in the pattern coverage or some code that can't quite be deleted from the handwritten side yet?

It looks like @fitzgen is happy with the refactorings we did together but I'll defer to him to make sure he's happy with the final state of this.

Thanks again for all the patience here!

fitzgen

Yep, let's merge this and incrementally continue improving it in follow ups.

@cfallin

This change includes quite a few interlocking parts, required mainly by the current x64 conventions in ISLE: - it adds a way to emit a `cmove` with multiple OR-ing conditions; because x64 ISLE cannot currently safely emit a comparison followed by several jumps, this adds `MachInst::CmoveOr` and `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro instructions hide the multi-instruction sequence in `lower.isle` - to properly keep track of what instructions consume and produce flags, @cfallin added a way to pass around variants of `ConsumesFlags` and `ProducesFlags`--these changes affect all backends - then, to lower the `fcmp + select` CLIF, this change adds several `cmove*_from_values` helpers that perform all of the awkward conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one upside is that now these lowerings have much-improved documentation explaining why the various `FloatCC` and `CC` choices are made the the way they are. Co-authored-by: Chris Fallin <chris@cfallin.org>

The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in bytecodealliance#2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in bytecodealliance#3682. The existing fcmp-of-load test in `cmp-mem-bug` (from bytecodealliance#2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.

The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in #2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in #3682. The existing fcmp-of-load test in `cmp-mem-bug` (from #2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.

@cfallin

* x64: port `select` using an FP comparison to ISLE This change includes quite a few interlocking parts, required mainly by the current x64 conventions in ISLE: - it adds a way to emit a `cmove` with multiple OR-ing conditions; because x64 ISLE cannot currently safely emit a comparison followed by several jumps, this adds `MachInst::CmoveOr` and `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro instructions hide the multi-instruction sequence in `lower.isle` - to properly keep track of what instructions consume and produce flags, @cfallin added a way to pass around variants of `ConsumesFlags` and `ProducesFlags`--these changes affect all backends - then, to lower the `fcmp + select` CLIF, this change adds several `cmove*_from_values` helpers that perform all of the awkward conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one upside is that now these lowerings have much-improved documentation explaining why the various `FloatCC` and `CC` choices are made the the way they are. Co-authored-by: Chris Fallin <chris@cfallin.org>

The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in bytecodealliance#2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in bytecodealliance#3682. The existing fcmp-of-load test in `cmp-mem-bug` (from bytecodealliance#2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.

The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in #2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in #3682. The existing fcmp-of-load test in `cmp-mem-bug` (from #2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.

abrown requested review from cfallin and fitzgen January 11, 2022 22:11

abrown force-pushed the isle-select branch from 3c8ca00 to 0c52d92 Compare January 11, 2022 22:13

github-actions Bot added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen isle Related to the ISLE domain-specific language labels Jan 11, 2022

abrown commented Jan 11, 2022

View reviewed changes

Comment thread cranelift/filetests/filetests/isa/x64/cmp-mem-bug.clif Outdated

cfallin reviewed Jan 12, 2022

View reviewed changes

fitzgen reviewed Jan 12, 2022

View reviewed changes

abrown force-pushed the isle-select branch from 83b433f to 1cc1463 Compare January 27, 2022 23:43

abrown marked this pull request as ready for review January 29, 2022 01:38

abrown mentioned this pull request Jan 29, 2022

Cranelift: mov not removed when lowering select #3744

Closed

fitzgen reviewed Jan 31, 2022

View reviewed changes

Mrmaxmeier mentioned this pull request Feb 1, 2022

cranelift/x64: lower min and max for <= i64 #3748

Merged

github-actions Bot added the cranelift:area:aarch64 Issues related to AArch64 backend. label Feb 15, 2022

abrown force-pushed the isle-select branch 2 times, most recently from bcc3249 to dff48f9 Compare February 18, 2022 21:15

abrown requested review from cfallin and fitzgen February 18, 2022 21:39

cfallin approved these changes Feb 22, 2022

View reviewed changes

fitzgen approved these changes Feb 22, 2022

View reviewed changes

review: fix aarch64 flag use

2e809ec

abrown force-pushed the isle-select branch from dff48f9 to 2e809ec Compare February 23, 2022 01:51

review: fix s390x flag use

c6777da

abrown merged commit f87c611 into bytecodealliance:main Feb 23, 2022

abrown deleted the isle-select branch February 23, 2022 18:03

cfallin mentioned this pull request Mar 16, 2022

x64 backend: fix fpcmp to avoid load-op merging. #3934

Merged

ulan mentioned this pull request Apr 11, 2022

Register allocation panic in Wasmtime v0.35.2 #4010

Closed

		(let ((cons ValueRegs (put_in_regs consequent))
		(alt ValueRegs (put_in_regs alternative))

Conversation

abrown commented Jan 11, 2022

Uh oh!

abrown commented Jan 11, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jan 11, 2022

Subscribe to Label Action

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fitzgen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

abrown commented Jan 15, 2022

Uh oh!

fitzgen commented Jan 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abrown commented Jan 29, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fitzgen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

abrown commented Feb 1, 2022

Uh oh!

cfallin commented Feb 1, 2022

Uh oh!

abrown commented Feb 18, 2022

Uh oh!

cfallin left a comment

Choose a reason for hiding this comment

Uh oh!

fitzgen left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

abrown commented Jan 11, 2022 •

edited

Loading

fitzgen commented Jan 18, 2022 •

edited

Loading

abrown commented Jan 29, 2022 •

edited

Loading