Skip to content

Implement the relaxed SIMD proposal#5892

Merged
alexcrichton merged 10 commits into
bytecodealliance:mainfrom
alexcrichton:relaxed-simd-maybe
Mar 7, 2023
Merged

Implement the relaxed SIMD proposal#5892
alexcrichton merged 10 commits into
bytecodealliance:mainfrom
alexcrichton:relaxed-simd-maybe

Conversation

@alexcrichton
Copy link
Copy Markdown
Member

This commit is an implementation of the [relaxed SIMD proposal][proposal] which is expected to soon move to phase 4. This is a relatively small proposal which adds 20 new instructions operating over the preexisting v128 type. The purpose of this proposal is to expose instructions that are not guaranteed to be deterministic across host platforms. For example the instructions added have different behavior for some inputs on x86_64 and aarch64. I believe the intention is that many real-world use cases for these instructions have inputs which "avoid the edge cases" to have the same behavior across platforms.

Additionally as part of the specification of the relaxed SIMD proposal, however, there are annotated deterministic semantics for each instruction. In other words while each instruction is allowed to have different semantics across platforms there is still one listed canonical set of semantics. This commit implements this knobs as a Config option in Wasmtime. By default all relaxed simd instructions use their host-defined semantics (the most efficient ones) but a new Config::relaxed_simd_deterministic knob can be used to alter the codegen and force deterministic semantics.

An example of this is that the f32x4.relaxed_madd instruction, a fused multiply and add instruction, can either be a fused operation or separate operations. By default Wasmtime will select the most efficient one, namely using two operations on x86_64 when the fma feature isn't available. If the relaxed_simd_deterministic option is enabled, however, then regardless of whether fma is available the fused semantics will be implemented as they're the deterministic semantics for this instruction.

This commit includes an implementation of these instructions for the x64 and the aarch64 backend. The aarch64 implementation ended up being trivial since all of its instructions match the deterministic semantics of the proposal. Only support for encoding the fmls instruction was added since it wasn't already implemented. On x86_64 the implementation is much more subtle and exploits various edge cases in the proposal (leading me to the conclusion that the relaxed simd proposal is basically the "let's acknowledge the nonstandard, but dominant, semantics of x86_64 proposal").

Implementation-wise this commit adds a number of new instructions to Cranelift all prefixed with arch_*. I've tried to document that the intended equivalence to a deterministic instruction and additionally document the edge cases (currently they've all got edge cases for x86_64 and none for aarch64, which I believe represents the current state of things). There may be better methods to do this, however, since this strategy does not seems scalable into the future if dozens-to-hundreds more instructions get proposed and implemented.

There are a number of more optimal AVX512-based lowerings for the x86_64 backend for these instructions, but I haven't implemented any of them in this PR. All the support for x86_64 is based on AVX and below.

Procedurally the spec tests were copied from the upstream proposal into the Wasmtime repository here since the WebAssembly/testsuite repo is somewhat inactive right now and it was the easiest way to integrate them. When the upstream repository is updated the local copies here can be deleted in favor of the upstream implementations.

Finally I have not done rigorous testing of this PR yet. I've tried to be meticulous with rationalizing the spec-defined semantics of these wasm instructions against the architecture-specific behavior of each instruction. This has resulted in a number of PRs to the upstream specification to align with the intended instructions or clarify some certain lowerings. For example x86_64's lowering of i16x8.relaxed_laneselect is not correct with respect to the current specification but there is an issue open for this upstream. There is unfortunately no great way to fuzz these instructions inherently due to their nondeterministic semantics. Without a differential fuzzer all we can really do is verify that they compile which doesn't really provide much of a guarantee. My hope is that this can be considered more carefully before any decision to enable this proposal by default.

@cfallin
Copy link
Copy Markdown
Member

cfallin commented Feb 27, 2023

This is a substantial bit of work (regardless of the "small" proposal) -- thanks for this!

I have some thoughts on the Cranelift side below. To preface these thoughts: this is a good change, and we need to support this proposal. However, determinism has been a significant discussion point in CLIF and has significant implications for testing and verification, so I want to suggest that we might consider some tweaks to the approach. So:

Implementation-wise this commit adds a number of new instructions to Cranelift all prefixed with arch_*. I've tried to document that the intended equivalence to a deterministic instruction and additionally document the edge cases (currently they've all got edge cases for x86_64 and none for aarch64, which I believe represents the current state of things). There may be better methods to do this, however, since this strategy does not seems scalable into the future if dozens-to-hundreds more instructions get proposed and implemented.

Skimming the added instructions' definitions, my main concern is basically a mirror of the determinism concern at the Wasm standards discussion level. In CLIF, we've had a lot of discussion around determism as it relates to endianness (#3369 and subsequent discussions and PRs), which became especially relevant as we built the CLIF interpreter and started to fuzz against it. If we let CLIF semantics be partly defined by architecture-specific details, then we can't possibly define an interpreter behavior that matches all backends. Our solution in the endianness case was to reify endian as an explicit detail in the IR, and let loads/stores be tagged as such. Ultimately in that case one can support big+little endian on all machines with the help of bswaps (and s390x actually does, so Wasm lowerings work), but that's a separate and additional step no different from any other "support operand combinations" enhancement.

This is important for verification efforts too: we shouldn't have CLIF whose meaning depends on the compilation target. We could adopt a refinement semantics approach ("result may be one of the values in this set" and then compiled code refines those sets) but that's harder to reason about in general.

I think a similar approach as we had with endianness might work here: define a Cranelift setting that is something like "arch_* behavior", or more granular settings for each ("insertlane behavior", ...), set them as appropriate for the expected target, and implement the appropriate lowerings (and all lowerings in the interpreter).

There are some issues with that -- namely, we now need to plumb "expected native settings for target T" up to the CLIF producer, and the actual CLIF will now differ on x86-64 and aarch64 for programs that use these opcodes -- but the semantics are at least defined fully and machine-independently for any CLIF.

Thoughts? Also, the Cranelift meeting on Wednesday would probably be a good venue to get a wider set of thoughts on this :-)

@cfallin
Copy link
Copy Markdown
Member

cfallin commented Feb 27, 2023

(An addendum to the above: in the endianness case we have a third endian, "native", which lets us avoid the plumbing problem of "which endian is faster / even supported" in the CLIF producer; but IMHO that's a bit of a wart, which I'd like to remove eventually for all of the reasons above, to fully capture the meaning in the IR. So I'd prefer we bite the bullet and plumb this through with settings and "machine info" of some sort.)

@jameysharp
Copy link
Copy Markdown
Contributor

If we want to offer the option of using the deterministic semantics for the relaxed SIMD proposal anyway, what if we start by shipping only that version? Then we can support the new wasm instructions now, while taking some more time to think through how best to expose architecture-specific instructions in Cranelift.

@github-actions github-actions Bot added cranelift Issues related to the Cranelift code generator cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. cranelift:wasm wasmtime:api Related to the API of the `wasmtime` crate itself wasmtime:config Issues related to the configuration of Wasmtime labels Feb 27, 2023
@github-actions
Copy link
Copy Markdown

Subscribe to Label Action

cc @peterhuene

Details This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:x64", "cranelift:meta", "cranelift:wasm", "wasmtime:api", "wasmtime:config"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@github-actions
Copy link
Copy Markdown

Label Messager: wasmtime:config

It looks like you are changing Wasmtime's configuration options. Make sure to
complete this check list:

  • If you added a new Config method, you wrote extensive documentation for
    it.

    Details

    Our documentation should be of the following form:

    Short, simple summary sentence.
    
    More details. These details can be multiple paragraphs. There should be
    information about not just the method, but its parameters and results as
    well.
    
    Is this method fallible? If so, when can it return an error?
    
    Can this method panic? If so, when does it panic?
    
    # Example
    
    Optional example here.
    
  • If you added a new Config method, or modified an existing one, you
    ensured that this configuration is exercised by the fuzz targets.

    Details

    For example, if you expose a new strategy for allocating the next instance
    slot inside the pooling allocator, you should ensure that at least one of our
    fuzz targets exercises that new strategy.

    Often, all that is required of you is to ensure that there is a knob for this
    configuration option in wasmtime_fuzzing::Config (or one
    of its nested structs).

    Rarely, this may require authoring a new fuzz target to specifically test this
    configuration. See our docs on fuzzing for more details.

  • If you are enabling a configuration option by default, make sure that it
    has been fuzzed for at least two weeks before turning it on by default.


Details

To modify this label's message, edit the .github/label-messager/wasmtime-config.md file.

To add new label messages or remove existing label messages, edit the
.github/label-messager.json configuration file.

Learn more.

@alexcrichton
Copy link
Copy Markdown
Member Author

Definitely happy to discuss more at a Cranelift meeting! Sorry should have clarified that in the PR description, I figured the precise method of implementing this would be contentious and want to emphasize that this current iteration is just one point in the design space and I'm happy to adjust as preferred. My motivation for working on this was I wanted to add the instructions to the Rust standard library, but doing that required having a runtime that implements them, and since Wasmtime is currently used to validate simd intrinsics for Rust I figured I may as well give it a stab to implement here. In that sense I'm in no rush to get something in, and I would prefer to not land anything unless there's a path forward to getting the fast lowerings of these instructions since I suspect it's not really that useful of a proposal if only the deterministic lowerings are offered.


One alternative is to basically just implement new x86_* instructions for CLIF which aren't implemented for any backend other than x86 (unless someone really really wants to do that) and are defined in terms of the actual x86 instructions. I don't know how scalable this is though in the sense that this PR only implements AArch64 and x86 lowerings for the relaxed simd instructions and while I don't believe any aarch64_* instructions would be necessary I don't know if CLIF would need to grow riscv64_* or s390x_* instructions.

Another alternative would be to take inspiration which has intrinsics of the form @llvm.x86.* where any backend could have any number of intrinsics which are well-defined for each backend. CLIF AFAIK isn't really set up for this right now, though, but would provide more easy extensibility while retaining "everything can in theory be implemented on every platform for every other platform simultaneously".

These alternatives, as noted, do indeed lift the complexity into the producer but that's not really the end of the world if the main producer here for these instructions is Wasmtime.

@cfallin
Copy link
Copy Markdown
Member

cfallin commented Feb 28, 2023

One alternative is to basically just implement new x86_* instructions for CLIF which aren't implemented for any backend other than x86 (unless someone really really wants to do that) and are defined in terms of the actual x86 instructions. I don't know how scalable this is though in the sense that this PR only implements AArch64 and x86 lowerings for the relaxed simd instructions and while I don't believe any aarch64_* instructions would be necessary I don't know if CLIF would need to grow riscv64_* or s390x_* instructions.

I actually kind of like this too, weirdly enough: it's dead-simple, and it doesn't create instructions whose semantics are "parameterized" on a Cranelift setting. If anyone were perplexed by "x86" appearing in a machine-independent IR, we could expand it to x86like_insertlane or whatever -- the ISA name is the inspiration for the semantics but the semantics are machine-independent once we define this.

@jameysharp
Copy link
Copy Markdown
Contributor

A WebAssembly module can't select which code path to use based on which wasm extensions are available in the host, right? In that case I think there is value in having any implementation at all of the relaxed SIMD instructions, even if for some reason we never get around to making them fast on x86, so we can at least run modules that use them. I also think it's good to merge the uncontroversial parts of this first while we sort out the path to making it fast, so later we're iterating on a smaller PR.

@alexcrichton
Copy link
Copy Markdown
Member Author

I've split out #5895, #5896, and #5897, rebased this PR on top of those, and then split the bulk into two commits -- the first adds all the scaffolding/support/deterministic lowerings and the second adds all the x86-specific goop. I went ahead and switched to x86_* style instructions in CLIF, but I'd still like to discuss this in tomorrow's Cranelift meeting.

Copy link
Copy Markdown
Member

@abrown abrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me; as I mentioned today it is unfortunate that there is any special handling needed for x86 (i.e., is_x86()) but it's not really clear what else can be done. Nice work pulling this together, @alexcrichton! I had some suggestions that may or may not be necessary, and it might be good for someone else to put eyes on this as well.

Comment thread cranelift/codegen/meta/src/shared/instructions.rs Outdated
Comment thread cranelift/wasm/src/code_translator.rs
Comment thread cranelift/wasm/src/code_translator.rs
Operator::I16x8DotI8x16I7x16S => {
let (a, b) = pop2_with_bitcast(state, I8X16, builder);
state.push1(
if environ.relaxed_simd_deterministic() || !environ.is_x86() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? (same below):

Suggested change
if environ.relaxed_simd_deterministic() || !environ.is_x86() {
if !environ.relaxed_simd_deterministic() || !environ.is_x86() {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently intentional where if determinism is required it takes the first branch which has the deterministic lowering in terms of CLIF instruction semantics. The branch isn't taken though for x86 when speed is desired b/c x86 has a faster lowering.

If you have a suggestion though to restructure this to more obviously communicate this intent it's probably good to implement though since the current if is pretty subtle and I think easy to lose track of.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense. In terms of communication, I guess the missing piece of information for me was that I couldn't remember off the top of my head which lowering had the deterministic semantics. Now, obviously, I should have trusted that the if environ.relaxed_simd_deterministic() side of the branch was the deterministic lowering but in this review I was starting to wonder which lowering was which. This may just be a review issue, but if you wanted to make it extra clear the deterministic lowering could get a comment like: // This is the deterministic lowering for ..., see https://github.com/WebAssembly/relaxed-simd/....

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the spec isn't quite in the state where there's a URL to link to unfortunately. I don't think that the deterministic semantics are actually written down anywhere other than the spec draft currently. I'll try to remember to update things here though once that's update.

Otherwise though I went ahead and added comments to all the deterministic cases at least english-explaining what the intention was.

Comment thread build.rs
@alexcrichton
Copy link
Copy Markdown
Member Author

@cfallin or @jameysharp did y'all want to take another pass over this? If not I can go ahead and hit the button too

Copy link
Copy Markdown
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on the Cranelift bits; I'm happy with this approach. Thanks!

Copy link
Copy Markdown
Contributor

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this, with a few small details fixed. Thanks!

Comment thread cranelift/codegen/src/isa/mod.rs Outdated
Comment thread crates/wasmtime/src/config.rs Outdated
(v128.const i16x8 -32768 -32767 32767 0 0 0 0 0)
(v128.const i16x8 -32768 -32768 32767 0 0 0 0 0))
;; overflows, return either INT16_MIN or INT16_MAX
(v128.const i16x8 -1 -1 -1 -1 -1 -1 -1 -1))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems wrong since only one possible outcome is asserted.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is just copied from the upstream spec which I think is just a copy/paste artifact, but if you'd like I think they're pretty accepting of PRs.

alexcrichton and others added 7 commits March 7, 2023 07:10
This commit adds initial scaffolding and support for the Relaxed SIMD
proposal for WebAssembly. Codegen support is supported on the x64 and
AArch64 backends on this time.

The purpose of this commit is to get all the boilerplate out of the way
in terms of plumbing through a new feature, adding tests, etc. The tests
are copied from the upstream repository at this time while the
WebAssembly/testsuite repository hasn't been updated.

A summary of changes made in this commit are:

* Lowerings for all relaxed simd opcodes have been added, currently all
  exhibiting deterministic behavior. This means that few lowerings are
  optimal on the x86 backend, but on the AArch64 backend, for example,
  all lowerings should be optimal.

* Support is added to codegen to, eventually, conditionally generate
  different code based on input codegen flags. This is intended to
  enable codegen to more efficient instructions on x86 by default, for
  example, while still allowing embedders to force
  architecture-independent semantics and behavior. One good example of
  this is the `f32x4.relaxed_fmadd` instruction which when deterministic
  forces the `fma` instruction, but otherwise if the backend doesn't
  have support for `fma` then intermediate operations are performed
  instead.

* Lowerings of `iadd_pairwise` for `i16x8` and `i32x4` were added to the
  x86 backend as they're now exercised by the deterministic lowerings of
  relaxed simd instructions.

* Sample codegen tests for added for x86 and aarch64 for some relaxed
  simd instructions.

* Wasmtime embedder support for the relaxed-simd proposal and forcing
  determinism have been added to `Config` and the CLI.

* Support has been added to the `*.wast` runtime execution for the
  `(either ...)` matcher used in the relaxed-simd proposal.

* Tests for relaxed-simd are run both with a default `Engine` as well as
  a "force deterministic" `Engine` to test both configurations.

* All tests from the upstream repository were copied into Wasmtime.
  These tests should be deleted when WebAssembly/testsuite is updated.
This commit builds on the prior commit and adds an array of `x86_*`
instructions to Cranelift which have semantics that match their
corresponding x86 equivalents. Translation for relaxed simd is then
additionally updated to conditionally generate different CLIF for
relaxed simd instructions depending on whether the target is x86 or not.
This means that for AArch64 no changes are made but for x86 most relaxed
instructions now lower to some x86-equivalent with slightly different
semantics than the "deterministic" lowering.
This will be required to implement the `f32x4.relaxed_madd` instruction
(and others) when an x86 host doesn't specify the `has_fma` feature.
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
@alexcrichton alexcrichton enabled auto-merge March 7, 2023 15:20
@alexcrichton alexcrichton added this pull request to the merge queue Mar 7, 2023
Merged via the queue into bytecodealliance:main with commit 8bb183f Mar 7, 2023
@alexcrichton alexcrichton deleted the relaxed-simd-maybe branch March 7, 2023 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. cranelift:wasm cranelift Issues related to the Cranelift code generator wasmtime:api Related to the API of the `wasmtime` crate itself wasmtime:config Issues related to the configuration of Wasmtime

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants