Implement the relaxed SIMD proposal#5892
Conversation
|
This is a substantial bit of work (regardless of the "small" proposal) -- thanks for this! I have some thoughts on the Cranelift side below. To preface these thoughts: this is a good change, and we need to support this proposal. However, determinism has been a significant discussion point in CLIF and has significant implications for testing and verification, so I want to suggest that we might consider some tweaks to the approach. So:
Skimming the added instructions' definitions, my main concern is basically a mirror of the determinism concern at the Wasm standards discussion level. In CLIF, we've had a lot of discussion around determism as it relates to endianness (#3369 and subsequent discussions and PRs), which became especially relevant as we built the CLIF interpreter and started to fuzz against it. If we let CLIF semantics be partly defined by architecture-specific details, then we can't possibly define an interpreter behavior that matches all backends. Our solution in the endianness case was to reify endian as an explicit detail in the IR, and let loads/stores be tagged as such. Ultimately in that case one can support big+little endian on all machines with the help of bswaps (and s390x actually does, so Wasm lowerings work), but that's a separate and additional step no different from any other "support operand combinations" enhancement. This is important for verification efforts too: we shouldn't have CLIF whose meaning depends on the compilation target. We could adopt a refinement semantics approach ("result may be one of the values in this set" and then compiled code refines those sets) but that's harder to reason about in general. I think a similar approach as we had with endianness might work here: define a Cranelift setting that is something like " There are some issues with that -- namely, we now need to plumb "expected native settings for target T" up to the CLIF producer, and the actual CLIF will now differ on x86-64 and aarch64 for programs that use these opcodes -- but the semantics are at least defined fully and machine-independently for any CLIF. Thoughts? Also, the Cranelift meeting on Wednesday would probably be a good venue to get a wider set of thoughts on this :-) |
|
(An addendum to the above: in the endianness case we have a third endian, "native", which lets us avoid the plumbing problem of "which endian is faster / even supported" in the CLIF producer; but IMHO that's a bit of a wart, which I'd like to remove eventually for all of the reasons above, to fully capture the meaning in the IR. So I'd prefer we bite the bullet and plumb this through with settings and "machine info" of some sort.) |
|
If we want to offer the option of using the deterministic semantics for the relaxed SIMD proposal anyway, what if we start by shipping only that version? Then we can support the new wasm instructions now, while taking some more time to think through how best to expose architecture-specific instructions in Cranelift. |
Subscribe to Label Actioncc @peterhuene DetailsThis issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:x64", "cranelift:meta", "cranelift:wasm", "wasmtime:api", "wasmtime:config"Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
Label Messager: wasmtime:configIt looks like you are changing Wasmtime's configuration options. Make sure to
DetailsTo modify this label's message, edit the To add new label messages or remove existing label messages, edit the |
|
Definitely happy to discuss more at a Cranelift meeting! Sorry should have clarified that in the PR description, I figured the precise method of implementing this would be contentious and want to emphasize that this current iteration is just one point in the design space and I'm happy to adjust as preferred. My motivation for working on this was I wanted to add the instructions to the Rust standard library, but doing that required having a runtime that implements them, and since Wasmtime is currently used to validate simd intrinsics for Rust I figured I may as well give it a stab to implement here. In that sense I'm in no rush to get something in, and I would prefer to not land anything unless there's a path forward to getting the fast lowerings of these instructions since I suspect it's not really that useful of a proposal if only the deterministic lowerings are offered. One alternative is to basically just implement new Another alternative would be to take inspiration which has intrinsics of the form These alternatives, as noted, do indeed lift the complexity into the producer but that's not really the end of the world if the main producer here for these instructions is Wasmtime. |
I actually kind of like this too, weirdly enough: it's dead-simple, and it doesn't create instructions whose semantics are "parameterized" on a Cranelift setting. If anyone were perplexed by "x86" appearing in a machine-independent IR, we could expand it to |
|
A WebAssembly module can't select which code path to use based on which wasm extensions are available in the host, right? In that case I think there is value in having any implementation at all of the relaxed SIMD instructions, even if for some reason we never get around to making them fast on x86, so we can at least run modules that use them. I also think it's good to merge the uncontroversial parts of this first while we sort out the path to making it fast, so later we're iterating on a smaller PR. |
7d4c0bf to
8c81699
Compare
|
I've split out #5895, #5896, and #5897, rebased this PR on top of those, and then split the bulk into two commits -- the first adds all the scaffolding/support/deterministic lowerings and the second adds all the x86-specific goop. I went ahead and switched to |
72cd658 to
94f85d0
Compare
abrown
left a comment
There was a problem hiding this comment.
Makes sense to me; as I mentioned today it is unfortunate that there is any special handling needed for x86 (i.e., is_x86()) but it's not really clear what else can be done. Nice work pulling this together, @alexcrichton! I had some suggestions that may or may not be necessary, and it might be good for someone else to put eyes on this as well.
| Operator::I16x8DotI8x16I7x16S => { | ||
| let (a, b) = pop2_with_bitcast(state, I8X16, builder); | ||
| state.push1( | ||
| if environ.relaxed_simd_deterministic() || !environ.is_x86() { |
There was a problem hiding this comment.
Maybe? (same below):
| if environ.relaxed_simd_deterministic() || !environ.is_x86() { | |
| if !environ.relaxed_simd_deterministic() || !environ.is_x86() { |
There was a problem hiding this comment.
This is currently intentional where if determinism is required it takes the first branch which has the deterministic lowering in terms of CLIF instruction semantics. The branch isn't taken though for x86 when speed is desired b/c x86 has a faster lowering.
If you have a suggestion though to restructure this to more obviously communicate this intent it's probably good to implement though since the current if is pretty subtle and I think easy to lose track of.
There was a problem hiding this comment.
Ok, makes sense. In terms of communication, I guess the missing piece of information for me was that I couldn't remember off the top of my head which lowering had the deterministic semantics. Now, obviously, I should have trusted that the if environ.relaxed_simd_deterministic() side of the branch was the deterministic lowering but in this review I was starting to wonder which lowering was which. This may just be a review issue, but if you wanted to make it extra clear the deterministic lowering could get a comment like: // This is the deterministic lowering for ..., see https://github.com/WebAssembly/relaxed-simd/....
There was a problem hiding this comment.
I think the spec isn't quite in the state where there's a URL to link to unfortunately. I don't think that the deterministic semantics are actually written down anywhere other than the spec draft currently. I'll try to remember to update things here though once that's update.
Otherwise though I went ahead and added comments to all the deterministic cases at least english-explaining what the intention was.
|
@cfallin or @jameysharp did y'all want to take another pass over this? If not I can go ahead and hit the button too |
cfallin
left a comment
There was a problem hiding this comment.
LGTM on the Cranelift bits; I'm happy with this approach. Thanks!
jameysharp
left a comment
There was a problem hiding this comment.
I'm happy with this, with a few small details fixed. Thanks!
| (v128.const i16x8 -32768 -32767 32767 0 0 0 0 0) | ||
| (v128.const i16x8 -32768 -32768 32767 0 0 0 0 0)) | ||
| ;; overflows, return either INT16_MIN or INT16_MAX | ||
| (v128.const i16x8 -1 -1 -1 -1 -1 -1 -1 -1)) |
There was a problem hiding this comment.
This comment seems wrong since only one possible outcome is asserted.
There was a problem hiding this comment.
Ah this is just copied from the upstream spec which I think is just a copy/paste artifact, but if you'd like I think they're pretty accepting of PRs.
This commit adds initial scaffolding and support for the Relaxed SIMD proposal for WebAssembly. Codegen support is supported on the x64 and AArch64 backends on this time. The purpose of this commit is to get all the boilerplate out of the way in terms of plumbing through a new feature, adding tests, etc. The tests are copied from the upstream repository at this time while the WebAssembly/testsuite repository hasn't been updated. A summary of changes made in this commit are: * Lowerings for all relaxed simd opcodes have been added, currently all exhibiting deterministic behavior. This means that few lowerings are optimal on the x86 backend, but on the AArch64 backend, for example, all lowerings should be optimal. * Support is added to codegen to, eventually, conditionally generate different code based on input codegen flags. This is intended to enable codegen to more efficient instructions on x86 by default, for example, while still allowing embedders to force architecture-independent semantics and behavior. One good example of this is the `f32x4.relaxed_fmadd` instruction which when deterministic forces the `fma` instruction, but otherwise if the backend doesn't have support for `fma` then intermediate operations are performed instead. * Lowerings of `iadd_pairwise` for `i16x8` and `i32x4` were added to the x86 backend as they're now exercised by the deterministic lowerings of relaxed simd instructions. * Sample codegen tests for added for x86 and aarch64 for some relaxed simd instructions. * Wasmtime embedder support for the relaxed-simd proposal and forcing determinism have been added to `Config` and the CLI. * Support has been added to the `*.wast` runtime execution for the `(either ...)` matcher used in the relaxed-simd proposal. * Tests for relaxed-simd are run both with a default `Engine` as well as a "force deterministic" `Engine` to test both configurations. * All tests from the upstream repository were copied into Wasmtime. These tests should be deleted when WebAssembly/testsuite is updated.
This commit builds on the prior commit and adds an array of `x86_*` instructions to Cranelift which have semantics that match their corresponding x86 equivalents. Translation for relaxed simd is then additionally updated to conditionally generate different CLIF for relaxed simd instructions depending on whether the target is x86 or not. This means that for AArch64 no changes are made but for x86 most relaxed instructions now lower to some x86-equivalent with slightly different semantics than the "deterministic" lowering.
This will be required to implement the `f32x4.relaxed_madd` instruction (and others) when an x86 host doesn't specify the `has_fma` feature.
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
2bba800 to
d72e6ee
Compare
This commit is an implementation of the [relaxed SIMD proposal][proposal] which is expected to soon move to phase 4. This is a relatively small proposal which adds 20 new instructions operating over the preexisting
v128type. The purpose of this proposal is to expose instructions that are not guaranteed to be deterministic across host platforms. For example the instructions added have different behavior for some inputs on x86_64 and aarch64. I believe the intention is that many real-world use cases for these instructions have inputs which "avoid the edge cases" to have the same behavior across platforms.Additionally as part of the specification of the relaxed SIMD proposal, however, there are annotated deterministic semantics for each instruction. In other words while each instruction is allowed to have different semantics across platforms there is still one listed canonical set of semantics. This commit implements this knobs as a
Configoption in Wasmtime. By default all relaxed simd instructions use their host-defined semantics (the most efficient ones) but a newConfig::relaxed_simd_deterministicknob can be used to alter the codegen and force deterministic semantics.An example of this is that the
f32x4.relaxed_maddinstruction, a fused multiply and add instruction, can either be a fused operation or separate operations. By default Wasmtime will select the most efficient one, namely using two operations on x86_64 when thefmafeature isn't available. If therelaxed_simd_deterministicoption is enabled, however, then regardless of whetherfmais available the fused semantics will be implemented as they're the deterministic semantics for this instruction.This commit includes an implementation of these instructions for the x64 and the aarch64 backend. The aarch64 implementation ended up being trivial since all of its instructions match the deterministic semantics of the proposal. Only support for encoding the
fmlsinstruction was added since it wasn't already implemented. On x86_64 the implementation is much more subtle and exploits various edge cases in the proposal (leading me to the conclusion that the relaxed simd proposal is basically the "let's acknowledge the nonstandard, but dominant, semantics of x86_64 proposal").Implementation-wise this commit adds a number of new instructions to Cranelift all prefixed with
arch_*. I've tried to document that the intended equivalence to a deterministic instruction and additionally document the edge cases (currently they've all got edge cases for x86_64 and none for aarch64, which I believe represents the current state of things). There may be better methods to do this, however, since this strategy does not seems scalable into the future if dozens-to-hundreds more instructions get proposed and implemented.There are a number of more optimal AVX512-based lowerings for the x86_64 backend for these instructions, but I haven't implemented any of them in this PR. All the support for x86_64 is based on AVX and below.
Procedurally the spec tests were copied from the upstream proposal into the Wasmtime repository here since the WebAssembly/testsuite repo is somewhat inactive right now and it was the easiest way to integrate them. When the upstream repository is updated the local copies here can be deleted in favor of the upstream implementations.
Finally I have not done rigorous testing of this PR yet. I've tried to be meticulous with rationalizing the spec-defined semantics of these wasm instructions against the architecture-specific behavior of each instruction. This has resulted in a number of PRs to the upstream specification to align with the intended instructions or clarify some certain lowerings. For example x86_64's lowering of
i16x8.relaxed_laneselectis not correct with respect to the current specification but there is an issue open for this upstream. There is unfortunately no great way to fuzz these instructions inherently due to their nondeterministic semantics. Without adifferentialfuzzer all we can really do is verify that they compile which doesn't really provide much of a guarantee. My hope is that this can be considered more carefully before any decision to enable this proposal by default.