Vector register endianness causing ABI incompatibility

Currently, `s390x` is the only big-endian platform supported by cranelift, which has already caused problems related to memory accesses in the past, since wasmtime defines all data in its memory to be in little-endian byte order.

I had assumed that *register* contents are not affected by byte order, and while this is true for scalar types, it turns out this is not actually the case for vector registers.  This is because of a combination of two issues:

1. There are IR instructions that directly refer to vector register _lanes_ by number; and
2. There are IR instructions to re-interpret a vector register as any other vector type, including one with different lane numbers

The combination of these two makes ISA byte order properties of vector registers visible to IR code, just like byte order in memory is visible via a combination of using memory addresses to access (parts of) a value in memory in a different type than it was stored.

Specifically, in the Intel (or ARM) little-endian ISA, if you load e.g. a `I32X4` into a vector register, use `raw_bitcast` to re-interpret that register as `I8X16`, and retrieve lane 0 of the resulting value via `extractlane`, the result will be the *least*-significant byte of the `I32` in lane 0 of the original value.   And in fact, wasmtime *requires* this to be the case.

On the other hand, on `s390x` the content of a vector register in `I32X4` mode will be a sequence of four `I32`, each in *big-endian* byte order (or else the arithmetic operations on the register will give wrong results).   Therefore, the same sequence of operations as above will return the *most*-significant byte of the `I32` in lane 0.   This actually caused many wasmtime tests to fail.

---

To fix this, my current implementation of SIMD instructions on `s390x` uses a trick combining two different aspects:
- When loading the `I32X4` from little-endian memory into a vector register, I'm byte-reversing all 16 bytes of the loaded value.  This not only fixes each `I32` value to be in the correct big-endian order in the register so subsequent arithmetic will work, it *also* implicitly swaps the order of elements, i.e. the element in slot 0 in memory will end up in what the ISA considers slot 3 of the register etc.
- The implementation of all IR instructions that uses explicit lane numbers will be aware of this renumbering, and implicitly revert it to get back to the lanes the code intends to access, so e.g. using `extractlane` for lane 0 of a `I32X4` will actually at the ISA level extract lane 3 of the register.

The combination of these two aspects makes accessing SIMD registers work correctly for wasmtime.  For example, in the above case, accessing lane 0 of a `I8X16` is converted to lane 15 of the register, which holds the *least*-significant byte of the `I32` in lane 3 of the register, which was loaded from lane 0 in memory -- so in the end we return the least-significant byte of the `I32` in lane 0 of the original value, as expected by wasmtime.

---

However, in implementing support for `rustc_codegen_cranelift`, I noticed that the current implementation actually breaks when handling *big-endian* vectors in memory - this will be the case for rustc, since that uses native platform byte order everywhere.  Specifically, when loading a big-endian vector from memory, I'm just loading the bytes unchanged.  This means that e.g. lane 0 of a `I32X4` in memory ends up in its original byte order (which is OK since this is already big-endian) in lane 0 of the register - but the latter is a problem if subsequent code wants to extract that lane, since an `extractlane 0` will actually access lane 3 in the register as described above!

To work around *that* problem, I've implemented a patch that will perform big-endian vector loads by swapping the order of the *elements*, but not the byte order within any single element.  This will cause lane 0 from memory to end up in lane 3 in the register, and makes `extractlane` work as expected again.

With that fix, now *both* wasmtime and rustc_codegen_cranelift pass all their existing SIMD tests.   Unfortunately, I think this is still not quite a complete solution.

---

Specifically, we can now run into two additional problems with big-endian code, which apparently are just never triggered by the existing rustc_codegen_cranelift tests.

First, I guess it would be possible to re-interpret contents in a vector register in another type even in rustc.   Now, as opposed to wasmtime, rustc uses native endianness, presumably also w.r.t. vector contents.   Therefore, the semantics of such a re-interpretation would be platform-defined and differ between big- and little-endian platforms (which is probably why it's not frequently used).  However, users would expect this platform-specific behavior to be the *same* between the LLVM and cranelift back ends to rustc - which in the current implementation it would not be.

Even more problematic, carrying vector elements in reverse order in vector registers actually affects the *ABI*, as vector types are passed in vector registers.   Code compiled by rustc using the LLVM back end would expect those to be in the "normal" platform order, while code compiled by rustc using the cranelift back end would expect them to be the "reverse" order.

---

One option I'm thinking of would be to actually implement *both* methods in the cranelift back end.  Specifically, the back end could support both a "vector big-endian" and "vector little-endian" mode, where the "big-endian" mode would use lane numbers directly as defined by our ISA, while the "little-endian" mode would use the reverse ordering implemented by the current back end code.

There's a slight complication in that we might have to support both big- and little-endian vector modes in load and store operations accessing any combination of big- and little-endian memory locations.  But that should be possible:
```
vector mode      memory byte order    load/store operation
big-endian       big-endian           direct   
big-endian       little-endian        byte-reverse each element
little-endian    big-endian           reverse order of elements 
little-endian    little-endian        byte-reverse 16-byte input
```
(Starting with `z15` we actually can implement each of these operations using a single instruction, so that should also be efficient.)

The one remaining question is, how to select between the "vector big-endian" and "vector little-endian" modes?   There are no attributes (like `MemFlags`) on the `extractlane` etc. operations, and that wouldn't even make sense: this is a global property, if you used big-endian mode to load the vector you must also use big-endian mode on all subsequent operations.

So I guess this would have to be some sort of global flag, which wasmtime would set to always little-endian, and rustc would leave at native byte order.   Of course, this flag is actually ABI changing, so the same setting must be used for code to remain link-compatible.  But I think given the above use cases, that should actually be fine.

---

FYI @cfallin - this is another problem I ran into while enabling `rustc_codegen_cranelift`  - I'd appreciate any comments or suggestions!




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vector register endianness causing ABI incompatibility #4566

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Vector register endianness causing ABI incompatibility #4566

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions