Currently, s390x is the only big-endian platform supported by cranelift, which has already caused problems related to memory accesses in the past, since wasmtime defines all data in its memory to be in little-endian byte order.
I had assumed that register contents are not affected by byte order, and while this is true for scalar types, it turns out this is not actually the case for vector registers. This is because of a combination of two issues:
- There are IR instructions that directly refer to vector register lanes by number; and
- There are IR instructions to re-interpret a vector register as any other vector type, including one with different lane numbers
The combination of these two makes ISA byte order properties of vector registers visible to IR code, just like byte order in memory is visible via a combination of using memory addresses to access (parts of) a value in memory in a different type than it was stored.
Specifically, in the Intel (or ARM) little-endian ISA, if you load e.g. a I32X4 into a vector register, use raw_bitcast to re-interpret that register as I8X16, and retrieve lane 0 of the resulting value via extractlane, the result will be the least-significant byte of the I32 in lane 0 of the original value. And in fact, wasmtime requires this to be the case.
On the other hand, on s390x the content of a vector register in I32X4 mode will be a sequence of four I32, each in big-endian byte order (or else the arithmetic operations on the register will give wrong results). Therefore, the same sequence of operations as above will return the most-significant byte of the I32 in lane 0. This actually caused many wasmtime tests to fail.
To fix this, my current implementation of SIMD instructions on s390x uses a trick combining two different aspects:
- When loading the
I32X4 from little-endian memory into a vector register, I'm byte-reversing all 16 bytes of the loaded value. This not only fixes each I32 value to be in the correct big-endian order in the register so subsequent arithmetic will work, it also implicitly swaps the order of elements, i.e. the element in slot 0 in memory will end up in what the ISA considers slot 3 of the register etc.
- The implementation of all IR instructions that uses explicit lane numbers will be aware of this renumbering, and implicitly revert it to get back to the lanes the code intends to access, so e.g. using
extractlane for lane 0 of a I32X4 will actually at the ISA level extract lane 3 of the register.
The combination of these two aspects makes accessing SIMD registers work correctly for wasmtime. For example, in the above case, accessing lane 0 of a I8X16 is converted to lane 15 of the register, which holds the least-significant byte of the I32 in lane 3 of the register, which was loaded from lane 0 in memory -- so in the end we return the least-significant byte of the I32 in lane 0 of the original value, as expected by wasmtime.
However, in implementing support for rustc_codegen_cranelift, I noticed that the current implementation actually breaks when handling big-endian vectors in memory - this will be the case for rustc, since that uses native platform byte order everywhere. Specifically, when loading a big-endian vector from memory, I'm just loading the bytes unchanged. This means that e.g. lane 0 of a I32X4 in memory ends up in its original byte order (which is OK since this is already big-endian) in lane 0 of the register - but the latter is a problem if subsequent code wants to extract that lane, since an extractlane 0 will actually access lane 3 in the register as described above!
To work around that problem, I've implemented a patch that will perform big-endian vector loads by swapping the order of the elements, but not the byte order within any single element. This will cause lane 0 from memory to end up in lane 3 in the register, and makes extractlane work as expected again.
With that fix, now both wasmtime and rustc_codegen_cranelift pass all their existing SIMD tests. Unfortunately, I think this is still not quite a complete solution.
Specifically, we can now run into two additional problems with big-endian code, which apparently are just never triggered by the existing rustc_codegen_cranelift tests.
First, I guess it would be possible to re-interpret contents in a vector register in another type even in rustc. Now, as opposed to wasmtime, rustc uses native endianness, presumably also w.r.t. vector contents. Therefore, the semantics of such a re-interpretation would be platform-defined and differ between big- and little-endian platforms (which is probably why it's not frequently used). However, users would expect this platform-specific behavior to be the same between the LLVM and cranelift back ends to rustc - which in the current implementation it would not be.
Even more problematic, carrying vector elements in reverse order in vector registers actually affects the ABI, as vector types are passed in vector registers. Code compiled by rustc using the LLVM back end would expect those to be in the "normal" platform order, while code compiled by rustc using the cranelift back end would expect them to be the "reverse" order.
One option I'm thinking of would be to actually implement both methods in the cranelift back end. Specifically, the back end could support both a "vector big-endian" and "vector little-endian" mode, where the "big-endian" mode would use lane numbers directly as defined by our ISA, while the "little-endian" mode would use the reverse ordering implemented by the current back end code.
There's a slight complication in that we might have to support both big- and little-endian vector modes in load and store operations accessing any combination of big- and little-endian memory locations. But that should be possible:
vector mode memory byte order load/store operation
big-endian big-endian direct
big-endian little-endian byte-reverse each element
little-endian big-endian reverse order of elements
little-endian little-endian byte-reverse 16-byte input
(Starting with z15 we actually can implement each of these operations using a single instruction, so that should also be efficient.)
The one remaining question is, how to select between the "vector big-endian" and "vector little-endian" modes? There are no attributes (like MemFlags) on the extractlane etc. operations, and that wouldn't even make sense: this is a global property, if you used big-endian mode to load the vector you must also use big-endian mode on all subsequent operations.
So I guess this would have to be some sort of global flag, which wasmtime would set to always little-endian, and rustc would leave at native byte order. Of course, this flag is actually ABI changing, so the same setting must be used for code to remain link-compatible. But I think given the above use cases, that should actually be fine.
FYI @cfallin - this is another problem I ran into while enabling rustc_codegen_cranelift - I'd appreciate any comments or suggestions!
Currently,
s390xis the only big-endian platform supported by cranelift, which has already caused problems related to memory accesses in the past, since wasmtime defines all data in its memory to be in little-endian byte order.I had assumed that register contents are not affected by byte order, and while this is true for scalar types, it turns out this is not actually the case for vector registers. This is because of a combination of two issues:
The combination of these two makes ISA byte order properties of vector registers visible to IR code, just like byte order in memory is visible via a combination of using memory addresses to access (parts of) a value in memory in a different type than it was stored.
Specifically, in the Intel (or ARM) little-endian ISA, if you load e.g. a
I32X4into a vector register, useraw_bitcastto re-interpret that register asI8X16, and retrieve lane 0 of the resulting value viaextractlane, the result will be the least-significant byte of theI32in lane 0 of the original value. And in fact, wasmtime requires this to be the case.On the other hand, on
s390xthe content of a vector register inI32X4mode will be a sequence of fourI32, each in big-endian byte order (or else the arithmetic operations on the register will give wrong results). Therefore, the same sequence of operations as above will return the most-significant byte of theI32in lane 0. This actually caused many wasmtime tests to fail.To fix this, my current implementation of SIMD instructions on
s390xuses a trick combining two different aspects:I32X4from little-endian memory into a vector register, I'm byte-reversing all 16 bytes of the loaded value. This not only fixes eachI32value to be in the correct big-endian order in the register so subsequent arithmetic will work, it also implicitly swaps the order of elements, i.e. the element in slot 0 in memory will end up in what the ISA considers slot 3 of the register etc.extractlanefor lane 0 of aI32X4will actually at the ISA level extract lane 3 of the register.The combination of these two aspects makes accessing SIMD registers work correctly for wasmtime. For example, in the above case, accessing lane 0 of a
I8X16is converted to lane 15 of the register, which holds the least-significant byte of theI32in lane 3 of the register, which was loaded from lane 0 in memory -- so in the end we return the least-significant byte of theI32in lane 0 of the original value, as expected by wasmtime.However, in implementing support for
rustc_codegen_cranelift, I noticed that the current implementation actually breaks when handling big-endian vectors in memory - this will be the case for rustc, since that uses native platform byte order everywhere. Specifically, when loading a big-endian vector from memory, I'm just loading the bytes unchanged. This means that e.g. lane 0 of aI32X4in memory ends up in its original byte order (which is OK since this is already big-endian) in lane 0 of the register - but the latter is a problem if subsequent code wants to extract that lane, since anextractlane 0will actually access lane 3 in the register as described above!To work around that problem, I've implemented a patch that will perform big-endian vector loads by swapping the order of the elements, but not the byte order within any single element. This will cause lane 0 from memory to end up in lane 3 in the register, and makes
extractlanework as expected again.With that fix, now both wasmtime and rustc_codegen_cranelift pass all their existing SIMD tests. Unfortunately, I think this is still not quite a complete solution.
Specifically, we can now run into two additional problems with big-endian code, which apparently are just never triggered by the existing rustc_codegen_cranelift tests.
First, I guess it would be possible to re-interpret contents in a vector register in another type even in rustc. Now, as opposed to wasmtime, rustc uses native endianness, presumably also w.r.t. vector contents. Therefore, the semantics of such a re-interpretation would be platform-defined and differ between big- and little-endian platforms (which is probably why it's not frequently used). However, users would expect this platform-specific behavior to be the same between the LLVM and cranelift back ends to rustc - which in the current implementation it would not be.
Even more problematic, carrying vector elements in reverse order in vector registers actually affects the ABI, as vector types are passed in vector registers. Code compiled by rustc using the LLVM back end would expect those to be in the "normal" platform order, while code compiled by rustc using the cranelift back end would expect them to be the "reverse" order.
One option I'm thinking of would be to actually implement both methods in the cranelift back end. Specifically, the back end could support both a "vector big-endian" and "vector little-endian" mode, where the "big-endian" mode would use lane numbers directly as defined by our ISA, while the "little-endian" mode would use the reverse ordering implemented by the current back end code.
There's a slight complication in that we might have to support both big- and little-endian vector modes in load and store operations accessing any combination of big- and little-endian memory locations. But that should be possible:
(Starting with
z15we actually can implement each of these operations using a single instruction, so that should also be efficient.)The one remaining question is, how to select between the "vector big-endian" and "vector little-endian" modes? There are no attributes (like
MemFlags) on theextractlaneetc. operations, and that wouldn't even make sense: this is a global property, if you used big-endian mode to load the vector you must also use big-endian mode on all subsequent operations.So I guess this would have to be some sort of global flag, which wasmtime would set to always little-endian, and rustc would leave at native byte order. Of course, this flag is actually ABI changing, so the same setting must be used for code to remain link-compatible. But I think given the above use cases, that should actually be fine.
FYI @cfallin - this is another problem I ran into while enabling
rustc_codegen_cranelift- I'd appreciate any comments or suggestions!