Draft: I128 support (partial) on x64.#2504
Conversation
|
For context, I should also note: this ( |
93da5e7 to
90fea55
Compare
bjorn3
left a comment
There was a problem hiding this comment.
❤️ ❤️ ❤️
To avoid duplication of the lowering logic between backends maybe it would be possible to have a kind of template clif ir that can be instanced during lowering. Such a template would use specific Values for inputs and outputs. The rest are temporary variables. This could be faster than legalization as it doesn't require modifying the input clif ir. It would also allow backends to chose between multiple templates when there are different ways to lower an instruction with different tradeoffs.
|
|
||
| if op != Opcode::Imul { | ||
| // add, sub, and, or, xor: just do ops on lower then upper half. Carry-flag | ||
| // propagation is implicit (add/adc, sub/sbb). |
There was a problem hiding this comment.
Can regalloc insert an instruction in between?
There was a problem hiding this comment.
Regalloc is specifically disallowed from inserting any instructions that change the condition codes, for exactly this reason. This is specified in comments in the regalloc.rs interface.
|
@cfallin As a general comment, I see the dilemma. I haven't yet read the patch, but wanted to ask: part of the motivation for going this route was that we would then be able to remove the complexity, the semantic murkyness, and the potential for generating poor code, caused by having to explicitly represent the carry flag in CLIF. If we revert to expanding out doubleword arithmetic at the CLIF level, is there some other way we can avoid those disadvantages? Also .. does it matter that, doing this per-target (per this patch) potentially makes it possible to use target-specific sequences, which wouldn't be possible if it were done at the CLIF level? |
| RegMemImm::reg(regs::rdx()), | ||
| dst.regs()[1], | ||
| )); | ||
| } |
There was a problem hiding this comment.
@cfallin Is it really so bad that we have to duplicate these circa 50 lines per-target, to do mul.i128? It would be even shorter if rustfmt didn't insist on laying out the calls in this space-inefficient way.
There will be roughly equivalent length sequences for 128-bit left/right shifts, and for 128-bit comparisons. For 128-bit division, we'll have to call a helper on all targets. On 32-bit targets, we'll probably have to use a helper for multiplies, shifts and comparisons too.
For example, doubleword comparison on x64 is a bit elaborate, but for aarch64 it's just |
|
I just tested this with cg_clif. As far as I can see this works fine, but there are still missing pieces both related and unrelated to 128bit int support which mean that I can't completely test this yet. support necessary in any case:
support necessary to not require modifications of cg_clif:
|
|
One thing that does concern me is I had wondered if such generality is really necessary. Given that -- IIUC -- all the result virtual register fields are allocated before isel starts, would it be adequate to have an representation which specifies a base virtual register, plus zero, one or two subsequently-numbered vregs of the same class? Then that would fit in 34 bits (hence, a Even if that won't work, does |
| } | ||
|
|
||
| /// Return the single register used for this value, if any. | ||
| pub fn only_reg(self) -> Option<R> { |
There was a problem hiding this comment.
Could this be a Result with something like struct OnlyRegError; as error type? That would give a better panic message when unwrapping.
bjorn3
left a comment
There was a problem hiding this comment.
Calling functions with 128bit arguments isn't implemented.
|
So I'm starting to warm up to this approach relative to falling back on existing legalizations; it's really not too much more work to get to the end-point and the legalization is pretty hairy too. I'm working on the additional ops now; hopefully ready soon! To a few design-question points:
I played with this a bit with Compiler Explorer (link) -- it seems it's passed by reference as an arg, but returned in two regs (
Tempting idea! This doesn't handle the real-register case though (needed at least for ABI specs). I suppose we could special-case that but then we bifurcate some other APIs (gen load/spill, move, ...). I'm thinking that we might actually be able to optimize in another (much simpler) way -- on 64-bit machines and with our largest types being 128-bit, we can get away with the 1 / 2-reg cases. Perhaps we could define a static constant for size and, with some config magic, select
Yes, I did this to try to keep to the spirit of |
|
@cfallin Great work, btw! I should have said that earlier. |
|
Just added implementations for (I think) all of the relevant |
7b10317 to
1691c20
Compare
|
@bjorn3 I just added what I think might be a sufficient implementation of ELF TLS support (basically just copying your recipe in the old x86 backend), and also permitted Let me know if this is sufficient to get cg_clif working! If so, I'll clean this up a bit and split out some pieces to actually land it. |
9dc8153 to
75df0d4
Compare
| | &ir::ArgumentPurpose::CalleeTLS | ||
| | &ir::ArgumentPurpose::StructReturn => {} | ||
| | &ir::ArgumentPurpose::StructReturn | ||
| | &ir::ArgumentPurpose::StructArgument(_) => {} |
There was a problem hiding this comment.
StructArgument indicates that the argument is a pointer to a piece of memory with the given size that needs to be passed as on the stack in the arguments area. On the caller side you will need to memcpy it to the right place, on the callee side you need to get the address.
There was a problem hiding this comment.
OK, yes, this is making more sense now. Do you think you would be up for attempting an implementation? (My bandwidth is somewhat stretched thin at the moment, but I can come back to this at ... some point, eventually, if needed. Incidentally this is similar to what I think we also need for Windows fastcall, which is one of the other remaining new-backend TODOs...).
| | &ir::ArgumentPurpose::CallerTLS | ||
| | &ir::ArgumentPurpose::CalleeTLS | ||
| | &ir::ArgumentPurpose::StructReturn => {} | ||
| | &ir::ArgumentPurpose::StructReturn |
There was a problem hiding this comment.
The old backend didn't handle this either, but the system-v abi defines a specific register for this argument. In addition kn x86_64 at least I think you also need to regurn this value in a different register. (I guess to reduce the need for saving it on the caller side.)
| use regalloc::{RealReg, Reg, VirtualReg, Writable}; | ||
| use std::fmt::Debug; | ||
|
|
||
| #[cfg(feature = "arm32")] |
There was a problem hiding this comment.
Would it be preferable here to key it on the word size of the target that the resulting CL will be compiling for? Is that even possible, given the limitations of the Rust config etc system?
There was a problem hiding this comment.
Such an option does exist, but the wrench in the works is cross-compilation -- e.g. clif-utils is normally built with all targets enabled (so we need to support arm32 compilation even on an x64 host, etc).
There was a problem hiding this comment.
(Ah, sorry, actually I misread your comment -- pointer size of target, not host; I don't think we have a config option for that as we have custom features for each backend.)
Almost. The standard library compiles with the above TLS fix, but std_example fails at several points:
|
This comment has been minimized.
This comment has been minimized.
|
@bjorn3 I just pushed a number of fixes/updates; this I think addresses everything above except for the I'm not sure if the latter has to do with the former; it seems that the u128->f32 conversion is a call into libgcc so I suspect it's an ABI issue rather than an i128 arithmetic issue. (Thank goodness for that; I started to work out the open-coded sequence for each of the u128 to/from f32/f64 cases, signed/unsigned, and quickly ran away...) I won't have more time to look at this for a bit (I'm mostly away until after the new year) but if you have any further debugging insights or want to try to implement the (Also it mostly goes without saying but since this PR has grown quite a bit beyond just i128 arithmetic, I plan to split it up into proper pieces for separate review once we confirm that cg_clif works.) |
On my system it calls into rust's compiler-builtins.
That and use the correct register. It isn't very important though as the old backend didn't implement it either. |
|
Minimal repro: fn convert(i: u128) -> f32 {
if i == 0 {
// this branch is taken
return 0.0;
}
extract_sign(i); // except when this call is removed
100.0
}
fn main() {
assert_eq!(convert(100u128), 100.0);
}
fn extract_sign(i: u128) {}Assembly of 00000000000f4af5 <_ZN11std_example7convert17ha5c22c0f04088fadE>:
f4af5: 55 push %rbp
f4af6: 48 89 e5 mov %rsp,%rbp
f4af9: 48 83 ff 00 cmp $0x0,%rdi
f4afd: 40 0f 94 c0 sete %al
f4b01: 48 83 fe 00 cmp $0x0,%rsi
f4b05: 40 0f 94 c1 sete %cl
f4b09: 48 21 c1 and %rax,%rcx
f4b0c: 0f 84 08 00 00 00 je f4b1a <_ZN11std_example7convert17ha5c22c0f04088fadE+0x25>
f4b12: 0f 57 c0 xorps %xmm0,%xmm0
f4b15: 48 89 ec mov %rbp,%rsp
f4b18: 5d pop %rbp
f4b19: c3 retq
f4b1a: e8 a1 01 00 00 callq f4cc0 <_ZN11std_example12extract_sign17h8b332672bd416c0aE>
f4b1f: be 00 00 c8 42 mov $0x42c80000,%esi
f4b24: 66 0f 6e c6 movd %esi,%xmm0
f4b28: 48 89 ec mov %rbp,%rsp
f4b2b: 5d pop %rbp
f4b2c: c3 retq |
|
@bjorn3 thanks for that repro -- the disassembly made it pretty clear what was wrong: brz/brnz on an i128 used an incorrect width of and/or instruction. Pushed a fix just now. Locally, (Also, in case it's relevant, I set In any case, I'm happy that this seems to allow cg_clif to pass the tests in |
🎉
Can you add |
|
I tried building simple-raytracer, and the I tracked this back to a nonsense pointer passed into (I implemented |
|
This is a proc macro that has an ABI incompatibility with the cg_llvm comliled rustc due to |
|
@bjorn3 I went ahead and implemented There do seem to be a few last test failures later in the I'll clean up this mess of a draft branch when I'm actually "working" again on Jan 4 :-) |
05a26c2 to
394e028
Compare
|
Last two fixes pushed: All stdlib tests pass now! I really will close my laptop now and read some books until I clean this up in the new year :-) |
|
Yay! I hope you will have a great New Year! |
That's surely a case of the general redundant-REX (0x40) rule, or in this case the lack of 0x40? Does the remainder go in DL if there's an 0x40 prefix? |
Sadly no, a REX prefix doesn't change anything -- it's just the way the 8-bit variant is defined (reference). A likely remnant of the 8086 days when AL and AH were separately-useful registers for byte-sized values... (Edit: here's another reference showing more -- a REX byte will change whether the divisor arg refers to low byte of one of the 16 x86-64 registers, or a low/high byte of one of the 8 original registers; but remainder always goes into AH.) |
899b180 to
fdb8b9f
Compare
… regs. This will allow for support for `I128` values everywhere, and `I64` values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the machine backends to build such support; it just adds the framework for the MachInst backends to *reason* about a `Value` residing in more than one register.
This implements all of the ops on I128 that are implemented by the legacy x86 backend, and includes all that are required by at least one major use-case (cg_clif rustc backend). The sequences are open-coded where necessary; for e.g. the bit operations, this can be somewhat complex, but these sequences have been tested carefully. This PR also includes a drive-by fix of clz/ctz for 8- and 16-bit cases where they were incorrect previously.
This follows the implementation in the legacy x86 backend, including hardcoded sequence that is compatible with what the linker expects. We could potentially do better here, but it is likely not necessary. Thanks to @bjorn3 for a bugfix to an earlier version of this.
The StructReturn ABI is fairly simple at the codegen/isel level: we only need to take care to return the sret pointer as one of the return values if that wasn't specified in the initial function signature. Struct arguments are a little more complex. A struct argument is stored as a chunk of memory in the stack-args space. However, the CLIF semantics are slightly special: on the caller side, the parameter passed in is a pointer to an arbitrary memory block, and we must memcpy this data to the on-stack struct-argument; and on the callee side, we provide a pointer to the passed-in struct-argument as the CLIF block param value. This is necessary to support various ABIs other than Wasm, such as that of Rust (with the cg_clif codegen backend).
When using an 8-bit register in 64-bit mode on x86-64, the REX prefix semantics are somewhat subtle: without the REX prefix, register numbers 4--7 correspond to the second-to-lowest byte of the first four registers (AH, CH, BH, DH), whereas with the REX prefix, these register numbers correspond to the usual encoding (SPL, BPL, SIL, DIL). We could always emit a REX byte for instructions with 8-bit cases (this is harmless even if unneeded), but this would unnecessarily inflate code size; instead, the usual approach is to emit it only for these registers. This logic was present in some cases but missing for some other instructions: divide, not, negate, shifts. Fixes bytecodealliance#2508.
The implementations of several FP ops, such as fabs/fneg, used SSE instructions. This is not a problem per-se, except that load-op merging did not take *alignment* into account. Specifically, if an op on an f64 loaded from memory happened to merge that load, and the instruction into which it was merged was an SSE instruction, then the SSE instruction imposes stricter (128-bit) alignment requirements than the load.f64 did. This PR simply forces any instruction lowerings that could use SSE instructions to implement non-SIMD operations to take inputs in registers only, and avoid load-op merging. Fixes bytecodealliance#2507.
fdb8b9f to
21b16cf
Compare
|
Did some rebase/squash surgery on this branch -- I will split it into I think 8 PRs:
Several of the above are dependent on the multi-reg refactor but the rest should be reviewable in parallel. |
- urem/srem.i8: the 8-bit form of the DIV instruction on x86-64 places the remainder in AH, not RDX, different from all the other width-forms of this instruction. - select.b1: we were not recognizing selects of boolean values as integer-typed operations, so we were generating XMM moves instead (!). With these two changes on the i128 ops / etc branch, rustc_codegen_clif passes all tests!
21b16cf to
8de8950
Compare
|
Closing this draft PR now that the real PRs resulting from it are under review. |
This PR generalizes all of the
MachInstframework to reason about SSAValues as being located in multiple registers (one, two or four, currently, in an efficient packed form). This is necessary in order to handle CLIF with values wider than the machine width (I128on 64/32-bit machines andI64on 32-bit machines), unless we legalize it beforehand.It also adds support for some basic 128-bit ALU ops to the x64 backend, loewring these directly to open-coded instruction sequences (add/adc, sub/sbb, etc.).
@julian-seward1 and @bnjbvr: this is the approach we had discussed a long time ago (in January, I think!). It follows from the "every backend accepts the same IR" philosophy, with the idea that maybe we will get to the point where legalization is largely not necessary.
However: I must say that I'm not super-happy with the level of complexity this has added to the framework. The fact that the work we do in the x64 backend to support this will have to be repeated on aarch64 is kind of unfortunate; and this all feels somewhat silly given that we still have the legalization framework's narrowing support, and could use that instead. Philosophically, I think that legalization is actually the right approach here: we should be able to factor out "general machine-independent algorithm for 128-bit multiply with 64-bit pieces" from the specific machine backends.
So I'm inclined not to go in this direction, but (i) want to see what thoughts anyone might have, and (ii) save this for the record, in case we wire up the old legalizations for now but reconsider or need multi-reg values in the future.