Skip to content

Multi-register value support: framework for Values wider than machine registers.#2538

Merged
cfallin merged 1 commit into
bytecodealliance:mainfrom
cfallin:multi-reg-framework
Jan 6, 2021
Merged

Multi-register value support: framework for Values wider than machine registers.#2538
cfallin merged 1 commit into
bytecodealliance:mainfrom
cfallin:multi-reg-framework

Conversation

@cfallin
Copy link
Copy Markdown
Member

@cfallin cfallin commented Jan 4, 2021

This will allow for support for I128 values everywhere, and I64
values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the
machine backends to build such support; it just adds the framework for
the MachInst backends to reason about a Value residing in more than
one register.

This is a finalized version of the framework part of the draft PR (#2504); I128
operator implementations will come in a followup PR.

@cfallin
Copy link
Copy Markdown
Member Author

cfallin commented Jan 4, 2021

CI error seems unrelated (perhaps caused by recent Rust 1.49 release?):

 note: rust-lld: error while loading shared libraries: libLLVM-11-rust-1.49.0-stable.so: cannot open shared object file: No such file or directory

@github-actions github-actions Bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen labels Jan 4, 2021
Comment thread cranelift/codegen/src/machinst/valueregs.rs Outdated
@cfallin cfallin force-pushed the multi-reg-framework branch 2 times, most recently from 4aef54e to 6f07907 Compare January 4, 2021 20:59
Copy link
Copy Markdown
Member

@bnjbvr bnjbvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is a lot of work! Quite mechanical in the lower.rs files, which is satisfying.
I'm a tiny bit worried about the overhead (compile-time) cost this will add to the "general" case, where we're in fact using one real reg for each SSA value (i.e. not using i128 on x64, etc.). In this case, it seems there will be a lot of additional loops over a ValueRegs of size 1. Maybe the branch predictors will be happy to deal with those, but the loop setups themselves could incur a visible cost too. Instead of me wild-guessing: do you have any rough performance numbers (esp. count of instructions, branch cache misses) before/after this patch, for the single-realreg-per-value case?

Related to one proposal/question I formulated below, and if the performance overhead justifies it, here's a design question: instead of having all the machinery be constructed on top of ValueRegs, could we split helpers in two ways:

  • have one variant for ValueRegs, specially used to handle the multi-regs-per-value case
  • have one variant for single RealRegs, since this is likely to be a frequent case?

This means we'd need then to consider single regs vs multi-regs-per-value differently during lowering, but my understanding is that this should be the case already. E.g. for add i128, we'd use a different open-coded sequence of instructions that the non-i128 add wouldn't use. Thoughts?

If the above suggestion doesn't make sense, and the overhead is negligible or if it can be countered with minimal effort: approving as is. Otherwise, if it makes sense or the PR significantly changes, I'd be happy to take another look! Thanks.

Comment thread cranelift/codegen/src/machinst/valueregs.rs
Comment thread cranelift/codegen/src/machinst/valueregs.rs
Comment thread cranelift/codegen/src/machinst/valueregs.rs Outdated
Comment thread cranelift/codegen/src/machinst/valueregs.rs Outdated
Comment thread cranelift/codegen/src/machinst/lower.rs Outdated
Comment thread cranelift/codegen/src/isa/x64/inst/mod.rs
Comment thread cranelift/codegen/src/isa/x64/lower.rs Outdated
let tmp1 = ctx.alloc_tmp(RegClass::I64, types::I64);
let tmp2 = ctx.alloc_tmp(RegClass::I64, types::I64);
let cst = ctx.alloc_tmp(RegClass::I64, types::I64);
let tmp1 = ctx.alloc_tmp(types::I64).only_reg().unwrap();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design thought: would it make sense to have a alloc_singlereg_tmp helper that avoids the constructing of an Option and unwrapping it, and returns only a single reg without using the ValueRegs construct? This line being repeated so many times makes me wary of the penalty cost of this multi-regs-per-value machinery for the general case where it's unused...

Comment thread cranelift/codegen/src/isa/aarch64/inst/args.rs Outdated
Comment thread cranelift/codegen/src/isa/aarch64/inst/mod.rs
@cfallin cfallin force-pushed the multi-reg-framework branch 2 times, most recently from ea25756 to d67876e Compare January 6, 2021 01:30
@cfallin
Copy link
Copy Markdown
Member Author

cfallin commented Jan 6, 2021

Thanks for the detailed review!

One top-level point about special-casing the one-reg case vs. generalizing with ValueRegs everywhere: I did a quick measurement of compilation instruction count before/after this PR (clif-util wasm --target x86_64 bz2.wasm) as measured by perf stat:

Before:
     1,960,194,985      instructions              #    1.99  insn per cycle
     1,956,941,087      instructions              #    1.98  insn per cycle
     1,944,661,945      instructions              #    1.95  insn per cycle

After:
     1,951,673,951      instructions              #    1.98  insn per cycle
     1,951,074,339      instructions              #    1.98  insn per cycle
     1,945,951,877      instructions              #    1.98  insn per cycle

In other words, it seems to be in the measurement noise. I sort-of expected this given that (on the 64-bit platform case) we turned a 32-bit Reg into a 64-bit (Reg, Reg) (basically) with a bunch of nonzero checks of the top half and almost-never-taken branches.

Given that tiny/no overhead, I would strongly prefer to not special-case the one-register case, as it adds significant complexity; IMHO we're starting to get a slight case of "too much nested-if disease" in some places and I'd really prefer to keep just one code path where we can :-)

Anyway, lots of updates here so PTAL again if you like!

… regs.

This will allow for support for `I128` values everywhere, and `I64`
values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the
machine backends to build such support; it just adds the framework for
the MachInst backends to *reason* about a `Value` residing in more than
one register.
@cfallin cfallin force-pushed the multi-reg-framework branch from d67876e to 6eea015 Compare January 6, 2021 01:45
@cfallin
Copy link
Copy Markdown
Member Author

cfallin commented Jan 6, 2021

Just spoke with @bnjbvr and will go ahead and merge this with existing approval (thanks!).

@cfallin cfallin merged commit f579d08 into bytecodealliance:main Jan 6, 2021
@cfallin cfallin deleted the multi-reg-framework branch January 6, 2021 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants