Further Improve Execution Speed with Explicit Bounds Checks #6094

@fitzgen

I've been digging into Cranelift's and Wasmtime's code quality and Wasm execution throughput when "dynamic memories" with explicit bounds checks are used to implement Wasm linear memories. Here I'm summarizing the progress made thus far and laying out what I believe the next steps to continue this effort should be.

Static memories require large reservations of virtual memory, to the extent that it can become a bottleneck for how many Wasm instances can exist concurrently in a system. Dynamic memories, on the other hand, require very little or even zero virtual memory reservations (depending on configuration) beyond the actually resident pages backing the Wasm linear memory itself. Additionally, when dynamic memories are configured without any guard regions, they incur zero virtual memory-related syscalls on instantiation and teardown (modulo any copy-on-write heap image initialization). These syscalls have been observed to be bottlenecks in concurrent systems due to VMA locks taken in the kernel.

Furthermore these issues will only compound in the future. Right now, each tenant in a multi-tenant system is a single core Wasm module with a single Wasm linear memory. With the component model, a tenant will instead be a Wasm component that is made up of many core Wasm modules, each with their own linear memories. If the average Wasm component contains five linear memories, then the multi-tenant system can only support one fifth as many concurrent tenants under the component model, and instantiating and tearing down each tenant's service will take five times as many syscalls as it does today.

So there is a lot of pressure towards implementing Wasm linear memories with dynamic memories. The downside is that dynamic memories currently incur roughly 55% overhead to Wasm execution throughput compared to static memories. That is, a Wasm computation that takes 1 second in a static memory will take about 1.55 seconds in a dynamic memory. This is because dynamic memory accesses must be explicitly bounds checked rather than caught after the fact via page faults like out-of-bounds static memory accesses. The goal of the effort described here is to minimize this overhead.

I don't intend to continue this work in the immediate future, so I wanted to make sure everything was summarized and documented for posterity, as well as written down while it is still fresh in my mind. Additionally I have proposals for new optimization passes we should implement in the future, and this issue can serve as a forum for discussing them.

Progress Made Thus Far

Before I began this effort, running the spidermonkey.wasm benchmark in Sightglass with static memories was 3.06x faster than doing the same with dynamic memories. Now we are down to static memories being only 1.55x faster.

This improvement comes from a number of sources.

Additionally, I created the wasmtime explore subcommand to dig into native code disassemblies and associate each native instruction with the Wasm bytecode it implements (similar to the Godbolt Compiler Explorer). This new tool has already been incredibly helpful for analyzing what room for improvement still exists, and should help future optimization efforts in Cranelift and Wasmtime.

Recent Sightglass Benchmark Results

Here are some recent Sightglass benchmark results comparing static memories (static.so) versus dynamic memories with a small (64KiB) guard region (dyn-with-guard.so) versus dynamic memories without a guard region (dyn-without-guard.so).

static.so vs dyn-with-guard.so

Details
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 517183665.48 ± 6392410.71 (confidence = 99%)

  static.so is 1.55x to 1.56x faster than dyn-with-guard.so!

  [1420787062 1450201603.68 1522542159] dyn-with-guard.so
  [914013248 933017938.20 1036142782] static.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 45802386.02 ± 1486278.64 (confidence = 99%)

  static.so is 1.42x to 1.45x faster than dyn-with-guard.so!

  [146742085 150423911.30 159078923] dyn-with-guard.so
  [100319894 104621525.28 149448856] static.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 3014103.57 ± 316843.43 (confidence = 99%)

  static.so is 1.32x to 1.40x faster than dyn-with-guard.so!

  [10662276 11365102.25 16630094] dyn-with-guard.so
  [7683153 8350998.68 13449448] static.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 45130213.24 ± 4838200.92 (confidence = 99%)

  static.so is 1.18x to 1.22x faster than dyn-with-guard.so!

  [249993052 271654226.14 316970220] dyn-with-guard.so
  [208845361 226524012.90 290090035] static.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 53326476.11 ± 5528961.43 (confidence = 99%)

  static.so is 1.16x to 1.19x faster than dyn-with-guard.so!

  [330373001 356824375.72 397185785] dyn-with-guard.so
  [279072702 303497899.61 348984338] static.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1163853854.64 ± 81595847.03 (confidence = 99%)

  static.so is 1.15x to 1.18x faster than dyn-with-guard.so!

  [8097294016 8213392062.47 10103274666] dyn-with-guard.so
  [6969056325 7049538207.83 8442021515] static.so

static.so vs dyn-without-guard.so

Details
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 594932587.44 ± 6811906.18 (confidence = 99%)

  static.so is 1.63x to 1.65x faster than dyn-without-guard.so!

  [1494952789 1524945424.11 1606143187] dyn-without-guard.so
  [908877847 930012836.67 980176520] static.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 49482140.46 ± 2175287.73 (confidence = 99%)

  static.so is 1.45x to 1.49x faster than dyn-without-guard.so!

  [149370743 154540290.45 163713075] dyn-without-guard.so
  [99902180 105058149.99 163151023] static.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 54959651.89 ± 5371934.09 (confidence = 99%)

  static.so is 1.22x to 1.26x faster than dyn-without-guard.so!

  [258958338 282812487.71 327560254] dyn-without-guard.so
  [210542623 227852835.82 315990395] static.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1592325533.89 ± 100212378.19 (confidence = 99%)

  static.so is 1.21x to 1.24x faster than dyn-without-guard.so!

  [8494305552 8674966162.14 10607319542] dyn-without-guard.so
  [6985640940 7082640628.25 8405451996] static.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 66518323.42 ± 5778757.98 (confidence = 99%)

  static.so is 1.20x to 1.24x faster than dyn-without-guard.so!

  [342947509 371023668.35 415514806] dyn-without-guard.so
  [277471563 304505344.93 339940435] static.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 1217702.33 ± 627813.45 (confidence = 99%)

  static.so is 1.06x to 1.18x faster than dyn-without-guard.so!

  [11111517 11632130.88 14705814] dyn-without-guard.so
  [7682848 10414428.55 14122261] static.so

Comparing Wasmtime and SpiderMonkey

To get a sense of whether our 1.55x slowdown when enabling explicit bounds checks was typical of other engines, I also benchmarked SpiderMonkey (native) running Sightglass's spidermonkey.wasm benchmark with virtual memory guard pages vs with explicit bounds checks.

execution :: milliseconds :: spidermonkey.wasm

  Δ = 163.62 ± 5.57 (confidence = 99%)

  virtual-memory-spidermonkey is 1.52x to 1.56x faster than bounds-checks-spidermonkey!

  [446 465.49 520] bounds-checks-spidermonkey
  [290 301.87 353] virtual-memory-spidermonkey

Because SpiderMonkey's slowdown is almost identical to Wasmtime's, we can infer that there aren't any critical bounds-checking optimizations that SpiderMonkey performs but Wasmtime/Cranelift does not. This is reassuring: it means we can basically ignore other engines (or at least SpiderMonkey) for now, when trying to further improve codegen when explicit bounds checks are enabled, and focus only on ourselves.

Proposed Future Work

What fruit remains to be harvested? I've been digging into spidermonkey.wasm's hot functions (as reported by perf on Linux) and staring at disassemblies to get an idea of how much room for improvement we still have. In short: a lot!

I found many sequences of back-to-back memory accesses in hot functions, often with

  • the same dynamic index,
  • different static offsets, and
  • ordered such that accesses with larger static offsets came first (or at least early) within the sequence of accesses.

This is great because it is the next-easiest pattern to optimize for (after sequences of identical accesses, which we can already optimize well).

Even when the accesses weren't roughly sorted from largest to smallest static offset, the offsets were always small enough to be covered by a small guard region. This lets us bounds check index > bound instead of index + offset > bound, which allows us to GVN these comparisons even when their static offsets aren't in order. After that, all we are left with are redundant select_spectre_guards, whose elimination I describe in a moment.
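As a sanity check of that guard-region argument, here is a small Python model (the names and boundary conventions, like GUARD_SIZE and relaxed_check_passes, are illustrative, not Cranelift code). The point is that once the check no longer mentions the static offset, the same comparison is emitted for every access with the same dynamic index, and GVN can deduplicate it:

```python
GUARD_SIZE = 64 * 1024  # a small 64 KiB guard region, as in dyn-with-guard.so

def relaxed_check_passes(index: int, bound: int) -> bool:
    # The per-access check becomes `index > bound` (trap if true),
    # independent of the static offset, so GVN can deduplicate it
    # across accesses that share the same dynamic index.
    return index <= bound

def lands_in_heap_or_guard(index: int, offset: int, bound: int) -> bool:
    # Any address reached with `offset < GUARD_SIZE` after a passing
    # relaxed check is covered by heap-plus-guard, so an out-of-bounds
    # access faults in the guard region rather than escaping.
    return index + offset < bound + GUARD_SIZE

# Exhaustively-flavored spot check: whenever the relaxed check passes,
# every access with a sub-guard-sized offset is contained.
bound = 10 * 65536
for index in (0, bound // 2, bound):
    for offset in (0, 8, 4096, GUARD_SIZE - 1):
        if relaxed_check_passes(index, bound):
            assert lands_in_heap_or_guard(index, offset, bound)
```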

For example, I found a back-to-back sequence of nine stores that all have the same dynamic index operand but different static offset immediates, all in the same basic block:

local.get 2
...
i64.store offset=368
local.get 2
...
i64.store offset=352
local.get 2
...
i32.store offset=336
local.get 2
...
i32.store offset=328
local.get 2
...
i32.store offset=320
local.get 2
...
i32.store offset=312
local.get 2
...
i32.store offset=304
local.get 2
...
i32.store offset=296
local.get 2
...
i32.store offset=360

Only the first store in this sequence needs to be bounds checked! All the following stores' bounds checks are implied by the first one, yet today we emit a bounds check for every single one of them.
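A tiny Python sanity check of that implication, using the offsets from the store sequence above (the in_bounds helper is illustrative): since the first store's offset, 368, is the largest, its check subsumes all the rest.

```python
# Offsets from the nine back-to-back stores above, in program order.
offsets = [368, 352, 336, 328, 320, 312, 304, 296, 360]

def in_bounds(index: int, offset: int, bound: int) -> bool:
    # Simplified model of the `index + offset > bound` check
    # (True means the access passes, i.e. does not trap).
    return index + offset <= bound

# If the check at the largest offset passes, every other check in the
# sequence passes too, so only the first store needs a bounds check.
bound = 1 << 20
for index in range(0, bound - 368, 4096):
    if in_bounds(index, max(offsets), bound):
        assert all(in_bounds(index, k, bound) for k in offsets)
```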

Based on this investigation, I suggest we implement two new optimization passes:

  1. a select_spectre_guard-elimination pass,
  2. and a basic constraint-elimination pass.

select_spectre_guard-Elimination Pass

We should develop an optimization pass to remove redundant select_spectre_guard instructions.

This optimization pass would walk the dominator tree and maintain the set of select_spectre_guard conditions in scope that we've already tested and whose results have flowed into an instruction that will raise a trap if the select_spectre_guard failed. That is, maintain the set of xs for all sequences like

load(select_spectre_guard(x, 0, _))

that dominate the current location in our traversal.

Then, whenever we see a new select_spectre_guard(x, 0, y), we can rewrite it to simply y, since the dominating load(select_spectre_guard(x, 0, _)) would already have raised a trap if x were truthy; control flow reaching this location therefore means that x must not be truthy, and so select_spectre_guard(x, 0, y) will always evaluate to y.

This does not break the speculative sandboxing function of the optimized-away select_spectre_guards because there will always still be at least one dominating select_spectre_guard for the same condition.
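As a toy illustration (not Cranelift's actual data structures or pass infrastructure), here is a Python sketch of the rewrite over a single basic block, where domination reduces to program order; the tuple "guarded_load" stands in for a load(select_spectre_guard(x, 0, addr)) sequence:

```python
def eliminate_redundant_guards(instrs):
    # Conditions x for which a dominating load(select_spectre_guard(x, 0, _))
    # has already executed: reaching this point means x was false.
    proven_false = set()
    out = []
    for inst in instrs:
        op = inst[0]
        if op == "select_spectre_guard":
            _, cond, if_true, if_false = inst
            if cond in proven_false and if_true == 0:
                # A dominating guarded load already trapped if `cond`
                # held, so this guard always selects `if_false`.
                out.append(("copy", if_false))
                continue
        if op == "guarded_load":
            # load(select_spectre_guard(cond, 0, addr)): after this
            # instruction, control flow implies `cond` was false.
            _, cond, addr = inst
            proven_false.add(cond)
        out.append(inst)
    return out

before = [
    ("guarded_load", "x", "v3"),
    ("select_spectre_guard", "x", 0, "v8"),
]
assert eliminate_redundant_guards(before) == [
    ("guarded_load", "x", "v3"),
    ("copy", "v8"),
]
```

Note that the real pass would walk the dominator tree rather than a single block, and would need to reason about the load/store actually consuming the guard's result, as described above.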

This optimization pass should integrate with the e-graphs framework in the same way that our alias analysis optimization pass does. It should expose a "push"-based interface, where blocks and instructions are fed into the analysis by an external traversal, rather than having the pass walk the IR itself. This will allow us to fuse the pass with existing IR traversals.

Creating this new optimization pass is not super easy, so if we wanted to get an idea of what kind of speed ups it would give, we can implement the much easier #6055, which would give us the non-Spectre equivalent of this optimization pass. The amount that implementing #6055 speeds up benchmarks in an environment configured with Spectre mitigations disabled is roughly what we could expect from this new optimization pass with Spectre mitigations enabled.

A Potential Variation

This analysis is slightly subtle in that it requires chaining a few things together (loads/stores that consume a select_spectre_guard that in turn has a condition x). In general, the fewer links in the chain, the simpler the analysis's implementation will be and the more confidence we can have in its correctness.

To that end, we could consider having spectre_load(oob_condition, addr) and spectre_store(oob_condition, addr, value) instructions that are morally equivalent to load(select_spectre_guard(oob_condition, 0, addr)) and similar for stores, but which do not get legalized in the mid-end and instead go all the way to our backends. By fusing the instructions, we remove one link in the proof chain that the analysis has to create, making it simpler and more likely to be correct.

The downside is that we don't only use select_spectre_guard with plain load and store CLIF instructions: there are {u,s}{load,store}{8,16,32} instructions, SIMD load instructions, atomic access instructions, etc., which would all either need Spectre-fused variants or have to continue using the old select_spectre_guard (and presumably not reap the benefits of this simpler analysis). I've filed #6056 to help consolidate some of these instructions, which would slightly alleviate this downside, but not fully address it. Because of that, I'm thinking that this approach with Spectre-fused instructions is probably not worth the effort.

Basic Constraint-Elimination Pass

We should develop a very basic constraint-elimination optimization pass targeted at bounds checks for 32-bit Wasm memories on 64-bit hosts.

This would not be a fully general constraint-elimination pass, like what LLVM has recently added. Instead it would focus only on comparisons of roughly the form index + constant > heap_bound.

Consider this example, with constants X and Y:

v3 = index + X > heap_bound
...
v6 = index + Y > heap_bound

When X > Y, then v3 implies v6, and we can rewrite the example into the following:

v3 = index + X > heap_bound
...
v6 -> v3

The catch is that this is only correct if index + X does not overflow. Otherwise, it is possible that index + X is "logically out of bounds" but because it overflowed and wrapped around, it ended up back in bounds again. In this scenario, Y might not be large enough for index + Y to overflow, and index + Y could remain out of bounds.
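Here is a concrete instance of that pitfall in Python, modeling wrapping 64-bit arithmetic (the specific constants are illustrative):

```python
MASK = (1 << 64) - 1  # wrapping 64-bit arithmetic
bound = 100

index = (1 << 64) - 8  # enormous, clearly out-of-bounds index
X, Y = 16, 4           # X > Y

# `index + X` wraps around to 8, so the check `index + X > bound`
# spuriously passes, even though the access is logically out of bounds.
check_x_fails = ((index + X) & MASK) > bound

# `index + Y` does not wrap, so its check correctly fails.
check_y_fails = ((index + Y) & MASK) > bound

assert not check_x_fails  # wrapped back "in bounds"
assert check_y_fails      # still out of bounds: X's check did NOT imply Y's
```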

We can avoid this pitfall by limiting the pass's scope to optimizing bounds checks for only 32-bit memories on 64-bit hosts, which have the following pattern:

uextend(index:i32) + X:i64 > heap_bound:i64

For these 32-bit bounds checks, the constant X fits in 32 bits -- despite being represented as an i64 -- and therefore we are guaranteed that index + X will not overflow. As long as the proposed optimization pass checks this constraint on X when matching that code pattern, it will be sound.
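The no-overflow argument is simple enough to check directly: both the zero-extended index and the offset are at most 2**32 - 1, so their i64 sum is at most 2**33 - 2, nowhere near wrapping.

```python
max_index = (1 << 32) - 1   # uextend.i64 of any i32 index
max_offset = (1 << 32) - 1  # X fits in 32 bits even though stored as i64

# The worst-case sum is 2**33 - 2, which cannot wrap a 64-bit integer,
# so `uextend(index) + X > heap_bound` never suffers the overflow pitfall.
assert max_index + max_offset == (1 << 33) - 2
assert max_index + max_offset < (1 << 64)
```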

This optimization pass would precede the select_spectre_guard-elimination pass described previously, since it deduplicates bounds checks, which then feed into select_spectre_guards, which that pass can then completely remove.

For example, consider this input CLIF for two back-to-back loads:

    ;; First we load the Wasm address `index + X`.

    ;; The `index + X > heap_bound` check.
    v0 = uextend.i64 index
    v1 = iadd_imm v0, X
    v2 = icmp_imm ugt v1, heap_bound
    ;; The native address for the first Wasm load.
    v3 = iadd heap_base, v1
    ;; The Spectre guard for the first Wasm load.
    v4 = select_spectre_guard v2, null, v3
    ;; The first Wasm load itself.
    v5 = load.i32 v4

    ;; Second we load the Wasm address `index + Y`, where `X > Y`.
    v6 = iadd_imm v0, Y
    v7 = icmp_imm ugt v6, heap_bound
    v8 = iadd heap_base, v6
    v9 = select_spectre_guard v7, null, v8
    v10 = load.i32 v9

After running this proposed constraint-elimination pass, we learn that v2 implies v7 because X > Y and so we can rewrite uses of v7 with v2:

    ;; First load (unchanged).
    v0 = uextend.i64 index
    v1 = iadd_imm v0, X
    v2 = icmp_imm ugt v1, heap_bound
    v3 = iadd heap_base, v1
    v4 = select_spectre_guard v2, null, v3
    v5 = load.i32 v4

    ;; Second load (comparison replaced with alias of previous comparison).
    v6 = iadd_imm v0, Y
    v7 -> v2
    v8 = iadd heap_base, v6
    v9 = select_spectre_guard v2, null, v8
    v10 = load.i32 v9

Finally, after running the previously described select_spectre_guard-elimination pass, we can completely remove the second select_spectre_guard bounds check since it is identical to and dominated by the first:

    ;; First load (unchanged).
    v0 = uextend.i64 index
    v1 = iadd_imm v0, X
    v2 = icmp_imm ugt v1, heap_bound
    v3 = iadd heap_base, v1
    v4 = select_spectre_guard v2, null, v3
    v5 = load.i32 v4

    ;; Second load (all bounds checks removed).
    v6 = iadd_imm v0, Y
    v7 -> v2
    v8 = iadd heap_base, v6
    v9 -> v8
    v10 = load.i32 v8

This final code is the optimal implementation of the original input.

This optimization pass should also integrate with the e-graphs framework in the same way that the previously described optimization pass would and the way that our alias analysis optimization pass does.

To get an idea of the speed up this optimization can give dynamic memories without any guard pages, we can look at the speed up that #6055 gives dynamic memories with guard pages when Spectre mitigations are disabled. This should be at least a 1.08x speed up, since that is what we saw from deduplicating the bounds-checking comparisons (but not the select_spectre_guards) in #6031; if I were to take a wild guess, it is probably closer to at least 1.15x when paired with removing the no-longer-necessary select_spectre_guards.

Optimizing Memory Accesses with Distinct Dynamic Indices

The examples we've considered thus far have a series of memory accesses that use the same dynamic index, but different static offsets. For example, each of the following loads all have the same local.get 4 dynamic index:

local.get 4
i32.load offset=24
local.get 4
i32.load offset=18
local.get 4
i32.load offset=12

We haven't considered when the memory accesses have different dynamic indices, like in this example:

local.get 4
i32.load
local.get 12
i32.load
local.get 8
i32.load

There are two primary reasons for this:

  1. The latter is much harder to optimize. Determining any kind of relationship between the different indices is hard. Sometimes the second access's index is loaded by the first access (e.g. load(load(...))), which will pretty much defeat any static analysis we throw at it. Even if the indices are each of the form base + a and base + b, and we know that a > b, we have to consider whether overflow occurs in base + a (see the discussion about overflow in the basic constraint-elimination pass section above). Developing a pass to optimize these cases will be difficult, complicated, and prone to unsoundness.

  2. Most of the sequences of back-to-back memory accesses I found in spidermonkey.wasm's hot functions had the same dynamic index. This is great news since these are also the cases we can optimize relatively easily.

To conclude, I don't think it is worth the effort to try to optimize these memory access sequences where distinct dynamic indices are used.


Labels: cranelift (Issues related to the Cranelift code generator), cranelift:goal:optimize-speed (Focus area: the speed of the code produced by Cranelift), cranelift:mid-end (clif-to-clif related passes, legalizations, etc.)
