Add GcRuntime and GcCompiler traits; i31ref support #8196

Merged: fitzgen merged 1 commit into bytecodealliance:main from fitzgen:i31ref on Apr 4, 2024

fitzgen (Member) commented Mar 20, 2024

This is still a WIP. I think the architecture and everything is there, although there are certain things still to improve upon, like the pooling allocator integration and the question of who allocates the memory used in a GC heap, but I think those can happen in follow-up PRs. The big thing is that some tests are still failing, a bunch of new tests need to be written, and at least one blocker (#8180) needs to be fixed. I also need to rebase, which will be fun given all the recent churn related to tables. However, this is big enough that I think those things can happen in parallel with review of the main bits, the architecture, and so on.


### The `GcRuntime` and `GcCompiler` Traits

This commit factors the details of the garbage collector out of the rest of the runtime and the compiler. It does this by introducing two new traits, very similar to a subset of [those proposed in the Wasm GC RFC](https://github.com/bytecodealliance/rfcs/blob/main/accepted/wasm-gc.md#defining-the-pluggable-gc-interface), although not all equivalent functionality has been added yet because Wasmtime doesn't yet support, for example, GC structs:

  1. The `GcRuntime` trait: This trait defines how to create new GC heaps, run collections within them, and execute the various GC barriers the collector requires.

    Rather than monomorphize all of Wasmtime on this trait, we use it as a dynamic trait object. This does imply some virtual call overhead and misses some inlining (and resulting post-inlining) optimization opportunities. However, it is *much* less disruptive to the existing embedder API and results in a cleaner embedder API anyway. We also don't believe that VM runtime/embedder code is on the hot path for working with the GC at this time (that would be the actual Wasm code, which has inlined GC barriers, direct calls, and so on). In the future, once we have optimized enough of the GC that such code does become hot, we have options to avoid these dynamic virtual calls, such as enabling only one collector at build time, creating a static type alias like `type TheOneGcImpl = ...;` based on the compile-time configuration, and using that alias in the runtime instead of a dynamic trait object.

    The `GcRuntime` trait additionally defines a method to reset a GC heap, for use by the pooling allocator. This allows GC heaps to be reused across different stores. This integration is very rudimentary at the moment and is missing all kinds of configuration knobs that we should have before deploying Wasm GC in production, but this commit is large enough as it is already! Ideally, in the future, I'd like GC heaps to receive their memory region rather than allocate/reserve it themselves, and to let each slot in the pooling allocator's memory pool be *either* a linear memory or a GC heap. This would sidestep capacity-planning questions such as "what percent of memory capacity should we dedicate to linear memories vs. GC heaps?". It also seems like basically all the same configuration knobs we have for linear memories apply equally to GC heaps (see also the "Indexed Heaps" section below).

  2. The `GcCompiler` trait: This trait defines how to emit CLIF that implements GC barriers for various operations on GC-managed references. The Rust code calls into this trait dynamically via a trait object, but since the trait customizes the CLIF generated for Wasm code, the Wasm code itself does not make dynamic, indirect calls for GC barriers. A `GcCompiler` implementation can inline the parts of a GC barrier that it believes should be inline and leave out-of-line calls for rare slow paths.
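A minimal sketch of the runtime-side dynamic-dispatch arrangement described in the list above, using hypothetical, heavily simplified names (`GcHeap`, `GcRuntime::new_gc_heap`, `alloc`, `reset`, `DrcHeap`, `DrcRuntime`, `Store`). These are not Wasmtime's actual trait definitions, and the toy heap is a bump allocator rather than a real DRC collector. The point is only that the store holds trait objects, so no collector type parameter leaks into embedder-facing types, at the cost of a virtual call per runtime-side heap operation.

```rust
/// Hypothetical, simplified per-store GC heap interface (not Wasmtime's).
pub trait GcHeap {
    /// Allocate `size` bytes, returning the 32-bit index of the new object.
    fn alloc(&mut self, size: u32) -> Option<u32>;
    /// Return the heap to a pristine state so another store can reuse it.
    fn reset(&mut self);
}

/// Hypothetical stand-in for `GcRuntime`: a factory for per-store heaps.
pub trait GcRuntime {
    fn new_gc_heap(&self) -> Box<dyn GcHeap>;
}

/// Toy bump allocator standing in for the DRC collector's heap.
pub struct DrcHeap {
    capacity: u32,
    next: u32,
}

impl GcHeap for DrcHeap {
    fn alloc(&mut self, size: u32) -> Option<u32> {
        let end = self.next.checked_add(size)?;
        if end > self.capacity {
            return None;
        }
        let index = self.next;
        self.next = end;
        Some(index)
    }

    fn reset(&mut self) {
        self.next = 0;
    }
}

pub struct DrcRuntime;

impl GcRuntime for DrcRuntime {
    fn new_gc_heap(&self) -> Box<dyn GcHeap> {
        Box::new(DrcHeap { capacity: 1 << 16, next: 0 })
    }
}

/// The store is not generic over the collector; it holds a trait object,
/// so every call through `gc_heap` is a virtual call.
pub struct Store {
    pub gc_heap: Box<dyn GcHeap>,
}

impl Store {
    pub fn new(runtime: &dyn GcRuntime) -> Store {
        Store { gc_heap: runtime.new_gc_heap() }
    }
}

/// The static alternative: choose one collector at build time and name it
/// with a type alias instead of going through `dyn`.
pub type TheOneGcImpl = DrcRuntime;
```

For example, `Store::new(&DrcRuntime)` yields a store whose `gc_heap.alloc(16)` returns `Some(0)`, and after `gc_heap.reset()` the same space is handed out again, which is the kind of reuse a pooling allocator wants. Using `TheOneGcImpl` and its concrete heap type directly, instead of `dyn GcRuntime` and `Box<dyn GcHeap>`, would let the compiler resolve and inline these calls statically.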

All that said, there is still only a single implementation of each of these traits: the existing deferred reference-counting (DRC) collector. So there is a bunch of code motion in this commit, as the DRC collector was further isolated from the rest of the runtime and moved to its own submodule. However, this was not *purely* code motion (see "Indexed Heaps" below), so reviewers should not simply skip over the DRC collector's code.

### Indexed Heaps

This commit does bake in a couple of assumptions that must be shared across all collector implementations, such as a shared `VMGcHeader` that every object allocated within a GC heap must begin with, but the most notable and far-reaching of these assumptions is that all collectors will use "indexed heaps".

What we are calling indexed heaps boils down to the following three invariants:

  1. All GC heaps will be a single contiguous region of memory, and all GC objects will be allocated within this region of memory. The collector may ask the system allocator for additional memory, e.g. to maintain its free lists, but GC objects themselves will never be allocated via `malloc`.

  2. A pointer to a GC-managed object (i.e. a `VMGcRef`) is a 32-bit offset into the GC heap's contiguous region of memory. We never hold raw pointers to GC objects (although, of course, we have to compute them and use them temporarily when actually accessing objects). This means that dereferencing GC pointers is equivalent to dereferencing linear memory pointers: we need to add a base, and we also check that the GC pointer/index is within the bounds of the GC heap. Furthermore, compressing 64-bit pointers into 32 bits is a fairly common technique among high-performance GC implementations[^1][^2], so we are in good company.

  3. Anything stored inside the GC heap is untrusted. Even a GC reference stored as an element of an `(array (ref any))` is untrusted and bounds-checked on access. This means that, for example, we do not store the raw pointer to an `externref`'s host object inside the GC heap. Instead, an `externref` now stores an ID that indexes into a side table in the store, which holds the actual `Box<dyn Any>` host object, and accesses to that side table are always checked.
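The three invariants can be illustrated with a small sketch, assuming hypothetical types (`GcIndex`, `IndexedHeap`, `ExternRefTable`) that are not Wasmtime's API. The heap is one contiguous byte buffer, references are `u32` indices into it, every field access is bounds-checked, and an `externref` field holds only an ID into a checked side table of `Box<dyn Any>` host objects. This sketch uses explicit checks; as the next paragraph notes, a real engine can instead rely on the same virtual-memory tricks as linear memory.

```rust
use std::any::Any;
use std::ops::Range;

/// A GC reference: a 32-bit index into the heap, never a raw pointer.
#[derive(Debug, PartialEq, Eq)]
pub struct GcIndex(pub u32);

/// One contiguous region holding every GC object (invariant 1).
pub struct IndexedHeap {
    bytes: Vec<u8>,
}

impl IndexedHeap {
    pub fn new(size: usize) -> Self {
        IndexedHeap { bytes: vec![0; size] }
    }

    /// Byte range of a 4-byte field, rejecting anything that overflows or
    /// falls outside the heap (invariants 2 and 3).
    fn field_range(&self, obj: &GcIndex, offset: u32) -> Option<Range<usize>> {
        let start = (obj.0 as usize).checked_add(offset as usize)?;
        let end = start.checked_add(4)?;
        if end <= self.bytes.len() {
            Some(start..end)
        } else {
            None
        }
    }

    pub fn read_u32(&self, obj: &GcIndex, offset: u32) -> Option<u32> {
        let range = self.field_range(obj, offset)?;
        let mut buf = [0u8; 4];
        buf.copy_from_slice(&self.bytes[range]);
        Some(u32::from_le_bytes(buf))
    }

    pub fn write_u32(&mut self, obj: &GcIndex, offset: u32, value: u32) -> Option<()> {
        let range = self.field_range(obj, offset)?;
        self.bytes[range].copy_from_slice(&value.to_le_bytes());
        Some(())
    }
}

/// Host objects behind `externref`s live outside the heap; the heap only
/// stores an ID into this table (invariant 3).
pub struct ExternRefTable {
    objects: Vec<Box<dyn Any>>,
}

impl ExternRefTable {
    pub fn new() -> Self {
        ExternRefTable { objects: Vec::new() }
    }

    pub fn insert(&mut self, obj: Box<dyn Any>) -> u32 {
        self.objects.push(obj);
        (self.objects.len() - 1) as u32
    }

    /// IDs read back from the untrusted heap are validated, never trusted.
    pub fn get(&self, id: u32) -> Option<&dyn Any> {
        self.objects.get(id as usize).map(|obj| &**obj)
    }
}
```

A corrupted index such as `GcIndex(u32::MAX)` simply fails the check and yields `None` (where a real engine would trap); it can never reach memory outside `bytes`, and a garbage ID read from the heap can only miss in, or select a different entry of, the side table.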

The good news with regard to all the bounds checking this scheme implies is that we can use all the same virtual memory tricks that linear memories use to omit explicit bounds checks. Additionally, (2) means that GC objects are that much smaller (and therefore that much more cache-friendly), because they hold only 32-bit, rather than 64-bit, references to other GC objects. (In the future, we can support GC heaps up to 16 GiB in size without giving up 32-bit GC pointers by taking advantage of `VMGcHeader` alignment and storing aligned indices rather than byte indices, while still leaving the bottom bit available for tagging as an `i31ref` discriminant. Should we ever need to support even larger GC heaps, we could go to full 64-bit references, but we would need explicit bounds checks.)
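The arithmetic behind the 16 GiB figure, as a sketch: assuming every GC object (and so every `VMGcHeader`) is 8-byte aligned and that a set bottom bit marks an `i31ref` (both are illustrative assumptions; the PR does not pin down these details), a 32-bit reference has 31 bits left for an aligned slot number, and 2^31 slots × 8 bytes = 2^34 bytes = 16 GiB. The names `GC_ALIGN`, `I31_TAG`, `encode_i31`, `encode_heap_offset`, and `decode` are hypothetical.

```rust
/// Assumed alignment of every GC object / `VMGcHeader` (illustrative only).
pub const GC_ALIGN: u64 = 8;

/// Assumed i31ref discriminant: the bottom bit of a 32-bit reference.
pub const I31_TAG: u32 = 1;

/// 2^31 aligned slots (bit 0 is reserved for the tag) times 8 bytes each.
pub const MAX_HEAP_BYTES: u64 = (1u64 << 31) * GC_ALIGN;

#[derive(Debug, PartialEq, Eq)]
pub enum Decoded {
    I31(u32),
    HeapOffset(u64),
}

/// Pack the low 31 bits of `value` above the tag bit.
pub fn encode_i31(value: u32) -> u32 {
    (value << 1) | I31_TAG
}

/// Encode an aligned byte offset as (offset / 8) << 1, leaving bit 0 clear.
pub fn encode_heap_offset(byte_offset: u64) -> Option<u32> {
    if byte_offset % GC_ALIGN != 0 {
        return None;
    }
    let raw = (byte_offset / GC_ALIGN) * 2;
    if raw > u64::from(u32::MAX) {
        return None; // Would need more than 32 bits: beyond 16 GiB.
    }
    Some(raw as u32)
}

pub fn decode(raw: u32) -> Decoded {
    if (raw & I31_TAG) != 0 {
        Decoded::I31(raw >> 1)
    } else {
        Decoded::HeapOffset(u64::from(raw) / 2 * GC_ALIGN)
    }
}
```

With this encoding, `encode_heap_offset(4096)` produces the even value `1024`, and the largest encodable offset is 16 GiB minus 8 bytes; anything beyond that would need the full 64-bit references mentioned above.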

The biggest benefit of indexed heaps is that, because we are (explicitly or implicitly) bounds checking GC heap accesses, and because we do not otherwise trust any data from inside the GC heap, we greatly limit how badly things can go wrong in the face of collector bugs and GC heap corruption. We are essentially sandboxing the GC heap region, the same way that linear memory is a sandbox. GC bugs could lead to the guest program accessing the wrong GC object or getting garbage data from within the GC heap, but only ever data from within the GC heap, never from outside it. The worst that could happen is this: if we decided not to zero out GC heaps between reuses across stores (a valid trade-off, since zeroing a GC heap is a defense-in-depth technique, similar to zeroing a Wasm stack, and is not semantically visible in the absence of GC bugs), then a GC bug could allow the current Wasm guest to read old GC data from the previous Wasm guest that used this GC heap. But again, it could never access host data.

Taken together, this allows for collector implementations that are nearly free of `unsafe` code, with the remaining unsafety targeted and limited in scope, such as interactions with JIT code. Most importantly, we do not have to maintain critical invariants across the whole system (invariants which can't be nicely encapsulated or abstracted) to preserve memory safety. Such holistic invariants that resist encapsulation are otherwise a huge source of safety problems in GC implementations.

### `VMGcRef` is *NOT* `Clone` or `Copy` Anymore

`VMGcRef` used to be `Clone` and `Copy`. It is not anymore. The motivation was to be sure that I was actually calling GC barriers in all the correct places; before, I couldn't be sure. You can still explicitly copy a raw GC reference without running GC barriers if you need to and understand why that's okay (i.e. you are implementing the collector), but you have to opt into that explicitly by calling `unchecked_copy`. By default, you can't just copy the reference; instead you call an explicit `clone` method (not *the* `Clone` trait, because we need to pass in the GC heap context to run the GC barriers), and it is hard to forget to do that accidentally. This resulted in a pretty big amount of churn, but I am wayyyyyy more confident than I was before that the correct GC barriers are called at the correct times.
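A minimal sketch of this pattern, with hypothetical names (`GcRef`, `Heap`, `alloc`, `ref_count`) and a reference-count increment standing in for the clone barrier, roughly in the spirit of the DRC collector but not its actual code: the reference type derives neither `Clone` nor `Copy`, the default duplication path takes the heap so it can run the barrier, and `unchecked_copy` is the explicit opt-out.

```rust
use std::collections::HashMap;

/// Hypothetical GC reference: deliberately derives neither `Clone` nor `Copy`.
#[derive(Debug, PartialEq, Eq)]
pub struct GcRef(u32);

/// Hypothetical heap whose "clone barrier" is a reference-count increment.
pub struct Heap {
    ref_counts: HashMap<u32, u64>,
}

impl Heap {
    pub fn new() -> Self {
        Heap { ref_counts: HashMap::new() }
    }

    /// Pretend to allocate an object at `index`, returning its first reference.
    pub fn alloc(&mut self, index: u32) -> GcRef {
        self.ref_counts.insert(index, 1);
        GcRef(index)
    }

    pub fn ref_count(&self, r: &GcRef) -> u64 {
        self.ref_counts.get(&r.0).copied().unwrap_or(0)
    }
}

impl GcRef {
    /// Default way to duplicate a reference: an inherent method (not the
    /// `Clone` trait) that takes the heap so it can run the barrier.
    pub fn clone(&self, heap: &mut Heap) -> GcRef {
        *heap.ref_counts.entry(self.0).or_insert(0) += 1;
        GcRef(self.0)
    }

    /// Explicit opt-out for collector internals: copies the bits and runs
    /// no barrier, so the caller must know why that is sound.
    pub fn unchecked_copy(&self) -> GcRef {
        GcRef(self.0)
    }
}
```

Because `GcRef` is not `Copy`, `let d = a;` moves `a`, so the only ways to end up with two live handles are `clone(&mut heap)`, which runs the barrier, or the deliberately named `unchecked_copy`, which is easy to spot in review.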

### `i31ref`

I started this commit by trying to add `i31ref` support. It grew into the whole traits interface because I found that I needed to abstract GC barriers into helpers anyway to avoid running them for `i31ref`s, so I figured I might as well add the whole traits interface. In comparison, `i31ref` support is much easier and smaller than that other part! But it was also difficult to pull apart from this commit, sorry about that!
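To see why `i31ref`s push GC barriers into helpers, here is a sketch of a DRC-style write barrier that skips reference-count bookkeeping for unboxed `i31ref`s. It assumes, for illustration only, that a raw reference of `0` is null and that a set bottom bit marks an `i31ref`; `DrcHeap`, `write_barrier`, and `count` are hypothetical names, and a real collector would also enqueue objects whose count drops to zero.

```rust
use std::collections::HashMap;

/// Assumed tag: a set bottom bit marks an unboxed i31ref.
const I31_TAG: u32 = 1;

/// Hypothetical DRC-style heap that only tracks reference counts.
pub struct DrcHeap {
    ref_counts: HashMap<u32, u64>,
}

/// Only real heap objects have reference counts: not null (assumed to be 0)
/// and not an i31ref, which has no heap object behind it.
fn is_heap_ref(raw: u32) -> bool {
    raw != 0 && (raw & I31_TAG) == 0
}

impl DrcHeap {
    pub fn new() -> Self {
        DrcHeap { ref_counts: HashMap::new() }
    }

    pub fn count(&self, raw: u32) -> u64 {
        self.ref_counts.get(&raw).copied().unwrap_or(0)
    }

    /// Store `new` into `slot`, adjusting reference counts for the incoming
    /// and outgoing values only when they are heap references. The increment
    /// happens first so overwriting a slot with its own value is safe.
    pub fn write_barrier(&mut self, slot: &mut u32, new: u32) {
        if is_heap_ref(new) {
            *self.ref_counts.entry(new).or_insert(0) += 1;
        }
        let old = std::mem::replace(slot, new);
        if is_heap_ref(old) {
            if let Some(count) = self.ref_counts.get_mut(&old) {
                // A real DRC collector would queue the object for
                // reclamation once this reaches zero; omitted here.
                *count = count.saturating_sub(1);
            }
        }
    }
}
```

The `i31ref` paths never touch `ref_counts`: an `i31ref` is just bits in the reference itself, so there is nothing to count, and centralizing that check in one helper is what keeps every barrier site from having to repeat it.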


Overall, I know this is a very large commit. I am super happy to have some synchronous meetings to walk through it all, give an overview of the architecture, answer questions directly, and so on, to make review easier!

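Footnotes

[^1]: See ["Compressed OOPs" in OpenJDK](https://wiki.openjdk.org/display/HotSpot/CompressedOops).

[^2]: See [V8's pointer compression](https://v8.dev/blog/pointer-compression).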

fitzgen requested review from alexcrichton and dicej March 20, 2024 20:57
fitzgen requested review from a team as code owners March 20, 2024 20:57
fitzgen requested review from cfallin and removed request for a team March 20, 2024 20:57
github-actions bot added the cranelift, cranelift:wasm, and fuzzing labels Mar 20, 2024
fitzgen removed the request for review from cfallin March 20, 2024 21:05
github-actions bot commented

Subscribe to Label Action

cc @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:wasm", "fuzzing"

Thus the following users have been cc'd because of the following labels:

  • fitzgen: fuzzing

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Comment thread crates/runtime/src/table.rs Outdated
fitzgen force-pushed the i31ref branch 3 times, most recently from d162c8c to b8337ee March 22, 2024 21:58

dicej (Contributor) left a comment

Looks great! I'm excited to see this move forward.

Please see a few inline comments and suggestions. The only one that might be a blocker is the cast from a pointer to a 64-bit value to a pointer to a 32-bit value due to endianness concerns. Based on our earlier conversation, sounds like you're planning to get rid of that anyway.

Comment thread crates/cranelift/src/gc/enabled.rs
Comment thread crates/runtime/src/gc/gc_runtime.rs Outdated
Comment thread crates/runtime/src/libcalls.rs Outdated
Comment thread crates/runtime/src/libcalls.rs Outdated
Comment thread crates/types/src/lib.rs Outdated
Comment thread crates/wasmtime/src/runtime/gc/enabled/anyref.rs Outdated
Comment thread crates/wasmtime/src/runtime/types.rs Outdated
Comment thread crates/wasmtime/src/runtime/types/matching.rs
Comment thread crates/wasmtime/src/runtime/store.rs Outdated
Comment thread crates/wasmtime/src/runtime/store.rs Outdated
Comment thread crates/wasmtime/src/runtime/gc/enabled/externref.rs Outdated
fitzgen force-pushed the i31ref branch 9 times, most recently from c929664 to fdf1159 April 2, 2024 13:42
fitzgen enabled auto-merge April 2, 2024 19:05
fitzgen force-pushed the i31ref branch 4 times, most recently from 8f45761 to 44c7308 April 2, 2024 21:15
fitzgen requested a review from a team as a code owner April 2, 2024 21:15
fitzgen added this pull request to the merge queue Apr 2, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 2, 2024
fitzgen enabled auto-merge April 2, 2024 21:41
github-actions bot added the wasmtime:c-api label Apr 2, 2024
github-actions bot commented Apr 2, 2024

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:c-api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:c-api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

fitzgen added this pull request to the merge queue Apr 2, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 2, 2024
fitzgen force-pushed the i31ref branch 2 times, most recently from 544066f to 80eb2a2 April 3, 2024 18:56
fitzgen enabled auto-merge April 3, 2024 18:56
fitzgen force-pushed the i31ref branch 2 times, most recently from 5ce2532 to 9fdf5db April 3, 2024 23:04
fitzgen added this pull request to the merge queue Apr 4, 2024
Merged via the queue into bytecodealliance:main with commit 0fa1301 Apr 4, 2024
fitzgen deleted the i31ref branch April 4, 2024 01:03

Labels: cranelift (Issues related to the Cranelift code generator), cranelift:wasm, fuzzing (Issues related to our fuzzing infrastructure), wasmtime:c-api (Issues pertaining to the C API.)

3 participants