Reimplement stubs to improve performance by janvorli · Pull Request #65738 · dotnet/runtime

janvorli · 2022-02-22T21:21:10Z

This change implements FixupPrecodeStub, PrecodeStub, CallCountingStub and VSD stubs LookupStub, DispatchStub and ResolveStub using a new mechanism with fixed code and separate RW data. The LoaderHeap was updated to support a new kind of allocation using interleaved code and data pages to support this new mechanism.
The JIT now generates code that uses an indirection slot to jump to methods through FixupPrecode, improving performance of the ASP.NET plaintext benchmark by 3-4% depending on the target platform (measured on x64 Windows / Linux and arm64 Linux).
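
To illustrate the "fixed code and separate RW data" idea, here is a minimal sketch of the address arithmetic. It assumes each code page is immediately followed by its data page, so a stub finds its data exactly one page after its own address. The constants `kPageSize` and `kStubSize` and the function names are hypothetical and chosen only for this example; the real values are target specific.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical values for illustration only.
constexpr std::size_t kPageSize = 4096; // assumed OS page size
constexpr std::size_t kStubSize = 24;   // assumed size of one stub's code

// Address of the code of stub `index` in a block that starts at `blockBase`
// (a code page immediately followed by a data page, by assumption).
constexpr std::uintptr_t StubCodeAddress(std::uintptr_t blockBase, std::size_t index) {
    return blockBase + index * kStubSize;
}

// Address of the RW data belonging to the stub whose code is at `codeAddress`.
// Because the distance is a constant, the stub code itself never needs patching.
constexpr std::uintptr_t StubDataAddress(std::uintptr_t codeAddress) {
    return codeAddress + kPageSize;
}

// Number of whole stubs that fit in one code page.
constexpr std::size_t StubsPerPage() {
    return kPageSize / kStubSize;
}
```

Since the code-to-data distance is the same for every stub, all code pages for a given stub type can hold identical bytes, and only the data pages need to be writable.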

I have also removed the Holders, as the stubs are naturally properly aligned due to the way they are allocated.

There is now only a single variant of each stub. The long / short variants are gone because they are no longer needed: the indirect jumps we use now are not range limited.

Most of the stub code is now target agnostic, and the implementation that was originally split per target now lives in a single place for all targets. Only a few constants in it are defined as target specific.

The code for the stubs is no longer generated as bytes by C++ code, but rather written in asm and compiled. These precompiled templates are then used as the source to copy the code from. x86 is a bit more complex because it doesn't support PC-relative indirect addressing, so we need to relocate all accesses to the data slots when generating the code pages.
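
The x86 relocation step can be sketched roughly as follows. This is not the code from this PR; it is a simplified model that assumes each template records the offsets of its 32-bit absolute operands, that each operand initially holds a field offset within the stub's data, and that the data page directly follows the code page. `StubTemplate`, `GenerateCodePage` and the helpers are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical description of one precompiled stub template.
struct StubTemplate {
    std::vector<std::uint8_t> code;          // template bytes as assembled
    std::vector<std::size_t> operandOffsets; // offsets of 32-bit absolute operands
};

inline std::uint32_t ReadU32(const std::uint8_t* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}

inline void WriteU32(std::uint8_t* p, std::uint32_t v) {
    std::memcpy(p, &v, sizeof(v));
}

// Fill a code page with `stubCount` copies of the template. Each operand is
// rebased so it points at the matching stub's data, assumed to live at
// codePageAddress + pageSize + (offset of that stub within the page).
inline std::vector<std::uint8_t> GenerateCodePage(const StubTemplate& tmpl,
                                                  std::size_t stubCount,
                                                  std::uint32_t codePageAddress,
                                                  std::uint32_t pageSize) {
    const std::size_t stubSize = tmpl.code.size();
    std::vector<std::uint8_t> page(stubCount * stubSize);
    if (stubSize == 0) {
        return page;
    }
    for (std::size_t i = 0; i < stubCount; ++i) {
        std::uint8_t* stub = page.data() + i * stubSize;
        std::memcpy(stub, tmpl.code.data(), stubSize);
        const std::uint32_t dataAddress =
            codePageAddress + pageSize + static_cast<std::uint32_t>(i * stubSize);
        for (std::size_t off : tmpl.operandOffsets) {
            assert(off + sizeof(std::uint32_t) <= stubSize);
            // The template operand holds a field offset; add the data base.
            WriteU32(stub + off, dataAddress + ReadU32(stub + off));
        }
    }
    return page;
}
```

On targets with PC-relative addressing this loop reduces to a plain copy, since the same bytes are valid at every stub position.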

As a further improvement, we could generate just a single page of the code and then just map it many times. This is left for future work.

ARM64 Unix differs from the other targets / platforms because various page sizes are in use there. The asm templates are therefore generated for 4k..64k page sizes, and the variant is picked at runtime based on the page size reported by the OS.
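
A minimal sketch of the runtime selection, assuming one template variant per power-of-two page size from 4k to 64k (the exact set of sizes is an assumption here, as is the name `SelectTemplateIndex`):

```cpp
#include <cassert>
#include <cstddef>

// Assumed range of supported page sizes; one template variant per power of two.
constexpr std::size_t kMinPageSize = 4 * 1024;
constexpr std::size_t kMaxPageSize = 64 * 1024;

// Returns the index of the template variant for `pageSize`
// (0 for 4k, 1 for 8k, ... 4 for 64k), or -1 if the size is unsupported.
inline int SelectTemplateIndex(std::size_t pageSize) {
    const bool isPowerOfTwo = pageSize != 0 && (pageSize & (pageSize - 1)) == 0;
    if (!isPowerOfTwo || pageSize < kMinPageSize || pageSize > kMaxPageSize) {
        return -1;
    }
    int index = 0;
    for (std::size_t size = kMinPageSize; size < pageSize; size <<= 1) {
        ++index;
    }
    return index;
}
```

On a POSIX system the input would typically come from `sysconf(_SC_PAGESIZE)`, queried once at startup.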

This also removes a lot of the writeable mappings created for modifications of the stub code when W^X is enabled. In the plaintext benchmark they were reduced by 75%, which results in a significant reduction of .NET application startup time with W^X enabled.

I think the LoaderHeap would benefit from some refactoring, but I'd prefer leaving it for a follow up. It seems that for the sake of the review, it is better to keep it as is.

The change also implements logging of the number of mappings and their exact locations. This helped me drive the work, and I am planning to use it for further changes. It can be removed in the future once we reach a final state.

There are still opportunities for improvement, but these stubs allowed me to scrape off the most significant portion of the mappings.

janvorli merged 11 commits into dotnet:main from janvorli:new-stubs


8 participants
