Reimplement stubs to improve performance #65738
Merged
janvorli merged 11 commits into dotnet:main on Mar 17, 2022
This change implements `FixupPrecodeStub`, `PrecodeStub`, `CallCountingStub` and the VSD stubs `LookupStub`, `DispatchStub` and `ResolveStub` using a new mechanism with fixed code and separate RW data. The `LoaderHeap` was updated to support a new kind of allocation using interleaved code and data pages to support this new mechanism.
The JIT now generates code that uses an indirection slot to jump to the methods using `FixupPrecode`, improving performance of the ASP.NET plaintext benchmark by 3-4% depending on the target platform (measured on x64 Windows / Linux and arm64 Linux).
I have also removed the Holders, as the stubs are naturally properly aligned due to the way they are allocated.
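To make the "fixed code and separate RW data" idea concrete, here is a minimal sketch of the address arithmetic for interleaved code and data pages. It assumes that every code page is immediately followed by a data page of the same size and that a stub's data occupies the same offset in the data page as its code does in the code page. `InterleavedLayout` and its members are hypothetical names for illustration, not the `LoaderHeap` API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of interleaved code/data pages:
//   [code page 0][data page 0][code page 1][data page 1]...
// Assumption: a stub's data sits exactly one page after its code.
struct InterleavedLayout {
    std::uintptr_t base;    // address of the first code page
    std::size_t pageSize;   // OS page size in bytes
    std::size_t stubSize;   // bytes reserved per stub (code and data alike)

    std::size_t StubsPerPage() const {
        assert(stubSize != 0 && stubSize <= pageSize);
        return pageSize / stubSize;
    }

    // Address of the code of the stub with the given index.
    std::uintptr_t CodeAddress(std::size_t index) const {
        std::size_t pagePair = index / StubsPerPage();
        std::size_t slot = index % StubsPerPage();
        return base + pagePair * 2 * pageSize + slot * stubSize;
    }

    // Address of the RW data of the same stub: a constant distance
    // from the code, so identical code works for every stub.
    std::uintptr_t DataAddress(std::size_t index) const {
        return CodeAddress(index) + pageSize;
    }
};
```

Because the distance between code and data is the same for every stub under this assumption, the code bytes never need to be patched per stub; only the data page is written.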
There is now only a single variant of each stub. The long / short variants are gone because they are no longer needed: the indirect jumps we use now are not range limited.
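As a rough illustration of why a range limit would force two variants, the sketch below checks whether a target is reachable by an x64 `jmp rel32`, whose 32-bit displacement is relative to the end of the 5-byte instruction. An indirect jump through a pointer-sized slot has no such limit. `FitsInRel32` is a hypothetical helper written for this example, not a function from the runtime.

```cpp
#include <cstdint>
#include <limits>

// Returns true if `target` can be reached from a 5-byte `jmp rel32`
// located at `jumpAddress` (x64 encoding: opcode E9 + 32-bit displacement).
bool FitsInRel32(std::uint64_t jumpAddress, std::uint64_t target) {
    // The displacement is measured from the end of the instruction.
    std::int64_t disp = static_cast<std::int64_t>(target - (jumpAddress + 5));
    return disp >= std::numeric_limits<std::int32_t>::min() &&
           disp <= std::numeric_limits<std::int32_t>::max();
}
```

A stub that jumps via a 64-bit slot never needs this check, which is why a single variant is enough.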
Most of the stub code is now target agnostic, and the originally split implementation now lives in a single place for all targets. Only a few constants are defined as target specific.
The code for the stubs is no longer generated as bytes by C++ code, but rather written in asm and compiled. These precompiled templates are then used as a source to copy the code from. x86 is a bit more complex than that because it doesn't support PC-relative indirect addressing, so we need to relocate all accesses to the data slots when generating the code pages.
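The template copying and the x86 relocation step can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: `StubTemplate`, `FillCodePage` and the relocation scheme (each patched 32-bit field initially holds an offset into the stub's data block and gets the absolute address of that block added) are hypothetical. For PC-relative targets the list of offsets would simply be empty.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical description of one precompiled stub template.
struct StubTemplate {
    const std::uint8_t* code;   // template bytes produced by the assembler
    std::size_t size;           // size of one stub in bytes
    // Offsets of 32-bit absolute address fields that need relocation
    // (assumed to be needed only on x86; empty elsewhere).
    std::vector<std::size_t> absoluteSlotOffsets;
};

// Fills `page` with as many copies of the template as fit and returns that
// count. `pageAddress` is the address the page will have once mapped; the
// matching data page is assumed to start at pageAddress + pageSize.
std::size_t FillCodePage(std::uint8_t* page, std::size_t pageSize,
                         std::uint32_t pageAddress, const StubTemplate& t) {
    std::size_t count = pageSize / t.size;
    for (std::size_t i = 0; i < count; ++i) {
        std::uint8_t* stub = page + i * t.size;
        std::memcpy(stub, t.code, t.size);

        // Absolute address of this stub's data block in the data page.
        std::uint32_t dataBlock =
            pageAddress + static_cast<std::uint32_t>(pageSize + i * t.size);

        // Relocate every absolute field: offset-in-data -> absolute address.
        for (std::size_t off : t.absoluteSlotOffsets) {
            std::uint32_t field;
            std::memcpy(&field, stub + off, sizeof field);
            field += dataBlock;
            std::memcpy(stub + off, &field, sizeof field);
        }
    }
    return count;
}
```

Using `memcpy` for the field access keeps the patching free of alignment and aliasing problems, since the fields sit at arbitrary byte offsets inside the instruction stream.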
As a further improvement, we could generate just a single page of the code and then just map it many times. This is left for future work.
ARM64 Unix differs from the other targets / platforms because various page sizes are in use there. The asm templates are therefore generated for 4k..64k page sizes, and the variant is picked at runtime based on the page size obtained from the OS.
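The runtime selection can be pictured with a small sketch. It assumes one template variant per power-of-two page size from 4k to 64k (4k, 8k, 16k, 32k, 64k); `TemplateVariantForPageSize` is a hypothetical helper. On a POSIX system the page size would typically come from `sysconf(_SC_PAGESIZE)`.

```cpp
#include <cstddef>

// Maps an OS page size to the index of the matching template variant:
// 0 -> 4k, 1 -> 8k, 2 -> 16k, 3 -> 32k, 4 -> 64k.
// Returns -1 when no variant was generated for that page size.
int TemplateVariantForPageSize(std::size_t pageSize) {
    int index = 0;
    for (std::size_t size = 4096; size <= 65536; size *= 2, ++index) {
        if (size == pageSize) {
            return index;
        }
    }
    return -1;
}
```

Returning a sentinel for unsupported sizes lets the caller fail early instead of copying a template whose code-to-data distance does not match the real page size.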
This also removes a lot of the writeable mappings created for modifications of the stub code when W^X is enabled; in the plaintext benchmark they were reduced by 75%. That results in a significant reduction of the .NET application startup time with W^X enabled.
I think the `LoaderHeap` would benefit from some refactoring, but I'd prefer leaving it for a follow up. It seems that for the sake of the review, it is better to keep it as is.
The change also implements logging of the number of mappings and their exact locations. This helped me to drive the work and I am planning to use it for further changes. It can be removed in the future once we reach a final state.
There are still opportunities for improvement, but these stubs allowed me to scrape off the most significant portion of the mappings.