Stabilize performance measurement

The code quality and performance of RyuJIT is tracked internally by running [MicroBenchmarks](https://github.com/dotnet/performance/tree/master/src/benchmarks/micro) in our performance lab.  We regularly triage the [performance issues](https://github.com/dotnet/runtime/issues?q=%22%5BPerf%22+in%3Atitle+author%3ADrewScoggins) opened by the .NET performance team. After going through these issues for past several months, we have identified some key points.

### Stability
Many times, the set of commits that are flagged as introducing regression in a benchmark, do not touch the code that is tested in the benchmark. In fact, the assembly code generated for the .NET code that is being tested is often identical and yet the measurements show differences. Some of our investigation reveals that the fluctuation in the benchmark measurements happen because of the misalignment of generated JIT code in process memory. Below is an example of [LoopReturn](https://github.com/dotnet/performance/blob/8aed638c9ee65c034fe0cca4ea2bdc3a68d2a6b5/src/benchmarks/micro/runtime/Layout/SearchLoops.cs#L30) benchmark that shows such behavior. 

![image](https://user-images.githubusercontent.com/12488060/95612182-cf194f00-0a17-11eb-84fe-cd6831006501.png)

It is very time consuming for .NET developers to do the analysis of benchmarks that regressed because of things that are out of control of .NET runtime. In the past, we have closed several issues like https://github.com/dotnet/runtime/issues/13770, https://github.com/dotnet/runtime/issues/39721 and https://github.com/dotnet/runtime/issues/39722 because they were regressions because of code alignment. A great example that we found out while investigating those issues was the change introduced in https://github.com/dotnet/runtime/pull/38586 eliminated a `test` instruction and should have showed improvement in the benchmarks, but introduced regression because the code (loop code inside method) now gets misaligned and the method runs slower.

Alignment issues was brought up few times in https://github.com/dotnet/runtime/issues/9912 and https://github.com/dotnet/runtime/issues/8108 and this issue tracks the progress towards the goal of stabilizing and possibly improving the performance of .NET apps that are heavily affected because of code alignment.

### Performance lab infrastructure
Once we address the code alignment issue, the next big thing will be to identify and make required infrastructure changes in our performance lab to make sure that it can easily flag such issues without needing much interaction from .NET developers. For example, https://github.com/dotnet/BenchmarkDotNet/issues/1513 proposes to make memory alignment in the benchmark run random to catch these issues early and once we address the underlying problem in .NET, we should never see bimodal behavior of those benchmarks. After that, if the performance lab does find a regression in the benchmark, we need to have robust tooling support to get possible metrics from performance runs so that a developer doing the investigation can easily identify the cause of regression. For example, identifying the time spent in various phases of .NET runtime like Jitting, Jit interface, Tier0/Tier1 JIT code, hot methods, instructions retired during benchmark execution and so forth.

### Reliable benchmarks collection
Lastly, for developers working on JIT, we want to identify set of benchmarks that are stable enough and can be trusted to give us reliable measurement whenever there is a need to verify the performance for changes done to the JIT codebase. This will help us conduct performance testing ahead of time and identify potential regressions rather than waiting it to happen in performance lab.

Here are set of work items that we have identified to achieve all the above:

### Code alignment work
* [x] Make method entries of JIT code having loops aligned at 32-byte boundary for xarch. (**Done in https://github.com/dotnet/runtime/pull/42909**)
* [x] Devise a mechanism to identify inner loops (the ones that don't have more loops inside) present in a method. (**Done: #44370**)
* [x] Experiment with heuristics that should decide if a particular inner loop needs alignment or not. (**Done: #44370**)
* [x] Calculate the padding that needs to be added to align the identified inner loop. (**Done: #44370**)
* [x] Basic: Add the padding near the loop header (either at the end of previous block or at the beginning of the loop header). (**Done: #44370**)
* [x] Before merging above work, measure the impact of loop alignment on Microbenchmarks. Tracked in https://github.com/dotnet/runtime/issues/44051. Also in https://github.com/dotnet/runtime/issues/43227#issuecomment-765684981
* [x] Identify benchmarks that get stabilized or whose performance is improved by the above work. (**Update**: See the analysis starting from https://github.com/dotnet/runtime/issues/43227#issuecomment-765684981)

### Future work
* [x] Make method entries of JIT code having loops aligned at relevant boundary for arm32/arm64. See section 4.8 of  https://developer.arm.com/documentation/swog309707/a.
* [ ] Combine branch tightening and loop alignment adjustments in single phase.
* [ ] Advanced: If padding proves to be costly, have a way to spread the padding throughout the method so the loop header gets aligned. Account the padding while doing branch tensioning.
   * [ ] Add padding at the blind spot like after `jmp` or `ret` instruction that comes before `align` instruction.
   * [ ] Add padding at the end of blocks that has lower weight than the block that precedes the loop header block.
   * [ ] Explore option to have a method misaligned such that we can skip the padding needed for loops and they get auto-aligned or aligned with minimal padding.
 * [ ] Perform loop alignment similar to #44370 for R2R code.
 * [ ] For x86, we should improve the encoding we use for multiple size `NOP` instructions. Today, we just output [repeated single byte](https://github.com/dotnet/runtime/blob/9a09cee7b96d43b1700710b4d2b7a03be656a34b/src/coreclr/jit/emitxarch.cpp#L9918-L9973) `90` , but could do better like we do for x64. 
 * [ ] Explore option to align enclosing loops if it can borrow some of the padding needed for inner loop. E.g. in a nested loop, if there is an inner loop that needs padding of 10 bytes and the outer loop can be aligned by adding padding of 4 bytes, then add padding of 4 bytes to outer loop and 6 bytes to inner loop. That way, both the loops are aligned.

### Performance tooling work
* [x] Add random memory alignment. Tracked by https://github.com/dotnet/BenchmarkDotNet/issues/1513 and https://github.com/dotnet/performance/pull/1587
* [ ] Add various tooling in PerfView, BenchmarkDotNet analyzers to add the required metric. (**WIP**)
* [x] Perform superpmi collection of microbenchmarks so codegen team can easily test the impact of their change on generated code of various benchmarks.  (https://github.com/dotnet/runtime/pull/47900/)
* [x] Add a new event to track memory allocated for JIT code that can be surfaced in PerfView through JITStats. PR: https://github.com/dotnet/runtime/pull/44030 and https://github.com/microsoft/perfview/pull/1289


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stabilize performance measurement #43227

Stability

Performance lab infrastructure

Reliable benchmarks collection

Code alignment work

Future work

Performance tooling work

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Stabilize performance measurement #43227

Description

Stability

Performance lab infrastructure

Reliable benchmarks collection

Code alignment work

Future work

Performance tooling work

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions