Hi,
I performance-tested a no-opt version of Cranelift with the ISLE mid-end optimizations removed. (You can confirm this at prosyslab@c0df585.)
Surprisingly, compared to the latest upstream Cranelift, the no-opt version produced significantly faster x86_64 code for blake3, keccak, and xchacha20. The experiment was conducted using sightglass-cli.
Given that only the mid-end rules were removed, some part of the codegen backend might be interacting badly with the mid-end output. I want to investigate this problem, but I'm completely lost as to which part to look at first. Any comments would be appreciated.
Here is the demonstration:
> cargo run --release -- benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin
Finished `release` profile [optimized] target(s) in 0.08s
Running `target/release/sightglass-cli benchmark benchmarks/blake3-scalar/benchmark.wasm --engine engines/wasmtime/v-main/libengine.so engines/wasmtime/v-no-opts/libengine.so --pin`
execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
Δ = 502562.18 ± 4271.59 (confidence = 99%)
no-opts/libengine.so is 2.55x to 2.58x faster than main/libengine.so!
[816778 823363.48 880808] main/libengine.so
[315506 320801.30 460046] no-opts/libengine.so
compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
Δ = 51372480.98 ± 1109163.62 (confidence = 99%)
no-opts/libengine.so is 1.19x to 1.20x faster than main/libengine.so!
[310860540 313559414.86 330384818] main/libengine.so
[260245364 262186933.88 278160088] no-opts/libengine.so
instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm
Δ = 17444.42 ± 10445.70 (confidence = 99%)
no-opts/libengine.so is 1.06x to 1.23x faster than main/libengine.so!
[90902 137381.98 258206] main/libengine.so
[87766 119937.56 198140] no-opts/libengine.so
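As a quick sanity check, the execution-phase numbers above are internally consistent. The sketch below recomputes Δ and the "x faster" range from the two printed means. The formula used for the range, 1 + (Δ ± error) / mean of the faster engine, is an assumption that happens to reproduce the printed values; it is not taken from the sightglass source.

```python
# Means and error copied from the execution-phase output above (cycles).
mean_main = 823363.48
mean_no_opts = 320801.30
delta_err = 4271.59

# Difference of the two means; should match the printed delta.
delta = mean_main - mean_no_opts

# Assumed formula for the reported range (not confirmed against sightglass).
low = 1 + (delta - delta_err) / mean_no_opts
high = 1 + (delta + delta_err) / mean_no_opts

print(f"delta = {delta:.2f}")
print(f"speedup range = {low:.2f}x to {high:.2f}x")
```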
Here is some data for other benchmarks as well.
These were measured with --iterations-per-process 10 --benchmark-phase execution --pin.
| bench | v-no-opts (cycles) | base (cycles) | change vs. base |
| --- | ---: | ---: | ---: |
| blake3-scalar | 320,225 | 868,750 | -63.14% |
| blake3-simd | 320,689 | 945,427 | -66.08% |
| bz2 | 88,887,466 | 86,904,121 | 2.28% |
| pulldown-cmark | 6,630,447 | 6,705,562 | -1.12% |
| regex | 209,902,394 | 211,477,705 | -0.74% |
| shootout-base64 | 383,700,851 | 352,817,318 | 8.75% |
| shootout-keccak | 25,589,899 | 49,540,506 | -48.35% |
| shootout-xchacha20 | 4,489,570 | 4,816,315 | -6.78% |
| spidermonkey | 644,434,235 | 627,374,660 | 2.72% |
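The last column of the table appears to be the relative change in cycles of v-no-opts against base, so a negative value means the no-opt build used fewer cycles. A minimal sketch of that arithmetic, using three rows copied from the table (the function names are only illustrative):

```python
# (v-no-opts cycles, base cycles) copied from the table above.
results = {
    "blake3-scalar": (320_225, 868_750),
    "shootout-keccak": (25_589_899, 49_540_506),
    "shootout-base64": (383_700_851, 352_817_318),
}

def relative_change(no_opts, base):
    """Percent change in cycles of no-opts relative to base (negative = fewer cycles)."""
    return (no_opts - base) / base * 100

def speedup(no_opts, base):
    """How many times faster no-opts is than base (values below 1 mean slower)."""
    return base / no_opts

for name, (no_opts, base) in results.items():
    print(f"{name}: {relative_change(no_opts, base):+.2f}%, {speedup(no_opts, base):.2f}x")
```

Read this way, the -63.14% for blake3-scalar corresponds to roughly a 2.71x speedup, while the 8.75% for shootout-base64 is a slowdown of the no-opt build.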