Fix performance regression in hvcat of simple matrices #57422
BioTurboNick wants to merge 12 commits into JuliaLang:master
Conversation
|
I don't think the test failure is related? It occurred testing
|
There's unfortunately extra overhead for everything else not intended to be addressed.
EDIT: Ugh, there seems to be an annoying trade-off in performance. I'll need to explore further.
|
I believe I got it. The overhead of the block copy was too much for small arrays, so I added a branch to use the original loop for those. The crossover point seemed to be around 4-8 elements, so I branched at >4. Two other aspects were addressed:
EDIT: There was one trade-off I didn't find an optimal solution for, and I settled on resolving it in favor of all-arrays as the more common choice (no change from master). If the elements to cat are all arrays, then the dimension calculation in
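The small-array branch described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code; the function name copy_block! and the offset parameters are hypothetical, and the real crossover threshold was tuned empirically at >4 elements as noted above.

```julia
# Hypothetical sketch of the size-based branch: for tiny source arrays,
# a plain element-wise loop beats the setup cost of a block copy; for
# larger ones, copying the block as a whole wins.
function copy_block!(dest::AbstractMatrix, src::AbstractMatrix, rowoff::Int, coloff::Int)
    if length(src) <= 4
        # small arrays: original element-wise loop
        for j in axes(src, 2), i in axes(src, 1)
            dest[rowoff + i, coloff + j] = src[i, j]
        end
    else
        # larger arrays: copy the whole block via an indexed broadcast
        dest[rowoff .+ axes(src, 1), coloff .+ axes(src, 2)] .= src
    end
    return dest
end
```

Both branches produce identical results; the threshold only trades off loop overhead against block-copy setup cost.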
|
@nanosoldier |
|
Your job failed. |
|
Is the nanosoldier failure something to do with the PR, or does it just need to be rerun? |
|
@vtjnash - should nanosoldier be rerun, or is something wrong that I need to fix? |
|
@nanosoldier |
|
Your job failed. |
|
It is a bug in BaseBenchmarks |
|
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. |
|
Thanks for getting it running! I'm guessing some of these are noise, and that the high-% ones indicate higher-variance tests, but these stand out for me to check into:
|
|
That higher variance one is a constant folded expression, so those just test the reliability of the CPU branch predictor |
|
None of the benchmark regressions of concern validated locally, so I think this PR is good to go? I saw differentials of 0.9-1.10 for all of the questionable ones between master and this branch rebased onto master, and I don't think they touch the cat code anyway. Summary: ~2x speedup for a range of hvcat-family operations.
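For reference, the kind of micro-benchmark behind these comparisons can be sketched as below. The specific arrays and iteration count are illustrative assumptions; BenchmarkTools.jl's @btime gives more reliable numbers, but this version stays dependency-free.

```julia
a = [1 2; 3 4]
b = [5 6; 7 8]

# The literal block syntax `[a b; a b]` lowers to an hvcat call whose
# first argument gives the number of blocks in each row:
m = hvcat((2, 2), a, b, a, b)
@assert m == [a b; a b]

# Rough timing with Base.@elapsed; run on master and on this branch
# (rebased onto master) to compare.
t = @elapsed for _ in 1:10_000
    hvcat((2, 2), a, b, a, b)
end
println("10k hvcat calls: ", t, " s")
```

Comparing the same loop across Julia sessions and branches is what produced the ~0.9-1.10 ratios mentioned above.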
|
I see a few regressions locally:
master:
pr:
I admit I'm confused, because I'm not touching the
These examples go through Line 1750 in 5fe89b8
EDIT: Oh, perhaps it's related to the
|
Well, now I'm also confused, because the performance is not stable across Julia sessions... (on master as well), so those "regressions" were just noise. But that is a lot of noise.
In this case I was running before loading SparseArrays.
Co-authored-by: Andy Dienes <51664769+adienes@users.noreply.github.com>
Force-pushed c5e9be3 to 8fb69a8
|
I couldn't reproduce these slowdowns, unfortunately. I reverted the
|
I noticed that in this PR I have the start of a fix for this regression: JuliaGPU/GPUArrays.jl#672. I would just need to gate the small-array scalar optimization on
|
@KristofferC can we please get this into 1.12? |
|
I think the consequences are still not entirely evaluated, e.g. I'm seeing this regression:
It might be easier to merge individual independent pieces of this PR, for example the precomputed
So, it turns out this is what the
|
and after #59025 (on top of this PR), it would be way faster still.
As I understand it, this PR contains four mostly independent optimizations:
The first three seem like pretty safe improvements. It's the last of these that strikes me as more fragile and harder to evaluate. For example
|
How will #61426 interact with this? What would you recommend I look at to test the block-copying better? Also, I mentioned earlier that strictly making the for-loop optimization only applicable to
As pointed out by @Zentrik here, recent Nanosoldier runs suggested a significant performance regression for simple hvcats as a result of #39729. I revisited the code and determined that the main cause was that typed_hvncat iterated over each element and had to calculate the linear index for each one, resulting in many multiplication and addition operations per element. I realized that CartesianIndices could be used to restore the copy-as-a-whole pattern that typed_hvcat used, while retaining generality for arbitrary dimensions.

As I recall, a limitation when I wrote the hvncat code was that certain features were not available during compiler bootstrapping, requiring fully manual indexing. Since the compiler has been made a stdlib, I believe that made this PR possible.

Before merging I would also want to check that I didn't hurt the hvncat performance at all. (Done)

This should ideally be marked for 1.12 backport.
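The CartesianIndices idea described in the PR text can be sketched as follows. This is a hedged illustration, not the exact PR code; the function name place_block! and the offsets parameter are hypothetical, but the pattern of building the destination index block once and copying the source array as a whole (instead of computing a linear index per element) is the one described above.

```julia
# Sketch: place a source array into a destination at per-dimension
# offsets by constructing the destination region as CartesianIndices
# and delegating to Base's multidimensional copyto!.
function place_block!(dest::AbstractArray, src::AbstractArray, offsets::Dims)
    # destination region = source axes shifted by the offsets; this
    # works for arbitrary dimensionality, unlike the 2-D typed_hvcat path
    region = CartesianIndices(ntuple(d -> offsets[d] .+ axes(src, d), ndims(src)))
    copyto!(dest, region, src, CartesianIndices(src))
    return dest
end
```

copyto! with two CartesianIndices arguments copies the block in one call, avoiding the per-element multiply-and-add index arithmetic that typed_hvncat previously performed.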