
Fix performance regression in hvcat of simple matrices #57422

Open
BioTurboNick wants to merge 12 commits into JuliaLang:master from BioTurboNick:fix-perf-hvcat-mats

Conversation

@BioTurboNick
Contributor

@BioTurboNick BioTurboNick commented Feb 15, 2025

As pointed out by @Zentrik here, recent Nanosoldier runs suggested a significant performance regression for simple hvcats as a result of #39729.

I revisited the code and determined that the main cause was that typed_hvncat iterated over each element and had to calculate the linear index for each one, resulting in many multiplication and addition operations per element. I realized that CartesianIndices could be used to restore the copy-as-a-whole pattern that typed_hvcat used, while retaining generality for arbitrary dimensions.
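The copy-as-a-whole pattern can be sketched roughly like this (a minimal illustration with a hypothetical blockcopy! helper; the actual Base internals differ):

```julia
# Hypothetical sketch: place array `a` into `A` as one block at `offsets`,
# using CartesianIndices instead of computing a linear index per element.
function blockcopy!(A::AbstractArray, a::AbstractArray, offsets::NTuple{N,Int}) where {N}
    # Destination region: each axis of `a`, shifted by its offset.
    dest = CartesianIndices(ntuple(i -> offsets[i] .+ axes(a, i), Val(N)))
    A[dest] = a  # whole-block assignment, as typed_hvcat effectively did
    return A
end

A = zeros(Int, 4, 4)
blockcopy!(A, [1 2; 3 4], (2, 2))  # fills A[3:4, 3:4]
```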

As I recall, a limitation when I wrote the hvncat code was that certain features were not available during compiler bootstrapping, requiring fully manual indexing. Now that the compiler has been made a stdlib, I believe this PR has become possible.

Before merging I would also want to check that I haven't hurt hvncat performance at all. EDIT: Done.

This should ideally be marked for 1.12 backport.

@KristofferC KristofferC added the "performance (Must go faster)" and "backport 1.12 (Change should be backported to release-1.12)" labels Feb 15, 2025
@BioTurboNick
Contributor Author

BioTurboNick commented Feb 15, 2025

I don't think the test failure is related? It occurred while testing the Profile module... EDIT: Yep, not related.

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 15, 2025

Unfortunately, there's extra overhead for everything else this PR wasn't intended to address, mostly because getindex with CartesianIndices relies on slow integer division via _ind2sub.
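For context, converting a linear index back to N-d subscripts (what _ind2sub does) costs a div/rem pair per dimension for every element. A rough sketch of the conversion, not the Base implementation:

```julia
# Minimal sketch of linear-index -> subscripts conversion (cf. Base._ind2sub):
function lin2sub(dims::NTuple{N,Int}, l::Int) where {N}
    l -= 1
    ntuple(N) do i
        d = dims[i]
        r = rem(l, d)   # one integer div/rem pair per dimension,
        l = div(l, d)   # paid for every element in the hot loop
        r + 1
    end
end

lin2sub((3, 4), 7)                  # (1, 3)
Tuple(CartesianIndices((3, 4))[7])  # (1, 3), the same conversion
```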

EDIT: Ugh, there seems to be an annoying trade-off in performance. I'll need to explore further.

@KristofferC KristofferC mentioned this pull request Feb 15, 2025
31 tasks
@BioTurboNick
Contributor Author

BioTurboNick commented Feb 16, 2025

I believe I got it. The overhead of the block copy was too much for small arrays, so I added a branch to use the original loop for those. The crossover point seemed to be around 4-8 elements, so I branched at >4.
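The branch described above might look roughly like this (illustrative names; the >4 threshold is from this comment, and the real PR code differs):

```julia
# Sketch: choose block copy vs. per-element loop based on input size.
function place!(A::AbstractArray, a::AbstractArray, offsets::NTuple{N,Int}) where {N}
    dest = CartesianIndices(ntuple(i -> offsets[i] .+ axes(a, i), Val(N)))
    if length(a) > 4
        A[dest] = a                 # block copy: setup cost is amortized
    else
        for (I, v) in zip(dest, a)  # tiny inputs: the plain loop is cheaper
            A[I] = v
        end
    end
    return A
end
```

Both branches produce identical results; only the copying strategy differs.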

Two other aspects addressed:

  • 1d arrays of pure numbers were a bit slow compared with cat, so I adopted its approach
  • Identified a significant performance reduction in an important case (see below), and found unusual time spent in setindex_shape_check. Adding @inline eliminated the bottleneck entirely, though could that be a symptom of a broader regression?
const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1
e17() = [x ;;; x ;;;; y ;;; y] # 99.356 ns (6 allocations: 544 bytes), was 4x slower with many more allocations

EDIT: There was one trade-off I didn't find an optimal solution for, and I settled on resolving it in favor of all-arrays as the more common case (no change from master). If the elements to cat are all arrays, then the dimension calculation in _typed_hvncat_dims is more efficient iterating over eachindex of the tuple and indexing into it. If the elements are a mixture of arrays and scalars, then iterating over the elements with enumerate is more efficient. If the strategies are swapped, there's substantial overhead from indexing into the tuple (mixed arrays and scalars) or from the iteration itself (all arrays). Ultimately not a big impact, but a bit of a gripe that the compiler can be fickle like that.
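The two iteration styles at issue can be sketched as follows (illustrative only; the cat_size-style handling of scalars is simplified here, relying on size(x::Number, d) == 1):

```julia
as = ([1 2; 3 4], [5 6; 7 8], 9)   # a mixed tuple of arrays and a scalar

# Style 1: iterate indices and index into the tuple
#          (faster when every element is an array)
rows_by_index(t) = sum(i -> size(t[i], 1), eachindex(t))

# Style 2: iterate the elements themselves
#          (faster for mixed arrays and scalars)
rows_by_element(t) = sum(a -> size(a, 1), t)

rows_by_index(as) == rows_by_element(as) == 5
```

Both compute the same dimension total; which one the compiler handles well depends on the tuple's element types, which is the fickleness noted above.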

Member

@vtjnash vtjnash left a comment


SGTM

@vtjnash
Member

vtjnash commented Oct 16, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your job failed.

@BioTurboNick
Contributor Author

Is the nanosoldier failure something to do with the PR, or does it just need to be rerun?

@BioTurboNick
Contributor Author

@vtjnash - should nanosoldier be rerun, or is something wrong that I need to fix?

@vtjnash
Member

vtjnash commented Nov 21, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your job failed.

@vtjnash
Member

vtjnash commented Nov 21, 2025

It is a bug in BaseBenchmarks:

      From worker 3:    ERROR: LoadError: UndefVarError: `WorldView` not defined in `BaseBenchmarks.InferenceBenchmarks`
      From worker 3:    Suggestion: this global was defined as `Compiler.WorldView` but not assigned a value.
      From worker 3:    Stacktrace:
      From worker 3:      [1] top-level scope
      From worker 3:        @ /home/nanosoldier/.julia/dev/BaseBenchmarks/src/inference/InferenceBenchmarks.jl:88

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@BioTurboNick
Contributor Author

Thanks for getting it running!

I'm guessing some of these are noise, and that the high-% ones indicate higher-variance tests, but these stand out for me to check into:

| test | time | memory |
| --- | --- | --- |
| ["array", "cat", ("catnd", 5)] | 0.97 (5%) | 0.93 (1%) ✅ |
| ["array", "cat", ("catnd_setind", 5)] | 0.51 (5%) ✅ | 0.89 (1%) ✅ |
| ["array", "cat", ("hvcat", 5)] | 0.64 (5%) ✅ | 1.00 (1%) |
| ["array", "cat", ("hvcat", 500)] | 0.53 (5%) ✅ | 1.00 (1%) |
| ["sparse", "transpose", ("adjoint", "(20000, 10000)")] | 0.59 (30%) ✅ | 1.00 (1%) |
| ["union", "array", ("skipmissing", "perf_sumskipmissing", "Union{Nothing, Int64}", 0)] | 0.80 (5%) ✅ | 1.00 (1%) |
| ["find", "findprev", ("Vector{Bool}", "50-50")] | 3.21 (5%) ❌ | 1.00 (1%) |
| ["scalar", "atan2", ("x one", "Float32")] | 1.42 (5%) ❌ | 1.00 (1%) |
| ["scalar", "intfuncs", ("# 6", "UInt64", "+")] | 1.57 (25%) ❌ | 1.00 (1%) |
| ["sparse", "transpose", ("adjoint", "(20000, 20000)")] | 8.45 (30%) ❌ | 1.00 (1%) |
| ["tuple", "reduction", ("sum", "(8, 8)")] | 1.34 (5%) ❌ | 1.00 (1%) |

@vtjnash
Member

vtjnash commented Feb 18, 2026

That higher-variance one is a constant-folded expression, so those just test the reliability of the CPU branch predictor.

@BioTurboNick
Contributor Author

None of the benchmark regressions of concern validated locally, so I think this PR is good to go? I saw ratios of 0.9-1.10 for all of the questionable ones between master and this branch rebased onto master, and I don't think they touch the cat code anyway.

Summary: ~2x speedup for a range of hvcat-family operations.

@adienes
Member

adienes commented Feb 19, 2026

I see a few regressions locally:

master:

julia> @benchmark vcat(x, y) setup=(x=rand(1, 100); y=rand(1, 100))
BenchmarkTools.Trial: 10000 samples with 192 evaluations per sample.
 Range (min … max):  508.010 ns … 50.128 μs  ┊ GC (min … max):  0.00% … 98.55%
 Time  (median):     533.505 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   679.812 ns ±  1.098 μs  ┊ GC (mean ± σ):  11.34% ± 11.14%

  █▃▃▂             ▁▃▂                                         ▁
  ████▅▄▄▃▃▁▃▁▃▁▃▁▃███▇▇▆▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆█▄▇ █
  508 ns        Histogram: log(frequency) by time      3.65 μs <

 Memory estimate: 1.64 KiB, allocs estimate: 2.

julia> @benchmark vcat(x, y) setup=(x=rand(2, 2); y=rand(2, 2))
BenchmarkTools.Trial: 10000 samples with 986 evaluations per sample.
 Range (min … max):  53.007 ns …  11.198 μs  ┊ GC (min … max):  0.00% … 98.34%
 Time  (median):     56.499 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   78.872 ns ± 253.613 ns  ┊ GC (mean ± σ):  14.23% ±  5.81%

  ▃█▇▅▂    ▁▁                                  ▁▂▂▂▂▂▂▂▁       ▂
  ██████▇▇▇███▇▇▇▇▇▆▇▆▆▆▆▇▇▆▆▇▅▄▃▁▁▁▁▁▃▁▁▁▃▁▁▁██████████▇▇▇▆▇▇ █
  53 ns         Histogram: log(frequency) by time       150 ns <

 Memory estimate: 144 bytes, allocs estimate: 2.

julia> using SparseArrays

julia> @benchmark vcat(x, y) setup=(x=sprand(10, 10, 0.3); y=sprand(10, 10, 0.3))
BenchmarkTools.Trial: 10000 samples with 189 evaluations per sample.
 Range (min … max):  494.032 ns … 63.577 μs  ┊ GC (min … max):  0.00% … 98.08%
 Time  (median):     627.537 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   793.259 ns ±  1.513 μs  ┊ GC (mean ± σ):  13.05% ±  8.97%

   ▃▄▅▅██▇▆▆▆▆▅▄▃▃▃▃▂▂▁▁         ▁ ▁▂▂▁▂▂▁▁                    ▃
  ▇████████████████████████▇█▇▇▇████████████████▆▇▆▆▃▅▅▄▃▄▁▁▁▃ █
  494 ns        Histogram: log(frequency) by time      1.58 μs <

 Memory estimate: 1008 bytes, allocs estimate: 9.

pr:

julia> @benchmark vcat(x, y) setup=(x=rand(1, 100); y=rand(1, 100))
BenchmarkTools.Trial: 10000 samples with 177 evaluations per sample.
 Range (min … max):  658.831 ns … 78.132 μs  ┊ GC (min … max): 0.00% … 97.60%
 Time  (median):     681.113 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   827.791 ns ±  1.256 μs  ┊ GC (mean ± σ):  9.59% ± 11.13%

  █▃▃                ▂▂                                      ▁ ▁
  ███▆▃▄▃▃▁▁▁▁▇▃▄▁▇█████▇▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
  659 ns        Histogram: log(frequency) by time      3.59 μs <

 Memory estimate: 1.64 KiB, allocs estimate: 2.

julia> @benchmark vcat(x, y) setup=(x=rand(2, 2); y=rand(2, 2))
BenchmarkTools.Trial: 10000 samples with 981 evaluations per sample.
 Range (min … max):  59.910 ns …  10.886 μs  ┊ GC (min … max):  0.00% … 98.93%
 Time  (median):     64.135 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   84.042 ns ± 227.515 ns  ┊ GC (mean ± σ):  12.12% ±  6.34%

  ▄▅█▆▂▁                                          ▁▂▃▂▁▂▂▂     ▂
  ████████▇██▇▇████▇▆▆▅▅▅▃▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▄▄▇█████████▆▆▆ █
  59.9 ns       Histogram: log(frequency) by time       153 ns <

 Memory estimate: 144 bytes, allocs estimate: 2.

julia> using SparseArrays

julia> @benchmark vcat(x, y) setup=(x=sprand(10, 10, 0.3); y=sprand(10, 10, 0.3))
BenchmarkTools.Trial: 10000 samples with 163 evaluations per sample.
 Range (min … max):  648.718 ns … 70.167 μs  ┊ GC (min … max):  0.00% … 98.16%
 Time  (median):     762.248 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   910.712 ns ±  1.446 μs  ┊ GC (mean ± σ):  10.24% ±  9.01%

  ██▄▃▁▂▄▃▁                                                    ▂
  █████████▇▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▄ █
  649 ns        Histogram: log(frequency) by time      6.77 μs <

 Memory estimate: 944 bytes, allocs estimate: 9.

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 19, 2026

I see a few regressions locally:
[...]

I admit I'm confused, because I'm not touching the vcat path here...

These examples go through vcat in SparseArrays (🏴‍☠️) and end up in Base._typed_vcat here:

function _typed_vcat(::Type{T}, A::AbstractVecOrTuple{AbstractVecOrMat}) where T

EDIT: Oh perhaps it's related to the @inline in setindex_shape_check. I'll verify.

@adienes
Member

adienes commented Feb 19, 2026

Well, now I'm also confused, because the performance is not stable across Julia sessions (on master as well), so those "regressions" were just noise. But that is a lot of noise.

These examples go through vcat in SparseArrays (🏴‍☠️)

In this case I was running before loading SparseArrays

@BioTurboNick
Contributor Author

I couldn't reproduce these slowdowns, unfortunately. I reverted the @inline because it doesn't seem to impact anything; I assume I saw a reason to do so originally, but I don't recall what it was.

@BioTurboNick
Contributor Author

I noticed that this PR contains the start of a fix for this regression: JuliaGPU/GPUArrays.jl#672

I would just need to gate the small-array scalar optimization on isa(a, Array), but I can save that for a follow-up PR.
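A possible shape for that gate, as a purely hypothetical follow-up sketch (the predicate name is invented):

```julia
# Hypothetical predicate: take the small-array scalar loop only for plain
# Array, so wrapper and GPU array types (which may lack cheap scalar
# indexing) always stay on the block-copy path.
use_scalar_loop(a::AbstractArray) = a isa Array && length(a) <= 4

use_scalar_loop([1 2; 3 4])              # true: small dense Array
use_scalar_loop(view([1 2; 3 4], :, :))  # false: SubArray, not a plain Array
```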

@BioTurboNick
Contributor Author

@KristofferC can we please get this into 1.12?

@adienes
Member

adienes commented Mar 26, 2026

I think the consequences are still not entirely evaluated; e.g., I'm seeing this regression:

julia> using BenchmarkTools

julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  321.659 ns (6 allocations: 544 bytes) # master
  488.887 ns (10 allocations: 736 bytes) # PR

It might be easier to merge individual independent pieces of this PR; for example, the precomputed cumprod seems like an obvious strict win.
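The precomputed-cumprod piece referred to here can be sketched like this (an illustration of the idea only, not the PR's code; make_strides and lin are invented names):

```julia
# Strides for column-major linear indexing are the running product of the
# output dims; computing them once avoids repeating the multiply chain for
# every element.
make_strides(dims::NTuple{N,Int}) where {N} =
    ntuple(d -> prod(dims[1:d-1]; init=1), Val(N))

outdims = (2, 3, 4)
strides_ = make_strides(outdims)   # (1, 2, 6)

# Linear index of Cartesian subscripts I via the precomputed strides:
lin(I) = 1 + sum(ntuple(d -> (I[d] - 1) * strides_[d], length(I)))

lin((2, 3, 4)) == LinearIndices(outdims)[2, 3, 4]  # both give 24
```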

@BioTurboNick
Contributor Author

BioTurboNick commented Mar 27, 2026

I think the consequences are still not entirely evaluated. e.g. I'm seeing this regression:

julia> using BenchmarkTools

julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  321.659 ns (6 allocations: 544 bytes) # master
  488.887 ns (10 allocations: 736 bytes) # PR

it might be easier to merge individual independent pieces of this PR. for example like the precomputed cumprod seems like an obvious strict win

So, it turns out this is what the @inline was for. 🙃

julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf] (caches not reused: 1 for different Julia build configuration)
Precompiling BenchmarkTools finished.
  9 dependencies successfully precompiled in 24 seconds. 8 already precompiled.

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  626.927 ns (10 allocations: 736 bytes)

julia> function Base.setindex_shape_check(X::AbstractArray, I::Integer...)
           @inline
           li = ndims(X)
           lj = length(I)
           i = j = 1
           while true
               ii = length(axes(X,i))
               jj = I[j]
               if i == li || j == lj
                   while i < li
                       i += 1
                       ii *= length(axes(X,i))
                   end
                   while j < lj
                       j += 1
                       jj *= I[j]
                   end
                   if ii != jj
                       Base.throw_setindex_mismatch(X, I)
                   end
                   return
               end
               if ii == jj
                   i += 1
                   j += 1
               elseif ii == 1
                   i += 1
               elseif jj == 1
                   j += 1
               else
                   Base.throw_setindex_mismatch(X, I)
               end
           end
       end

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  209.506 ns (6 allocations: 544 bytes)

@adienes
Member

adienes commented Mar 27, 2026

and after #59025 (on top of this PR), it would be way faster still

julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  107.124 ns (6 allocations: 544 bytes)

As I understand it, this PR contains four mostly independent optimizations:

  • cat_similar + hvncat_fill! becomes a reshape call
  • precomputing cumprod
  • skip work for empty arrays (if !any(iszero, outdims))
  • the block-copying for length(a) > 4

The first three seem like pretty safe improvements; it's the last of these that strikes me as more fragile and harder to evaluate.

For example, endindex = CartesianIndex(ntuple(i -> offsets[i] + cat_size(a, i), Val(N))) may start allocating where previously the loop was non-allocating. Also, the threshold of length(a) > 4 seems unlikely to be uniformly appropriate across different dimensionalities and array types, given that this is the method for all AbstractArrays. And what about AbstractArray types without block-copying fast paths? Since those will fall back to iterating over the range, will there be more overhead?

@BioTurboNick
Contributor Author

BioTurboNick commented Mar 30, 2026

How will #61426 interact with this?

What would you recommend I look at to better test the block copying? Also, I mentioned earlier restricting the for-loop optimization to Array specifically, to ensure this code works for GPUArrays. If different array types need their own logic here, would it make sense to factor it out as part of the array interface?

