
Fix performance regression in hvcat of simple matrices #57422

Open
BioTurboNick wants to merge 12 commits into JuliaLang:master from BioTurboNick:fix-perf-hvcat-mats

Conversation

@BioTurboNick
Contributor

@BioTurboNick BioTurboNick commented Feb 15, 2025

As pointed out by @Zentrik here, recent Nanosoldier runs suggested a significant performance regression for simple hvcats as a result of #39729.

I revisited the code and determined that the main cause was that typed_hvncat iterated over each element and had to calculate the linear index for each one, resulting in many multiplication and addition operations per element. I realized that CartesianIndices could be used to restore the copy-as-a-whole pattern that typed_hvcat used, while retaining generality for arbitrary dimensions.
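The copy-as-a-whole pattern can be sketched roughly like this (a minimal illustration with a hypothetical blockcopy! helper; the actual Base internals differ):

```julia
# Hypothetical sketch: place array `a` into `A` as one block at `offsets`,
# using CartesianIndices instead of computing a linear index per element.
function blockcopy!(A::AbstractArray, a::AbstractArray, offsets::NTuple{N,Int}) where {N}
    # Destination region: each axis of `a`, shifted by its offset.
    dest = CartesianIndices(ntuple(i -> offsets[i] .+ axes(a, i), Val(N)))
    A[dest] = a  # whole-block assignment, as typed_hvcat effectively did
    return A
end

A = zeros(Int, 4, 4)
blockcopy!(A, [1 2; 3 4], (2, 2))  # fills A[3:4, 3:4]
```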

As I recall, a limitation when I wrote the hvncat code was that certain features were not available during compiler bootstrapping, requiring fully manual indexing. Now that the compiler has been made a stdlib, I believe this PR has become possible.

Before merging I would also want to check that I haven't hurt hvncat performance at all. EDIT: Done.

This should ideally be marked for 1.12 backport.

@KristofferC KristofferC added the "performance (Must go faster)" and "backport 1.12 (Change should be backported to release-1.12)" labels Feb 15, 2025
@BioTurboNick
Contributor Author

BioTurboNick commented Feb 15, 2025

I don't think the test failure is related? It occurred while testing the Profile module... EDIT: Yep, not related.

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 15, 2025

Unfortunately, there's extra overhead for everything else this PR wasn't intended to address, mostly because getindex with CartesianIndices relies on slow integer division via _ind2sub.
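For context, converting a linear index back to N-d subscripts (what _ind2sub does) costs a div/rem pair per dimension for every element. A rough sketch of the conversion, not the Base implementation:

```julia
# Minimal sketch of linear-index -> subscripts conversion (cf. Base._ind2sub):
function lin2sub(dims::NTuple{N,Int}, l::Int) where {N}
    l -= 1
    ntuple(N) do i
        d = dims[i]
        r = rem(l, d)   # one integer div/rem pair per dimension,
        l = div(l, d)   # paid for every element in the hot loop
        r + 1
    end
end

lin2sub((3, 4), 7)                  # (1, 3)
Tuple(CartesianIndices((3, 4))[7])  # (1, 3), the same conversion
```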

EDIT: Ugh, there seems to be an annoying trade-off in performance. I'll need to explore further.

@KristofferC KristofferC mentioned this pull request Feb 15, 2025
31 tasks
@BioTurboNick
Contributor Author

BioTurboNick commented Feb 16, 2025

I believe I got it. The overhead of the block copy was too much for small arrays, so I added a branch to use the original loop for those. The crossover point seemed to be around 4-8 elements, so I branched at >4.
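The branch described above might look roughly like this (illustrative names; the >4 threshold is from this comment, and the real PR code differs):

```julia
# Sketch: choose block copy vs. per-element loop based on input size.
function place!(A::AbstractArray, a::AbstractArray, offsets::NTuple{N,Int}) where {N}
    dest = CartesianIndices(ntuple(i -> offsets[i] .+ axes(a, i), Val(N)))
    if length(a) > 4
        A[dest] = a                 # block copy: setup cost is amortized
    else
        for (I, v) in zip(dest, a)  # tiny inputs: the plain loop is cheaper
            A[I] = v
        end
    end
    return A
end
```

Both branches produce identical results; only the copying strategy differs.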

Two other aspects addressed:

  • 1d arrays of pure numbers were a bit slow compared with cat, so I adopted its approach
  • Identified a significant performance reduction in an important case (see below), and found unusual time spent in setindex_shape_check. Adding @inline eliminated the bottleneck entirely, though could that be a symptom of a broader regression?
const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1
e17() = [x ;;; x ;;;; y ;;; y] # 99.356 ns (6 allocations: 544 bytes), was 4x slower with many more allocations

EDIT: There was one trade-off I didn't find an optimal solution for, and I settled on resolving it in favor of all-arrays as the more common case (no change from master). If the elements to cat are all arrays, then the dimension calculation in _typed_hvncat_dims is more efficient iterating over eachindex of the tuple and indexing into it. If the elements are a mixture of arrays and scalars, then iterating over the elements with enumerate is more efficient. If the strategies are swapped, there's substantial overhead from indexing into the tuple (mixed arrays and scalars) or from the iteration itself (all arrays). Ultimately not a big impact, but a bit of a gripe that the compiler can be fickle like that.
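The two iteration styles at issue can be sketched as follows (illustrative only; the cat_size-style handling of scalars is simplified here, relying on size(x::Number, d) == 1):

```julia
as = ([1 2; 3 4], [5 6; 7 8], 9)   # a mixed tuple of arrays and a scalar

# Style 1: iterate indices and index into the tuple
#          (faster when every element is an array)
rows_by_index(t) = sum(i -> size(t[i], 1), eachindex(t))

# Style 2: iterate the elements themselves
#          (faster for mixed arrays and scalars)
rows_by_element(t) = sum(a -> size(a, 1), t)

rows_by_index(as) == rows_by_element(as) == 5
```

Both compute the same dimension total; which one the compiler handles well depends on the tuple's element types, which is the fickleness noted above.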

Member

@vtjnash vtjnash left a comment


SGTM

@vtjnash
Member

vtjnash commented Oct 16, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your job failed.

@BioTurboNick
Contributor Author

Is the nanosoldier failure something to do with the PR, or does it just need to be rerun?

@BioTurboNick
Contributor Author

@vtjnash - should nanosoldier be rerun, or is something wrong that I need to fix?

@vtjnash
Member

vtjnash commented Nov 21, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your job failed.

@vtjnash
Member

vtjnash commented Nov 21, 2025

It is a bug in BaseBenchmarks:

      From worker 3:    ERROR: LoadError: UndefVarError: `WorldView` not defined in `BaseBenchmarks.InferenceBenchmarks`
      From worker 3:    Suggestion: this global was defined as `Compiler.WorldView` but not assigned a value.
      From worker 3:    Stacktrace:
      From worker 3:      [1] top-level scope
      From worker 3:        @ /home/nanosoldier/.julia/dev/BaseBenchmarks/src/inference/InferenceBenchmarks.jl:88

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@BioTurboNick
Contributor Author

Thanks for getting it running!

I'm guessing some of these are noise, and that the high-% ones indicate higher-variance tests, but these stand out for me to check into:

| test | time | memory |
| --- | --- | --- |
| ["array", "cat", ("catnd", 5)] | 0.97 (5%) | 0.93 (1%) ✅ |
| ["array", "cat", ("catnd_setind", 5)] | 0.51 (5%) ✅ | 0.89 (1%) ✅ |
| ["array", "cat", ("hvcat", 5)] | 0.64 (5%) ✅ | 1.00 (1%) |
| ["array", "cat", ("hvcat", 500)] | 0.53 (5%) ✅ | 1.00 (1%) |
| ["sparse", "transpose", ("adjoint", "(20000, 10000)")] | 0.59 (30%) ✅ | 1.00 (1%) |
| ["union", "array", ("skipmissing", "perf_sumskipmissing", "Union{Nothing, Int64}", 0)] | 0.80 (5%) ✅ | 1.00 (1%) |
| ["find", "findprev", ("Vector{Bool}", "50-50")] | 3.21 (5%) ❌ | 1.00 (1%) |
| ["scalar", "atan2", ("x one", "Float32")] | 1.42 (5%) ❌ | 1.00 (1%) |
| ["scalar", "intfuncs", ("# 6", "UInt64", "+")] | 1.57 (25%) ❌ | 1.00 (1%) |
| ["sparse", "transpose", ("adjoint", "(20000, 20000)")] | 8.45 (30%) ❌ | 1.00 (1%) |
| ["tuple", "reduction", ("sum", "(8, 8)")] | 1.34 (5%) ❌ | 1.00 (1%) |

@vtjnash
Member

vtjnash commented Feb 18, 2026

That higher-variance one is a constant-folded expression, so those just test the reliability of the CPU branch predictor.

@BioTurboNick
Contributor Author

None of the benchmark regressions of concern validated locally, so I think this PR is good to go? I saw ratios of 0.9-1.10 for all of the questionable ones between master and this branch rebased onto master, and I don't think they touch the cat code anyway.

Summary: ~2x speedup for a range of hvcat-family operations.

@adienes
Member

adienes commented Feb 19, 2026

I see a few regressions locally:

master:

julia> @benchmark vcat(x, y) setup=(x=rand(1, 100); y=rand(1, 100))
BenchmarkTools.Trial: 10000 samples with 192 evaluations per sample.
 Range (min … max):  508.010 ns … 50.128 μs  ┊ GC (min … max):  0.00% … 98.55%
 Time  (median):     533.505 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   679.812 ns ±  1.098 μs  ┊ GC (mean ± σ):  11.34% ± 11.14%

  █▃▃▂             ▁▃▂                                         ▁
  ████▅▄▄▃▃▁▃▁▃▁▃▁▃███▇▇▆▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆█▄▇ █
  508 ns        Histogram: log(frequency) by time      3.65 μs <

 Memory estimate: 1.64 KiB, allocs estimate: 2.

julia> @benchmark vcat(x, y) setup=(x=rand(2, 2); y=rand(2, 2))
BenchmarkTools.Trial: 10000 samples with 986 evaluations per sample.
 Range (min … max):  53.007 ns …  11.198 μs  ┊ GC (min … max):  0.00% … 98.34%
 Time  (median):     56.499 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   78.872 ns ± 253.613 ns  ┊ GC (mean ± σ):  14.23% ±  5.81%

  ▃█▇▅▂    ▁▁                                  ▁▂▂▂▂▂▂▂▁       ▂
  ██████▇▇▇███▇▇▇▇▇▆▇▆▆▆▆▇▇▆▆▇▅▄▃▁▁▁▁▁▃▁▁▁▃▁▁▁██████████▇▇▇▆▇▇ █
  53 ns         Histogram: log(frequency) by time       150 ns <

 Memory estimate: 144 bytes, allocs estimate: 2.

julia> using SparseArrays

julia> @benchmark vcat(x, y) setup=(x=sprand(10, 10, 0.3); y=sprand(10, 10, 0.3))
BenchmarkTools.Trial: 10000 samples with 189 evaluations per sample.
 Range (min … max):  494.032 ns … 63.577 μs  ┊ GC (min … max):  0.00% … 98.08%
 Time  (median):     627.537 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   793.259 ns ±  1.513 μs  ┊ GC (mean ± σ):  13.05% ±  8.97%

   ▃▄▅▅██▇▆▆▆▆▅▄▃▃▃▃▂▂▁▁         ▁ ▁▂▂▁▂▂▁▁                    ▃
  ▇████████████████████████▇█▇▇▇████████████████▆▇▆▆▃▅▅▄▃▄▁▁▁▃ █
  494 ns        Histogram: log(frequency) by time      1.58 μs <

 Memory estimate: 1008 bytes, allocs estimate: 9.

pr:

julia> @benchmark vcat(x, y) setup=(x=rand(1, 100); y=rand(1, 100))
BenchmarkTools.Trial: 10000 samples with 177 evaluations per sample.
 Range (min … max):  658.831 ns … 78.132 μs  ┊ GC (min … max): 0.00% … 97.60%
 Time  (median):     681.113 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   827.791 ns ±  1.256 μs  ┊ GC (mean ± σ):  9.59% ± 11.13%

  █▃▃                ▂▂                                      ▁ ▁
  ███▆▃▄▃▃▁▁▁▁▇▃▄▁▇█████▇▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
  659 ns        Histogram: log(frequency) by time      3.59 μs <

 Memory estimate: 1.64 KiB, allocs estimate: 2.

julia> @benchmark vcat(x, y) setup=(x=rand(2, 2); y=rand(2, 2))
BenchmarkTools.Trial: 10000 samples with 981 evaluations per sample.
 Range (min … max):  59.910 ns …  10.886 μs  ┊ GC (min … max):  0.00% … 98.93%
 Time  (median):     64.135 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   84.042 ns ± 227.515 ns  ┊ GC (mean ± σ):  12.12% ±  6.34%

  ▄▅█▆▂▁                                          ▁▂▃▂▁▂▂▂     ▂
  ████████▇██▇▇████▇▆▆▅▅▅▃▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▄▄▇█████████▆▆▆ █
  59.9 ns       Histogram: log(frequency) by time       153 ns <

 Memory estimate: 144 bytes, allocs estimate: 2.

julia> using SparseArrays

julia> @benchmark vcat(x, y) setup=(x=sprand(10, 10, 0.3); y=sprand(10, 10, 0.3))
BenchmarkTools.Trial: 10000 samples with 163 evaluations per sample.
 Range (min … max):  648.718 ns … 70.167 μs  ┊ GC (min … max):  0.00% … 98.16%
 Time  (median):     762.248 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   910.712 ns ±  1.446 μs  ┊ GC (mean ± σ):  10.24% ±  9.01%

  ██▄▃▁▂▄▃▁                                                    ▂
  █████████▇▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▄ █
  649 ns        Histogram: log(frequency) by time      6.77 μs <

 Memory estimate: 944 bytes, allocs estimate: 9.

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 19, 2026

I see a few regressions locally:
[...]

I admit I'm confused, because I'm not touching the vcat path here...

These examples go through vcat in SparseArrays (🏴‍☠️) and end up in Base._typed_vcat here:

function _typed_vcat(::Type{T}, A::AbstractVecOrTuple{AbstractVecOrMat}) where T

EDIT: Oh perhaps it's related to the @inline in setindex_shape_check. I'll verify.

@adienes
Member

adienes commented Feb 19, 2026

Well, now I'm also confused, because the performance is not stable across Julia sessions (on master as well), so those "regressions" were just noise. But that is a lot of noise.

These examples go through vcat in SparseArrays (🏴‍☠️)

In this case I was running before loading SparseArrays

@BioTurboNick
Contributor Author

I couldn't reproduce these slowdowns, unfortunately. I reverted the @inline because it doesn't seem to impact anything; I assume I saw a reason to do so originally, but I don't recall what it was.

@BioTurboNick
Contributor Author

I noticed that this PR contains the start of a fix for this regression: JuliaGPU/GPUArrays.jl#672

I would just need to gate the small-array scalar optimization on isa(a, Array), but I can save that for a follow-up PR.
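A possible shape for that gate, as a purely hypothetical follow-up sketch (the predicate name is invented):

```julia
# Hypothetical predicate: take the small-array scalar loop only for plain
# Array, so wrapper and GPU array types (which may lack cheap scalar
# indexing) always stay on the block-copy path.
use_scalar_loop(a::AbstractArray) = a isa Array && length(a) <= 4

use_scalar_loop([1 2; 3 4])              # true: small dense Array
use_scalar_loop(view([1 2; 3 4], :, :))  # false: SubArray, not a plain Array
```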

@BioTurboNick
Contributor Author

@KristofferC can we please get this into 1.12?

@adienes
Member

adienes commented Mar 26, 2026

I think the consequences are still not entirely evaluated; e.g., I'm seeing this regression:

julia> using BenchmarkTools

julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  321.659 ns (6 allocations: 544 bytes) # master
  488.887 ns (10 allocations: 736 bytes) # PR

It might be easier to merge individual independent pieces of this PR; for example, the precomputed cumprod seems like an obvious strict win.
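The precomputed-cumprod piece referred to here can be sketched like this (an illustration of the idea only, not the PR's code; make_strides and lin are invented names):

```julia
# Strides for column-major linear indexing are the running product of the
# output dims; computing them once avoids repeating the multiply chain for
# every element.
make_strides(dims::NTuple{N,Int}) where {N} =
    ntuple(d -> prod(dims[1:d-1]; init=1), Val(N))

outdims = (2, 3, 4)
strides_ = make_strides(outdims)   # (1, 2, 6)

# Linear index of Cartesian subscripts I via the precomputed strides:
lin(I) = 1 + sum(ntuple(d -> (I[d] - 1) * strides_[d], length(I)))

lin((2, 3, 4)) == LinearIndices(outdims)[2, 3, 4]  # both give 24
```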

@BioTurboNick
Contributor Author

BioTurboNick commented Mar 27, 2026

I think the consequences are still not entirely evaluated. e.g. I'm seeing this regression:

julia> using BenchmarkTools

julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  321.659 ns (6 allocations: 544 bytes) # master
  488.887 ns (10 allocations: 736 bytes) # PR

it might be easier to merge individual independent pieces of this PR. for example like the precomputed cumprod seems like an obvious strict win

So, it turns out this is what the @inline was for. 🙃

julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf] (caches not reused: 1 for different Julia build configuration)
Precompiling BenchmarkTools finished.
  9 dependencies successfully precompiled in 24 seconds. 8 already precompiled.

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  626.927 ns (10 allocations: 736 bytes)

julia> function Base.setindex_shape_check(X::AbstractArray, I::Integer...)
           @inline
           li = ndims(X)
           lj = length(I)
           i = j = 1
           while true
               ii = length(axes(X,i))
               jj = I[j]
               if i == li || j == lj
                   while i < li
                       i += 1
                       ii *= length(axes(X,i))
                   end
                   while j < lj
                       j += 1
                       jj *= I[j]
                   end
                   if ii != jj
                       Base.throw_setindex_mismatch(X, I)
                   end
                   return
               end
               if ii == jj
                   i += 1
                   j += 1
               elseif ii == 1
                   i += 1
               elseif jj == 1
                   j += 1
               else
                   Base.throw_setindex_mismatch(X, I)
               end
           end
       end

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  209.506 ns (6 allocations: 544 bytes)

@adienes
Member

adienes commented Mar 27, 2026

and after #59025 (on top of this PR), it would be way faster still

julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  107.124 ns (6 allocations: 544 bytes)

As I understand it, this PR contains four mostly independent optimizations:

  • cat_similar + hvncat_fill! becomes a reshape call
  • precomputing cumprod
  • skip work for empty arrays (if !any(iszero, outdims))
  • the block-copying for length(a) > 4

The first three seem like pretty safe improvements; it's the last of these that strikes me as more fragile and harder to evaluate.

For example, endindex = CartesianIndex(ntuple(i -> offsets[i] + cat_size(a, i), Val(N))) may start allocating where previously the loop was non-allocating. Also, the threshold of length(a) > 4 seems unlikely to be uniformly appropriate across different dimensionalities and array types, given that this is the method for all AbstractArrays. And what about AbstractArray types without block-copying fast paths? Since those will fall back to iterating over the range, will there be more overhead?

@BioTurboNick
Contributor Author

BioTurboNick commented Mar 30, 2026

How will #61426 interact with this?

What would you recommend I look at to better test the block copying? Also, I mentioned earlier restricting the for-loop optimization to Array specifically, to ensure this code works for GPUArrays. If different array types need their own logic here, would it make sense to factor it out as part of the array interface?

