Fix @simd for non 1 step CartesianPartition #42736

DilumAluthge merged 3 commits into JuliaLang:master from

Conversation
@nanosoldier
It looks like the tests mostly failed because OpenBLAS tries to start too many threads and exhausts kernel resources; that is not related to this PR.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Benchmark code:

```julia
bench1(f::OP, a, b = ()) where OP = begin
    @inbounds @simd for i in eachindex(IndexCartesian(), a)
        a[i] = f(map(x -> (@inbounds x[i]), b)...)
    end
end
bench2(f::OP, a, b = ()) where OP = begin
    iter = view(eachindex(IndexCartesian(), a), 2:length(a) - 1)
    @inbounds @simd for i in iter
        a[i] = f(map(x -> (@inbounds x[i]), b)...)
    end
end
println("----------fill!------------")
N = 2^16
for n = 2:4
    a = zeros(N>>(3n-3), ntuple(_ -> 8, n - 1)...)
    println("N = ", n)
    @btime bench1(Returns(1), $a)
    @btime bench2(Returns(1), $a)
end
println("---------- .* ------------")
N = 2^16
for n = 2:4
    a = zeros(N>>(3n-3), ntuple(_ -> 8, n - 1)...)
    b = randn(size(a))
    c = randn(size(a))
    println("N = ", n)
    @btime bench1(*, $a, ($c, $b))
    @btime bench2(*, $a, ($c, $b))
end
```

Result:

```
----------fill!------------
N = 2
  10.800 μs (0 allocations: 0 bytes)
  11.100 μs (0 allocations: 0 bytes)
N = 3
  10.700 μs (0 allocations: 0 bytes)
  10.900 μs (0 allocations: 0 bytes)
N = 4
  11.100 μs (0 allocations: 0 bytes)
  10.800 μs (0 allocations: 0 bytes)
---------- .* ------------
N = 2
  22.700 μs (0 allocations: 0 bytes)
  22.800 μs (0 allocations: 0 bytes)
N = 3
  22.700 μs (0 allocations: 0 bytes)
  22.800 μs (0 allocations: 0 bytes)
N = 4
  22.800 μs (0 allocations: 0 bytes)
  23.300 μs (0 allocations: 0 bytes)
```
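As a side note on what `bench2` iterates (a hedged reading of the benchmark, not something stated in the PR): `eachindex(IndexCartesian(), a)` returns a `CartesianIndices`, and taking a `view` of it with a linear range produces the same kind of linear-indexed wrapper as a `CartesianPartition`:

```julia
a = zeros(4, 3)
ci = eachindex(IndexCartesian(), a)   # CartesianIndices((4, 3))
v = view(ci, 2:length(a)-1)           # linear view: drops the first and last index
first(v)                              # CartesianIndex(2, 1)
last(v)                               # CartesianIndex(3, 3)
```

So `bench2` measures the overhead of `@simd` over such a view relative to iterating the full `CartesianIndices` in `bench1`.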
@vtjnash I noticed that a few commits were pushed after you added the
But anyway, all CI is green (except for
@nanosoldier
@N5N3 Could you also add some benchmarks for this to BaseBenchmarks.jl?
I'd like to, but I'm quite unfamiliar with BaseBenchmarks.jl.

Yes please. This is quite sensitive to future code changes or unrelated changes in other parts of the pipeline.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
It seems the nanosoldier perf test also failed, whereas merge-me means merge only if tests are good.
IIUC,

Anyway, I agree that performance should be tested before we merge this PR.
It takes about 30 ns to generate a 2d reshaped By calling the 3-arg
Looks like all tests failed because of a network problem.
Just noticed that reused On the other hand, I still think we'd better make
force-pushed ba8e114 to 86a627a (Compare)
force-pushed 86a627a to 8fef013 (Compare)
force-pushed 8fef013 to 86d4375 (Compare)
Bump?
revert changes in reshapedarray.jl; use Iterators.rest
move `@inbounds` outside the loop body. see JuliaLang#38086
`first(Base.axes1())` works well, but `length(::StepRange)` won't be omitted by LLVM if `step` is unknown at compile time.
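The point about `length(::StepRange)` can be illustrated with a hedged sketch (the helper names below are mine, not Base code): a `UnitRange` length is a single subtraction, while a `StepRange` length needs an integer division by a step that may only be known at run time, so LLVM cannot fold it away.

```julia
# Illustrative helpers, not Base internals.
unit_len(start, stop) = stop - start + 1                  # UnitRange: one subtraction
step_len(start, step_, stop) = div(stop - start, step_) + 1  # StepRange: needs a div

unit_len(1, 10)      # 10, matches length(1:10)
step_len(1, 2, 10)   # 5, matches length(1:2:10)
```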
force-pushed 86d4375 to 90d0ea1 (Compare)
Rebased to get another CI run.
Thanks for picking this up.
… of `ReshapedArray` (#43518)

This performance difference was found when working on #42736. Currently, our `ReshapedArray` uses a stride-based `MultiplicativeInverse` to speed up index transformation. For example, for `a::AbstractArray{T,3}` and `b = vec(a)`, the index transformation is equivalent to:

```julia
offset = i - 1                         # b[i]
d1, r1 = divrem(offset, stride(a, 3))  # stride(a, 3) = size(a, 1) * size(a, 2)
d2, r2 = divrem(r1, stride(a, 2))      # stride(a, 2) = size(a, 1)
CartesianIndex(r2 + 1, d2 + 1, d1 + 1) # a has one-based axes
```

(Each `stride` is replaced with a `MultiplicativeInverse` to accelerate `divrem`.)

This PR replaces the above machinery with:

```julia
offset = i - 1
d1, r1 = divrem(offset, size(a, 1))
d2, r2 = divrem(d1, size(a, 2))
CartesianIndex(r1 + 1, r2 + 1, d2 + 1)
```

For random access they should have the same computational cost, but for sequential access, like `sum(b)`, the size-based transformation seems faster. To avoid a bottleneck from IO, use `reshape(::CartesianIndices, x...)` to benchmark:

```julia
f(x) = let r = 0
    for i in eachindex(x)
        @inbounds r |= +(x[i].I...)
    end
    r
end
a = CartesianIndices((99,100,101));
@btime f(vec($a));                # 2.766 ms --> 2.591 ms
@btime f(reshape($a,990,1010));   # 3.412 ms --> 2.626 ms
@btime f(reshape($a,33,300,101)); # 3.422 ms --> 2.342 ms
```

I haven't looked into the reason for this performance difference. Besides the acceleration, this also makes it possible to reuse the `MultiplicativeInverse` in some cases (like #42736), so I think it might be useful?

---------

Co-authored-by: Andy Dienes <51664769+adienes@users.noreply.github.com>
Co-authored-by: Andy Dienes <andydienes@gmail.com>
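To sanity-check the size-based transformation above, here is a hedged sketch (the helper name `size_based_index` is mine, not Base API) verifying that it agrees with `CartesianIndices` for a small 3d array:

```julia
# Size-based linear-to-Cartesian transformation for a 3d array with
# one-based axes, as described in the commit message above.
function size_based_index(a::AbstractArray{<:Any,3}, i::Integer)
    offset = i - 1
    d1, r1 = divrem(offset, size(a, 1))
    d2, r2 = divrem(d1, size(a, 2))
    CartesianIndex(r1 + 1, r2 + 1, d2 + 1)
end

a = reshape(1:24, 2, 3, 4)
all(size_based_index(a, i) == CartesianIndices(a)[i] for i in 1:length(a))  # true
```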
Previous code assumed that the step of each axis is 1, which is no longer valid after #37829. This PR adds support for non-1-step CartesianPartition.

Before

After

Besides the above fix, this PR also simplifies the inner loop with a `Generator`-based outer range. It accelerates 3d/4d CartesianPartition a little on my desktop. Some benchmarks:
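(The PR's own benchmarks are elided above; the following is only a hedged sketch of the case being fixed, with names of my choosing.) Since #37829, `CartesianIndices` can carry non-unit steps, and a linear `view` of one is the shape of object `@simd` sees as a `CartesianPartition`:

```julia
function sum_first_coord(p)
    s = 0
    @inbounds @simd for i in p
        s += i[1]
    end
    s
end

ci = CartesianIndices((1:2:7, 1:3))  # step-2 first axis, allowed since #37829
p = view(ci, 2:length(ci)-1)         # drop the first and last index
sum_first_coord(p)                   # 40: full sum 48 minus ci[1][1] and ci[end][1]
```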