Fix @simd for non 1 step CartesianPartition #42736

DilumAluthge merged 3 commits into JuliaLang:master from

Conversation
@nanosoldier
It looks like the tests mostly failed because OpenBLAS tries to start too many threads and exhausts kernel resources; that is not related to this PR.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Benchmark code:

```julia
bench1(f::OP, a, b = ()) where OP = begin
    @inbounds @simd for i in eachindex(IndexCartesian(), a)
        a[i] = f(map(x -> (@inbounds x[i]), b)...)
    end
end
bench2(f::OP, a, b = ()) where OP = begin
    iter = view(eachindex(IndexCartesian(), a), 2:length(a) - 1)
    @inbounds @simd for i in iter
        a[i] = f(map(x -> (@inbounds x[i]), b)...)
    end
end
println("----------fill!------------")
N = 2^16
for n = 2:4
    a = zeros(N>>(3n-3), ntuple(_ -> 8, n - 1)...)
    println("N = ", n)
    @btime bench1(Returns(1), $a)
    @btime bench2(Returns(1), $a)
end
println("---------- .* ------------")
N = 2^16
for n = 2:4
    a = zeros(N>>(3n-3), ntuple(_ -> 8, n - 1)...)
    b = randn(size(a))
    c = randn(size(a))
    println("N = ", n)
    @btime bench1(*, $a, ($c, $b))
    @btime bench2(*, $a, ($c, $b))
end
```

Result:

```
----------fill!------------
N = 2
  10.800 μs (0 allocations: 0 bytes)
  11.100 μs (0 allocations: 0 bytes)
N = 3
  10.700 μs (0 allocations: 0 bytes)
  10.900 μs (0 allocations: 0 bytes)
N = 4
  11.100 μs (0 allocations: 0 bytes)
  10.800 μs (0 allocations: 0 bytes)
---------- .* ------------
N = 2
  22.700 μs (0 allocations: 0 bytes)
  22.800 μs (0 allocations: 0 bytes)
N = 3
  22.700 μs (0 allocations: 0 bytes)
  22.800 μs (0 allocations: 0 bytes)
N = 4
  22.800 μs (0 allocations: 0 bytes)
  23.300 μs (0 allocations: 0 bytes)
```
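As a side note on what `bench2` iterates (a hedged reading of the benchmark, not something stated in the PR): `eachindex(IndexCartesian(), a)` returns a `CartesianIndices`, and taking a `view` of it with a linear range produces the same kind of linear-indexed wrapper as a `CartesianPartition`:

```julia
a = zeros(4, 3)
ci = eachindex(IndexCartesian(), a)   # CartesianIndices((4, 3))
v = view(ci, 2:length(a)-1)           # linear view: drops the first and last index
first(v)                              # CartesianIndex(2, 1)
last(v)                               # CartesianIndex(3, 3)
```

So `bench2` measures the overhead of `@simd` over such a view relative to iterating the full `CartesianIndices` in `bench1`.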
@vtjnash I noticed that a few commits were pushed after you added the
But anyway, all CI is green (except for
@nanosoldier
@N5N3 Could you also add some benchmarks for this to BaseBenchmarks.jl?
I'd like to, but I'm quite unfamiliar with BaseBenchmarks.jl.

Yes please. This is quite sensitive to future code changes or unrelated changes in other parts of the pipeline.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
It seems the nanosoldier perf test also failed, whereas merge-me means merge only if tests are good.
IIUC,

Anyway, I agree that performance should be tested before we merge this PR.
It takes about 30 ns to generate a 2d reshaped By calling the 3-arg
Looks like all tests failed because of a network problem.
Just noticed that reused On the other hand, I still think we'd better make
force-pushed ba8e114 to 86a627a (Compare)
force-pushed 86a627a to 8fef013 (Compare)
force-pushed 8fef013 to 86d4375 (Compare)
Bump?
revert changes in reshapedarray.jl; use Iterators.rest
move `@inbounds` outside the loop body. see JuliaLang#38086
`first(Base.axes1())` works well, but `length(::StepRange)` won't be omitted by LLVM if `step` is unknown at compile time.
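The point about `length(::StepRange)` can be illustrated with a hedged sketch (the helper names below are mine, not Base code): a `UnitRange` length is a single subtraction, while a `StepRange` length needs an integer division by a step that may only be known at run time, so LLVM cannot fold it away.

```julia
# Illustrative helpers, not Base internals.
unit_len(start, stop) = stop - start + 1                  # UnitRange: one subtraction
step_len(start, step_, stop) = div(stop - start, step_) + 1  # StepRange: needs a div

unit_len(1, 10)      # 10, matches length(1:10)
step_len(1, 2, 10)   # 5, matches length(1:2:10)
```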
force-pushed 86d4375 to 90d0ea1 (Compare)
Rebased to get another CI run.
Thanks for picking this up.
… of `ReshapedArray` (#43518)

This performance difference was found when working on #42736. Currently, our `ReshapedArray` uses a stride-based `MultiplicativeInverse` to speed up index transformation. For example, for `a::AbstractArray{T,3}` and `b = vec(a)`, the index transformation is equivalent to:

```julia
offset = i - 1                         # b[i]
d1, r1 = divrem(offset, stride(a, 3))  # stride(a, 3) = size(a, 1) * size(a, 2)
d2, r2 = divrem(r1, stride(a, 2))      # stride(a, 2) = size(a, 1)
CartesianIndex(r2 + 1, d2 + 1, d1 + 1) # a has one-based axes
```

(Each `stride` is replaced with a `MultiplicativeInverse` to accelerate `divrem`.)

This PR replaces the above machinery with:

```julia
offset = i - 1
d1, r1 = divrem(offset, size(a, 1))
d2, r2 = divrem(d1, size(a, 2))
CartesianIndex(r1 + 1, r2 + 1, d2 + 1)
```

For random access they should have the same computational cost, but for sequential access, like `sum(b)`, the size-based transformation seems faster. To avoid a bottleneck from IO, use `reshape(::CartesianIndices, x...)` to benchmark:

```julia
f(x) = let r = 0
    for i in eachindex(x)
        @inbounds r |= +(x[i].I...)
    end
    r
end
a = CartesianIndices((99,100,101));
@btime f(vec($a));                # 2.766 ms --> 2.591 ms
@btime f(reshape($a,990,1010));   # 3.412 ms --> 2.626 ms
@btime f(reshape($a,33,300,101)); # 3.422 ms --> 2.342 ms
```

I haven't looked into the reason for this performance difference. Besides the acceleration, this also makes it possible to reuse the `MultiplicativeInverse` in some cases (like #42736), so I think it might be useful?

---------

Co-authored-by: Andy Dienes <51664769+adienes@users.noreply.github.com>
Co-authored-by: Andy Dienes <andydienes@gmail.com>
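To sanity-check the size-based transformation above, here is a hedged sketch (the helper name `size_based_index` is mine, not Base API) verifying that it agrees with `CartesianIndices` for a small 3d array:

```julia
# Size-based linear-to-Cartesian transformation for a 3d array with
# one-based axes, as described in the commit message above.
function size_based_index(a::AbstractArray{<:Any,3}, i::Integer)
    offset = i - 1
    d1, r1 = divrem(offset, size(a, 1))
    d2, r2 = divrem(d1, size(a, 2))
    CartesianIndex(r1 + 1, r2 + 1, d2 + 1)
end

a = reshape(1:24, 2, 3, 4)
all(size_based_index(a, i) == CartesianIndices(a)[i] for i in 1:length(a))  # true
```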
Previous code assumed that the step of each axis is 1, which is no longer valid after #37829. This PR adds support for non-1-step CartesianPartition.

Before

After

Besides the above fix, this PR also simplifies the inner loop with a `Generator`-based outer range. It accelerates 3d/4d CartesianPartition a little on my desktop. Some benchmarks:
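(The PR's own benchmarks are elided above; the following is only a hedged sketch of the case being fixed, with names of my choosing.) Since #37829, `CartesianIndices` can carry non-unit steps, and a linear `view` of one is the shape of object `@simd` sees as a `CartesianPartition`:

```julia
function sum_first_coord(p)
    s = 0
    @inbounds @simd for i in p
        s += i[1]
    end
    s
end

ci = CartesianIndices((1:2:7, 1:3))  # step-2 first axis, allowed since #37829
p = view(ci, 2:length(ci)-1)         # drop the first and last index
sum_first_coord(p)                   # 40: full sum 48 minus ci[1][1] and ci[end][1]
```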