I am working on a project that involves some simple array operations on huge arrays. Here is an example:
function singleoperation!(A::Array,B::Array,C::Array)
    @simd for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
I tried to parallelize it to get faster speed. To do so, I am using Distributed and SharedArrays, which only required a small modification to the function above:
@everywhere function paralleloperation!(A::SharedArray,B::SharedArray,C::SharedArray)
    @sync @distributed for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
However, there is no time difference between the two functions even though I am using 4 worker processes (tested on R7-5800X and i7-9750H CPUs). Is there anything I can improve in this code? Thanks a lot! I will post the full testing code below:
using Distributed
addprocs(4)
@everywhere begin
    using SharedArrays
    using BenchmarkTools
end
@everywhere function paralleloperation!(A::SharedArray,B::SharedArray,C::SharedArray)
    @sync @distributed for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end

function singleoperation!(A::Array,B::Array,C::Array)
    @simd for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
N = 128;
A,B,C = fill(0,N,N,N),fill(.2,N,N,N),fill(.3,N,N,N);
AN,BN,CN = SharedArray(fill(0,N,N,N)),SharedArray(fill(.2,N,N,N)),SharedArray(fill(.3,N,N,N));
@benchmark singleoperation!(A,B,C)
BenchmarkTools.Trial: 1612 samples with 1 evaluation.
Range (min … max): 2.582 ms … 9.358 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.796 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.086 ms ± 790.997 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
@benchmark paralleloperation!(AN,BN,CN)
BenchmarkTools.Trial: 1404 samples with 1 evaluation.
Range (min … max): 2.538 ms … 17.651 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.154 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.548 ms ± 1.238 ms ┊ GC (mean ± σ): 0.08% ± 1.65%
As the comments note, this looks like more of a job for multithreading than multiprocessing. The best approach in detail will generally depend on whether you are CPU-bound or memory-bandwidth-bound. With a calculation as simple as the one in the example, it may well be the latter, in which case you will reach a point of diminishing returns from adding additional threads, and may want to turn to something featuring explicit memory modelling, and/or to GPUs.
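To illustrate the multithreading route with just the standard library: since threads share memory (unlike Distributed workers), plain Arrays work directly and no SharedArray is needed. A minimal sketch, assuming Julia was started with multiple threads (e.g. `julia -t 4`):

```julia
using Base.Threads

# Same elementwise kernel as the question, but with Threads.@threads;
# each thread handles a contiguous chunk of the index range.
function threadedoperation!(A, B, C)
    @threads for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
    return C
end
```

This avoids the interprocess-communication overhead of `@distributed` entirely.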
However, one very easy general-purpose approach would be to use the multithreading built into LoopVectorization.jl:
A = rand(10000,10000)
B = rand(10000,10000)
C = zeros(10000,10000)
# Base
function singleoperation!(A,B,C)
    @inbounds @simd for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end

using LoopVectorization

function singleoperation_lv!(A,B,C)
    @turbo for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end

# Multithreaded (make sure you've started Julia with multiple threads)
function threadedoperation_lv!(A,B,C)
    @tturbo for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
which gives us
julia> @benchmark singleoperation!(A,B,C)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
Range (min … max): 163.484 ms … 164.022 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 163.664 ms ┊ GC (median): 0.00%
Time (mean ± σ): 163.701 ms ± 118.397 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▄ ▄▄ ▁ ▁ ▁
▆▁▁▁▁▁▁▁▁▁▁▁▁▆█▆▆█▆██▁█▆▁▆█▁▁▁▆▁▁▁▁▆▁▁▁▁▁▁▁▁█▁▁▁▆▁▁▁▁▁▁▆▁▁▁▁▆ ▁
163 ms Histogram: frequency by time 164 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark singleoperation_lv!(A,B,C)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
Range (min … max): 163.252 ms … 163.754 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 163.408 ms ┊ GC (median): 0.00%
Time (mean ± σ): 163.453 ms ± 130.212 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▃ ▃█▃ █ ▃ ▃
▇▁▁▁▁▁▇▁▁▁▇██▇▁▇███▇▁▁█▇█▁▁▁▁▁▁▁▁█▁▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▇▇▁▇▁▇ ▁
163 ms Histogram: frequency by time 164 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark threadedoperation_lv!(A,B,C)
BenchmarkTools.Trial: 57 samples with 1 evaluation.
Range (min … max): 86.976 ms … 88.595 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 87.642 ms ┊ GC (median): 0.00%
Time (mean ± σ): 87.727 ms ± 439.427 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅ █ ▂ ▂
▅▁▁▁▁██▁▅█▁█▁▁▅▅█▅█▁█▅██▅█▁██▁▅▁▁▅▅▁▅▁▅▁▅▁▅▁▅▁▁▁▅▁█▁▁█▅▁▅▁▅█ ▁
87 ms Histogram: frequency by time 88.5 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Now, the fact that the single-threaded LoopVectorization @turbo version is almost perfectly tied with the single-threaded @inbounds @simd version is to me a hint that we are probably memory-bandwidth-bound here (usually @turbo is notably faster than @inbounds @simd, so the tie suggests that the actual calculation is not the bottleneck) -- in which case the multithreaded version is only helping us by getting us access to a bit more memory bandwidth (though with diminishing returns, assuming there is some main memory bus that can only go so fast regardless of how many cores it can talk to).
To get a bit more insight, let's try making the arithmetic a bit harder:
function singlemoremath!(A,B,C)
    @inbounds @simd for k in eachindex(A)
        C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
    end
end

using LoopVectorization

function singlemoremath_lv!(A,B,C)
    @turbo for k in eachindex(A)
        C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
    end
end

function threadedmoremath_lv!(A,B,C)
    @tturbo for k in eachindex(A)
        C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
    end
end
then sure enough
julia> @benchmark singlemoremath!(A,B,C)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 2.651 s … 2.652 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.651 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.651 s ± 792.423 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.65 s Histogram: frequency by time 2.65 s <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark singlemoremath_lv!(A,B,C)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
Range (min … max): 268.101 ms … 270.072 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 269.016 ms ┊ GC (median): 0.00%
Time (mean ± σ): 269.058 ms ± 467.744 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ █ ▁ ▁ ▁▁█ ▁ ██ ▁ ▁ ▁ ▁ ▁
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁█▁▁███▁█▁▁██▁▁▁█▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█ ▁
268 ms Histogram: frequency by time 270 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark threadedmoremath_lv!(A,B,C)
BenchmarkTools.Trial: 56 samples with 1 evaluation.
Range (min … max): 88.247 ms … 93.590 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 89.325 ms ┊ GC (median): 0.00%
Time (mean ± σ): 89.707 ms ± 1.200 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄ ▁ ▄█ ▄▄▄ ▁ ▄ ▁ ▁ ▁ ▁ ▁▄
█▁█▆██▆▆▆███▆▁█▁█▁▆▁█▆▁█▆▆▁▁▆█▁▁▁▁▁▁█▆██▁▁▆▆▁▁▆▁▁▁▁▁▁▁▁▁▁▆▆ ▁
88.2 ms Histogram: frequency by time 92.4 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Now we're closer to CPU-bound, and threading plus SIMD vectorization make the difference between 2.6 seconds and 90 ms!
If your real problem will be as memory-bound as the example problem, you may want to consider working on a GPU, on a server optimized for memory bandwidth, and/or using a package that puts a lot of effort into memory modelling.
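For the GPU route, one nice property of Julia is that a fused dot-broadcast compiles to a single kernel, so a type-generic one-liner covers both CPU Arrays and CUDA.jl CuArrays. A sketch (the CuArray usage assumes CUDA.jl is installed and an NVIDIA GPU is present; I have not benchmarked it on this problem):

```julia
# One fused pass over memory; on a CuArray the same line becomes one GPU kernel.
elementwise!(C, A, B) = (C .= A .* B ./ (A .+ B); C)

# On an NVIDIA GPU (untested sketch):
#   using CUDA
#   A, B = CUDA.rand(10_000, 10_000), CUDA.rand(10_000, 10_000)
#   C = CUDA.zeros(10_000, 10_000)
#   elementwise!(C, A, B)
```

GPUs typically offer several times the memory bandwidth of a desktop CPU, which is exactly what a kernel like this needs.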
Some other packages you might check out could include Octavian.jl (CPU), Tullio.jl (CPU or GPU), and GemmKernels.jl (GPU).
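Of those, Tullio.jl is the closest drop-in for this kind of elementwise kernel: `@tullio` generates the loop for you and multithreads it automatically for large arrays (and can pick up LoopVectorization's `@turbo` when that package is loaded). A sketch, assuming Tullio.jl is installed:

```julia
using Tullio

# Tullio equivalent of the original loop; threading kicks in automatically
# once the arrays are large enough.
function tulliooperation!(A, B, C)
    @tullio C[k] = A[k] * B[k] / (A[k] + B[k])
    return C
end
```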