I am working on a project that involves some simple array operations on huge arrays. Here is an example:
function singleoperation!(A::Array,B::Array,C::Array)
    @simd for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
I tried to parallelize it to get faster speed. To do so, I am using Distributed and SharedArrays, which only required a small modification to the function above:
@everywhere function paralleloperation!(A::SharedArray,B::SharedArray,C::SharedArray)
    @sync @distributed for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
However, there is no time difference between the two functions even though I am using 4 worker processes (tested on R7-5800X and i7-9750H CPUs). Is there anything I can improve in this code? Thanks a lot! I will post the full testing code below:
using Distributed
addprocs(4)
@everywhere begin
    using SharedArrays
    using BenchmarkTools
end
@everywhere function paralleloperation!(A::SharedArray,B::SharedArray,C::SharedArray)
    @sync @distributed for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end

function singleoperation!(A::Array,B::Array,C::Array)
    @simd for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
N = 128;
A,B,C = fill(0,N,N,N),fill(.2,N,N,N),fill(.3,N,N,N);
AN,BN,CN = SharedArray(fill(0,N,N,N)),SharedArray(fill(.2,N,N,N)),SharedArray(fill(.3,N,N,N));
@benchmark singleoperation!(A,B,C)
BenchmarkTools.Trial: 1612 samples with 1 evaluation.
Range (min … max): 2.582 ms … 9.358 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.796 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.086 ms ± 790.997 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
@benchmark paralleloperation!(AN,BN,CN)
BenchmarkTools.Trial: 1404 samples with 1 evaluation.
Range (min … max): 2.538 ms … 17.651 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.154 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.548 ms ± 1.238 ms ┊ GC (mean ± σ): 0.08% ± 1.65%
As the comments note, this looks like more of a job for multithreading than multiprocessing. The best approach in detail will generally depend on whether you are CPU-bound or memory-bandwidth-bound. With a calculation as simple as the one in the example, it may well be the latter, in which case you will reach a point of diminishing returns from adding additional threads, and may want to turn to something featuring explicit memory modelling, and/or to GPUs.
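To illustrate the multithreading route with just the standard library: since threads share memory (unlike Distributed workers), plain Arrays work directly and no SharedArray is needed. A minimal sketch, assuming Julia was started with multiple threads (e.g. `julia -t 4`):

```julia
using Base.Threads

# Same elementwise kernel as the question, but with Threads.@threads;
# each thread handles a contiguous chunk of the index range.
function threadedoperation!(A, B, C)
    @threads for k in eachindex(A)
        @inbounds C[k] = A[k] * B[k] / (A[k] + B[k])
    end
    return C
end
```

This avoids the interprocess-communication overhead of `@distributed` entirely.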
However, one very easy general-purpose approach would be to use the multithreading built into LoopVectorization.jl:
A = rand(10000,10000)
B = rand(10000,10000)
C = zeros(10000,10000)
# Base
function singleoperation!(A,B,C)
    @inbounds @simd for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end

using LoopVectorization

function singleoperation_lv!(A,B,C)
    @turbo for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end

# Multithreaded (make sure you've started Julia with multiple threads)
function threadedoperation_lv!(A,B,C)
    @tturbo for k in eachindex(A)
        C[k] = A[k] * B[k] / (A[k] + B[k])
    end
end
which gives us
julia> @benchmark singleoperation!(A,B,C)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
Range (min … max): 163.484 ms … 164.022 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 163.664 ms ┊ GC (median): 0.00%
Time (mean ± σ): 163.701 ms ± 118.397 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▄ ▄▄ ▁ ▁ ▁
▆▁▁▁▁▁▁▁▁▁▁▁▁▆█▆▆█▆██▁█▆▁▆█▁▁▁▆▁▁▁▁▆▁▁▁▁▁▁▁▁█▁▁▁▆▁▁▁▁▁▁▆▁▁▁▁▆ ▁
163 ms Histogram: frequency by time 164 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark singleoperation_lv!(A,B,C)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
Range (min … max): 163.252 ms … 163.754 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 163.408 ms ┊ GC (median): 0.00%
Time (mean ± σ): 163.453 ms ± 130.212 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▃ ▃█▃ █ ▃ ▃
▇▁▁▁▁▁▇▁▁▁▇██▇▁▇███▇▁▁█▇█▁▁▁▁▁▁▁▁█▁▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▇▇▁▇▁▇ ▁
163 ms Histogram: frequency by time 164 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark threadedoperation_lv!(A,B,C)
BenchmarkTools.Trial: 57 samples with 1 evaluation.
Range (min … max): 86.976 ms … 88.595 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 87.642 ms ┊ GC (median): 0.00%
Time (mean ± σ): 87.727 ms ± 439.427 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅ █ ▂ ▂
▅▁▁▁▁██▁▅█▁█▁▁▅▅█▅█▁█▅██▅█▁██▁▅▁▁▅▅▁▅▁▅▁▅▁▅▁▅▁▁▁▅▁█▁▁█▅▁▅▁▅█ ▁
87 ms Histogram: frequency by time 88.5 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Now, the fact that the single-threaded LoopVectorization @turbo version is almost perfectly tied with the single-threaded @inbounds @simd version is to me a hint that we are probably memory-bandwidth-bound here (usually @turbo is notably faster than @inbounds @simd, so the tie suggests that the actual calculation is not the bottleneck) -- in which case the multithreaded version is only helping us by getting us access to a bit more memory bandwidth (though with diminishing returns, assuming there is some main memory bus that can only go so fast regardless of how many cores it can talk to).
To get a bit more insight, let's try making the arithmetic a bit harder:
function singlemoremath!(A,B,C)
    @inbounds @simd for k in eachindex(A)
        C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
    end
end

using LoopVectorization

function singlemoremath_lv!(A,B,C)
    @turbo for k in eachindex(A)
        C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
    end
end

function threadedmoremath_lv!(A,B,C)
    @tturbo for k in eachindex(A)
        C[k] = cos(log(sqrt(A[k] * B[k] / (A[k] + B[k]))))
    end
end
then sure enough
julia> @benchmark singlemoremath!(A,B,C)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 2.651 s … 2.652 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.651 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.651 s ± 792.423 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.65 s Histogram: frequency by time 2.65 s <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark singlemoremath_lv!(A,B,C)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
Range (min … max): 268.101 ms … 270.072 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 269.016 ms ┊ GC (median): 0.00%
Time (mean ± σ): 269.058 ms ± 467.744 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ █ ▁ ▁ ▁▁█ ▁ ██ ▁ ▁ ▁ ▁ ▁
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁█▁▁███▁█▁▁██▁▁▁█▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█ ▁
268 ms Histogram: frequency by time 270 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark threadedmoremath_lv!(A,B,C)
BenchmarkTools.Trial: 56 samples with 1 evaluation.
Range (min … max): 88.247 ms … 93.590 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 89.325 ms ┊ GC (median): 0.00%
Time (mean ± σ): 89.707 ms ± 1.200 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄ ▁ ▄█ ▄▄▄ ▁ ▄ ▁ ▁ ▁ ▁ ▁▄
█▁█▆██▆▆▆███▆▁█▁█▁▆▁█▆▁█▆▆▁▁▆█▁▁▁▁▁▁█▆██▁▁▆▆▁▁▆▁▁▁▁▁▁▁▁▁▁▆▆ ▁
88.2 ms Histogram: frequency by time 92.4 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Now we're closer to CPU-bound, and threading plus SIMD vectorization make the difference between 2.6 seconds and 90 ms!
If your real problem will be as memory-bound as the example problem, you may want to consider working on a GPU, on a server optimized for memory bandwidth, and/or using a package that puts a lot of effort into memory modelling.
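For the GPU route, one nice property of Julia is that a fused dot-broadcast compiles to a single kernel, so a type-generic one-liner covers both CPU Arrays and CUDA.jl CuArrays. A sketch (the CuArray usage assumes CUDA.jl is installed and an NVIDIA GPU is present; I have not benchmarked it on this problem):

```julia
# One fused pass over memory; on a CuArray the same line becomes one GPU kernel.
elementwise!(C, A, B) = (C .= A .* B ./ (A .+ B); C)

# On an NVIDIA GPU (untested sketch):
#   using CUDA
#   A, B = CUDA.rand(10_000, 10_000), CUDA.rand(10_000, 10_000)
#   C = CUDA.zeros(10_000, 10_000)
#   elementwise!(C, A, B)
```

GPUs typically offer several times the memory bandwidth of a desktop CPU, which is exactly what a kernel like this needs.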
Some other packages you might check out could include Octavian.jl (CPU), Tullio.jl (CPU or GPU), and GemmKernels.jl (GPU).
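Of those, Tullio.jl is the closest drop-in for this kind of elementwise kernel: `@tullio` generates the loop for you and multithreads it automatically for large arrays (and can pick up LoopVectorization's `@turbo` when that package is loaded). A sketch, assuming Tullio.jl is installed:

```julia
using Tullio

# Tullio equivalent of the original loop; threading kicks in automatically
# once the arrays are large enough.
function tulliooperation!(A, B, C)
    @tullio C[k] = A[k] * B[k] / (A[k] + B[k])
    return C
end
```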