Tags: parallel-processing, julia, distributed-computing, distributed-algorithm

Julia Distributed slows down to half the single-core performance when adding processes


I have a function func that takes ~50 s when run on a single core. Now I want to run it many times on a server with 192 CPU cores. But when I add worker processes, say 180 of them, the performance of each core drops: the slowest worker takes ~100 s to run func.

Can someone help me, please?

Here is the pseudocode:

using Distributed
addprocs(180)
@everywhere include("func.jl") # defines func in every process
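
func.jl itself is not shown in the question. Purely for illustration, a hypothetical func.jl with a roughly similar shape (a long, allocation-heavy loop) could look like the sketch below; it is not the real function and does not reproduce the exact timings or allocation counts reported further down.

# Hypothetical stand-in for func.jl -- illustrative only, not the code from the question.
function func()
    acc = 0.0
    for i in 1:10_000_000
        v = rand(100)      # a fresh allocation on every iteration keeps the GC busy
        acc += sum(v)
    end
    return acc
end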

First, try using only 10 workers:

@sync @distributed for i in 1:10
    func()          # warm-up pass: compile func on the workers first
end
@sync @distributed for i in 1:10
    @time func()    # timed pass: each worker prints its own timing
end

From worker #: 43.537886 seconds (243.58 M allocations: 30.004 GiB, 8.16% gc time)
From worker #: 44.242588 seconds (247.59 M allocations: 30.531 GiB, 7.90% gc time)
From worker #: 44.571170 seconds (246.26 M allocations: 30.338 GiB, 8.81% gc time)
...
From worker #: 45.259822 seconds (252.19 M allocations: 31.108 GiB, 8.25% gc time)
From worker #: 46.746692 seconds (246.36 M allocations: 30.346 GiB, 11.21% gc time)
From worker #: 47.451914 seconds (248.94 M allocations: 30.692 GiB, 8.96% gc time)

This seems fine when using 10 workers.
Now we use 180 workers:

@sync @distributed for i in 1:180
    func()
end
@sync @distributed for i in 1:180
    @time func()
end

From worker #: 55.752026 seconds (245.20 M allocations: 30.207 GiB, 9.33% gc time)
From worker #: 57.031739 seconds (245.00 M allocations: 30.176 GiB, 7.70% gc time)
From worker #: 57.552505 seconds (247.76 M allocations: 30.543 GiB, 7.34% gc time)
...
From worker #: 96.850839 seconds (247.33 M allocations: 30.470 GiB, 7.95% gc time)
From worker #: 97.468060 seconds (250.04 M allocations: 30.827 GiB, 6.96% gc time)
From worker #: 98.078816 seconds (250.55 M allocations: 30.883 GiB, 10.87% gc time)

The times increase almost linearly, from ~55 s for the fastest worker to ~100 s for the slowest.

I've checked with the top command that CPU usage does not seem to be the bottleneck (the idle value "id" stays above 2%), and neither does RAM usage (only ~20% used).
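
To see all the timings in one place rather than scattered @time printouts, a minimal sketch (assuming func is already defined on every worker, as above) is to collect each call's wall time with pmap and @elapsed:

using Distributed

# Run func 180 times across the workers and gather each call's wall time on the master.
times = pmap(1:180) do i
    @elapsed func()
end

println("fastest: ", minimum(times), " s")
println("slowest: ", maximum(times), " s")
println("mean:    ", sum(times) / length(times), " s")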

Other version information:

Julia Version 1.5.3
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz


Update:

  1. Substituting func with a minimal example (a simple for loop) does not change the slowdown.
  2. Reducing the number of processes to 192/2 alleviates the slowdown.

The new pseudocode is:

using Distributed
addprocs(96)

@everywhere function ss()
    s = 0.0                    # avoid shadowing Base.sum
    for i in 1:1_000_000_000
        s += sin(i)
    end
    return s
end

@sync @distributed for i in 1:10
    ss()
end
@sync @distributed for i in 1:10
    @time ss()
end

From worker #: 32.8 seconds ..(8 others).. 34.0 seconds

...
@sync @distributed for i in 1:96
    @time ss()
end

From worker #: 38.1 seconds ..(94 others).. 45.4 seconds

Solution

  • You are measuring the time it takes each worker to perform func() and observing a performance decrease for a single process when going from 10 to 180 parallel processes.

    This looks quite normal to me:

    • Intel CPUs use hyper-threading, so you actually have 96 physical cores (in more detail, a hyper-threaded core adds only 20-30% performance). That means 168 of your 180 processes have to share 84 hyper-threaded cores, while the remaining 12 processes get 12 full cores to themselves (one way to check the core counts and clock frequencies is sketched after this list).
    • The CPU clock speed is limited by thermal throttling (https://en.wikipedia.org/wiki/Thermal_design_power), and of course there is much more thermal headroom when running 10 processes than 180.
    • Your tasks are obviously competing for memory. Together they make over 5 TB of memory allocations (roughly 30 GiB × 180), and your machine has far less RAM than that. The last mile of garbage collection is always the most expensive one, so if the garbage collectors are squeezed and competing for memory, performance becomes uneven, with surprisingly long garbage-collection times.
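
    A quick way to check the first two points is sketched below. It assumes the Hwloc.jl package is installed (for the physical-core count) and that the Linux cpufreq sysfs files exist on this machine; both are assumptions, not givens.

    using Hwloc   # assumed installed: ] add Hwloc

    println("logical CPUs (hardware threads): ", Sys.CPU_THREADS)
    println("physical cores:                  ", Hwloc.num_physical_cores())

    # Watch the clock of the first few cores while the 180 workers are running;
    # it should drop if the CPU is thermally throttling.
    for cpu in 0:3
        path = "/sys/devices/system/cpu/cpu$(cpu)/cpufreq/scaling_cur_freq"
        isfile(path) && println("cpu$cpu: ", strip(read(path, String)), " kHz")
    end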

    Looking at this data, I would recommend trying:

    addprocs(192 ÷ 2)
    

    and seeing how the performance changes.
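
    To judge whether the extra processes still pay off overall, it may also help to time the whole batch rather than each call, since the number that ultimately matters is tasks completed per second. A minimal sketch, run once with 96 and once with 180 processes:

    # Total wall time for the whole batch; throughput = tasks / elapsed seconds.
    elapsed = @elapsed @sync @distributed for i in 1:180
        func()
    end
    println("throughput: ", 180 / elapsed, " tasks per second")

    Even though the per-task times grow, the throughput of the full batch is what tells you whether the extra hyper-threaded processes are worth it.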