
Understanding Julia multi-thread / multi-process design


I would like to convert some of my scattering-problem code from MATLAB and Python to Julia. But after a day of reading the docs, searching on Phind, and reading posts like Parallel computing in Julia - running a simple for-loop on multiple cores, I still cannot understand Julia's parallelism, nor MATLAB's for that matter.

TL;DR: Let's forget about the MATLAB vs Julia comparison. Could someone please explain why @distributed is 3 to sometimes 4 times faster than @threads here? If you want to know what led me to this question, or want to copy the following code, please read the rest. Thank you.

threads vs distributed

In Julia's Manual we can see:

Julia's multi-threading provides the ability to schedule Tasks simultaneously on more than one thread or CPU core, sharing memory. This is usually the easiest way to get parallelism on one's PC or on a single large multi-core server. Julia's multi-threading is composable. When one multi-threaded function calls another multi-threaded function, Julia will schedule all the threads globally on available resources, without oversubscribing. https://docs.julialang.org/en/v1/manual/parallel-computing/

If I understand it correctly, all it says is that when there is only one server, parallelism can be achieved by means of threads. So if someone wants to convert a MATLAB parfor loop to its Julia equivalent, then @threads will do the job, right? Wrong!

Imagine I want to convert this Matlab (2023a) code to Julia (1.9.2) code:

tic
k = zeros(10000, 100, 100);
for i = 1:10000 
    k(i, :, :) = rand(100, 100)*rand(100, 100)* ...
    rand(100, 100)*rand(100, 100)*rand(100, 100)* ...
    rand(100, 100)*rand(100, 100)*rand(100, 100)*...
    rand(100, 100)*rand(100, 100);
end
toc

The equivalent is (roughly) the following (@fastmath makes the performance worse):

function f()
    k = zeros(10000, 100, 100)
    @simd for i = 1:10000
        @inbounds k[i, :, :] = rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100))
    end
    return k
end

@time t = f();

I am not using @btime because, after setting BenchmarkTools.DEFAULT_PARAMETERS.samples = 1, it crashes my Jupyter Lab (and also VS Code without Jupyter) for some reason.

Anyway, my expectation was to get roughly the same execution time after compiling the Julia code once, even though all I keep hearing and seeing is that Julia's for loop is faster. That turned out to be wrong: the MATLAB version is much more performant on my Intel Core i7-8550U (4 cores, 8 logical threads), roughly 15~20% faster. So I looked at my CPU usage and saw that MATLAB utilizes some form of parallelism automatically, using ~90% of the CPU. I told myself that maybe the MATLAB JIT is clever and I am comparing apples to oranges, so I changed the Julia code to the following:

function f()
    k = zeros(10000, 100, 100)
    Threads.@threads for i = 1:10000
        @inbounds k[i, :, :] = rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100))
    end
    return k
end

@time t = f();

And I got almost the same performance! "OK, that is it," I told myself: MATLAB uses parallelism under the hood automatically, and Julia doesn't. Fine. But I was wrong again. I changed the MATLAB code to

tic
k = zeros(10000, 100, 100);
parfor i = 1:10000 
    k(i, :, :) = rand(100, 100)*rand(100, 100)* ...
    rand(100, 100)*rand(100, 100)*rand(100, 100)* ...
    rand(100, 100)*rand(100, 100)*rand(100, 100)*...
    rand(100, 100)*rand(100, 100);
end
toc

and it became 4 times faster than before! But wait, wasn't MATLAB already using parallelism? Why is this code faster, then? We will come back to that later; let's talk about Julia for now. I had just followed the documentation and written parallel code that I thought would be as performant as the MATLAB code, yet MATLAB ended up being 4 times faster. What was wrong? After a bit of digging, I changed the Julia code to the following:

using Distributed, SharedArrays

addprocs(8)

function f()
    k = SharedArray{Float64}(10000, 100, 100)
    @sync @distributed for i = 1:10000
        @inbounds k[i, :, :] = rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100)) *
                               rand(Float64, (100, 100)) * rand(Float64, (100, 100))
    end
    return k
end

@time t = f();

And I got the same performance as the parfor MATLAB code. OK, so far so good: when one wants to achieve parallelism, one should use @distributed then. But

  1. What is up with the Julia documentation then? Why does it sound as if, when we have one server/PC, @threads would somehow be equivalent to @distributed? I am not the only one confused by these macros, and all the suggestions I see on the internet amount to trying both and seeing which one works better for a given problem! Here are some examples: https://www.reddit.com/r/Julia/comments/cynbbb/question_multiprocessing_vs_multithreading/ https://discourse.julialang.org/t/overhead-of-distributed-module-vs-threads/21588 This makes no sense to me, though. There should be one proper way of doing things for a given problem. I understand that more eccentric macros like @fastmath may need testing, but it makes no sense to me that there is no "proper" way of doing parallelism in Julia. Maybe there is, but I cannot understand it from the documentation.

  2. Why doesn't Julia use this multi-threading under the hood, just like MATLAB does? Is there some kind of language limitation, or is it by design, and if so, why? Now, every time I want to convert some non-parallel (without parfor) MATLAB code to Julia, I need to add @threads to get the same performance. Isn't this redundant?

  3. It does not make sense to me that a normal MATLAB loop and multi-threaded Julia code both use almost 4 cores, yet MATLAB's parfor and multi-process Julia code outperform them significantly (4 times!) in terms of speed. In both scenarios they are using the same amount of resources (CPU-wise, not RAM), so where does this performance gain come from?! I have 24 GB of RAM, so that is not a bottleneck here.

And yes, I literally started learning Julia yesterday, so go easy on me.


Solution

  • In computing (using Julia syntax in the points below) you have the following parallelism types (a minimal sketch contrasting the two loop macros follows this list):

    • @simd - utilizes the Single Instruction, Multiple Data (SIMD) feature of a CPU
    • @async - coroutines, aka green threads (good for parallelizing IO)
    • @threads - multithreading; requires setting the JULIA_NUM_THREADS environment variable or the -t parameter when starting Julia
    • @distributed - multiprocessing on a single machine or on multiple machines; needs the -p command line option, the --machine-file option, or the addprocs() command
    • GPU computing - perhaps start with CUDA.jl
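
    To make the distinction concrete, here is a minimal sketch contrasting the two loop macros (my own illustrative example, not from the original post). It assumes Julia was started with both threads and worker processes available, e.g. julia -t 4 -p 4:

    using Distributed

    # Multithreaded loop: all threads share the array `out`;
    # each index is written exactly once, so no locking is needed.
    function squares_threads(n)
        out = zeros(n)
        Threads.@threads for i in 1:n
            out[i] = i^2
        end
        return out
    end

    # Multiprocess reduction: each worker computes a chunk of the range
    # in its own memory space; the partial sums are combined with (+).
    squares_distributed(n) = @distributed (+) for i in 1:n
        i^2
    end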

    Now, having this knowledge, I will briefly answer your questions:

    Ad 1.

    Multi-threading is on a single machine.

    Multi-processing can occur on a single machine or multiple machines.

    Multiprocessing on multiple machines is called distributed computing. Julia has basically the same API for multiprocessing and distributed computing - both are available via using Distributed.

    Finally note that in multi-processing a process can be single or multi-threaded.

    The above concepts are not specific to Julia in any way (although Julia has nice APIs to achieve such functionality). You definitely should have a look at: https://en.wikipedia.org/wiki/Thread_(computing)#Threads_vs_processes
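
    As a quick sanity check (the calls below are standard Threads/Distributed API; the session itself is my illustration), you can inspect both kinds of parallel resources from the REPL:

    using Distributed

    Threads.nthreads()   # threads inside this process (shared memory)
    nworkers()           # separate worker processes (no shared memory)
    addprocs(2)          # spawn two more local worker processes at runtime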

    Ad 2.

    Julia's BLAS for linear algebra uses multi-threading independently of the Julia configuration; try LinearAlgebra.BLAS.get_num_threads(). Some other Julia libraries also use multi-threading, and this depends on the value of the JULIA_NUM_THREADS environment variable or the -t parameter. So several libraries do use it under the hood. However, writing multi-threaded code (in any programming language, Julia included) is difficult due to parallelization problems such as deadlocks, race conditions, or false sharing. It is not auto-magical and requires several hours spent understanding the concepts.
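
    For example, you can check and control the two thread pools separately (a sketch using the LinearAlgebra.BLAS API; pinning BLAS to one thread is a common recommendation, not something from the original answer):

    using LinearAlgebra

    BLAS.get_num_threads()   # BLAS threads, configured independently of Julia's
    Threads.nthreads()       # Julia's own threads (-t / JULIA_NUM_THREADS)

    # When mixing BLAS calls with Threads.@threads, it is common to pin BLAS
    # to a single thread so the two thread pools do not oversubscribe the CPU.
    BLAS.set_num_threads(1)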

    Ad 3.

    Julia code can outperform MATLAB code. However, writing performant code is not easy. Perhaps start by testing all the examples from this list yourself: https://docs.julialang.org/en/v1/manual/performance-tips/ For the code that you are running here, Threads.@threads seems to be the best approach (assuming that you plan to use around 8-16 threads).

    Note on the code

    If you run this long enough, your main problem is allocations. You could use in-place multiplication and thereby allocate much less memory (and trigger fewer garbage-collector events).

    using LinearAlgebra, Random

    begin
        Y = Matrix{Float64}(undef, 100, 100)  # preallocated output buffer
        A = rand(100, 100)                    # running product
        B = Matrix{Float64}(undef, 100, 100)  # buffer for the next random factor
        for i = 1:10
            rand!(B)          # refill B in place, no new matrix allocated
            mul!(Y, A, B)     # Y = A * B without allocating the result
            if i < 10
                A .= Y        # copy Y back into A for the next iteration
            end
        end
        Y
    end;
    

    The code above makes (6 allocations: 234.52 KiB), while the naive chain of multiplications makes (38 allocations: 1.45 MiB). This will make a noticeable difference when the workload is huge.
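
    For completeness, here is one way (my own sketch, not from the original answer) to drop this in-place kernel into the threaded loop from the question; each iteration keeps its own scratch buffers, so threads never share mutable state:

    using LinearAlgebra, Random

    function f_inplace()
        k = zeros(10000, 100, 100)
        Threads.@threads for i = 1:10000
            A = rand(100, 100)                    # first of the 10 factors
            B = Matrix{Float64}(undef, 100, 100)  # scratch for the next factor
            Y = Matrix{Float64}(undef, 100, 100)  # scratch for the running product
            for j = 1:9                           # multiply in the other 9 factors
                rand!(B)
                mul!(Y, A, B)
                A .= Y
            end
            @inbounds k[i, :, :] = A
        end
        return k
    end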

    EDIT

    Just tell me why distributed is 4 times faster than threads in Julia

    OK, this is tricky. There are a few possible reasons (and most likely they combine to produce the observed behavior):

    • Your code performs a huge number of allocations, which kicks in the garbage collector. It is always easier to perform garbage collection in 8 separate, disconnected processes than in a single multi-threaded one. When lots of garbage collection is involved, it might be better to use @distributed than Threads.@threads. Additionally, from my empirical observations, beyond 8 jobs Julia processes seem to scale better than threads. That is why you should consider in-place operations (LinearAlgebra.mul!) as suggested by myself and DNF.
    • As noted by myself and DNF, LinearAlgebra.BLAS is not composable with Julia's threading mechanism. This means that the 8 processes end up utilizing, e.g., a total of 32 BLAS threads (depending on your machine). This, combined with the garbage collector, makes the multi-threaded and distributed scenarios harder to compare.

    In conclusion, for garbage-collector-heavy jobs you will generally see better performance with multi-processing (using Distributed) than with multi-threading.
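
    One quick way to check whether the GC really dominates (a hypothetical diagnostic, not part of the original answer) is the gctime field returned by @timed:

    # f is assumed to be the threaded function from the question
    stats = @timed f()
    stats.gctime / stats.time   # fraction of wall time spent in GC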

    EDIT 2

    MATLAB is using MKL and MersenneTwister. When benchmarking, it might be worth writing the Julia code the same way.

    Multiplying two rands (which uses Xoshiro256++ and OpenBLAS) yields the following result:

    julia> @btime rand(100,100)*rand(100,100);
      118.500 μs (6 allocations: 234.52 KiB)
    

    Let's try MersenneTwister and MKL (see also https://github.com/JuliaLinearAlgebra/MKL.jl ):

    julia> using MKL, Random
    
    julia> mt = MersenneTwister();
    
    julia> @btime rand($mt, 100,100)*rand($mt, 100,100);
      68.900 μs (6 allocations: 234.52 KiB)
    

    It is worth knowing that switching this code to MKL and MersenneTwister is a 2x speed-up.