Tags: performance, cuda, julia, gpu

Optimizing a simulation in CUDA.jl


I'm trying to write a tutorial on GPU computing in Julia. Everything went fine when demonstrating simple matrix operations, with the GPU versions blowing their single-threaded and multi-threaded equivalents out of the water.

Now I am trying to come up with a more complex example, involving the generation of simulated data X and the calculation of some estimate β, and that's when things get weird. No matter what I do, the GPU simulation (on an Nvidia RTX 2070) performs some 20 times worse than its multi-threaded counterpart (20 threads).

[benchmark plot]

Here's some code for an MRE:

# Meta-simulation constants =================================
replications = 10
n = 100
p = 2
μ = rand(replications)

# Multi-threaded simulations =================================
β_par = fill(0., (p, replications))
function parsim()
  Threads.@threads for r in 1:replications
    X = rand(Float16, (n, p)) .* μ[r]; # Sample data
    β = sum(X .^ 2, dims = 1);   # Estimate parameters
    β_par[:, r] = β
  end
end

# GPU simulations =================================
using CUDA
β_gpu = CuArray(fill(0., (p, replications)))
function gpusim()
  for r in 1:replications
    X = CuArray(rand(Float16, (n, p))) .* μ[r]; # Sample data
    β = sum(X .^ 2, dims = 1);   # Estimate parameters
    β_gpu[:, r] = vec(β)
  end
end
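
For anyone reproducing the comparison: a harness along these lines should work (just a sketch; it assumes BenchmarkTools.jl is installed and that Julia was started with multiple threads, e.g. julia -t 20, and CUDA.@sync makes sure the GPU work has actually finished before the timer stops):

using BenchmarkTools, CUDA

@btime parsim()             # multi-threaded CPU version
@btime CUDA.@sync gpusim()  # GPU version; sync so the timing covers the kernels, not just their launch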

I've spent over a dozen hours trying to get gpusim() to perform at least as well as parsim(). I've read the CUDA.jl and LinearAlgebra.jl docs countless times, trying to figure out whether I really needed to write my own kernel. After realizing I had little idea of what I was doing, I'd go back to looking for a simpler solution using linear algebra. Then I'd read more, get more confident, and try writing a kernel again. Rinse and repeat.

ChatGPT helped me find my bearings a bit, but its complicated solutions were no better than my naive gpusim() above.

Hopefully humans are still better at helping me understand how (and whether) I can get gpusim() to outperform parsim() for the task above.


Solution

  • Thanks to the tips from @paleonix and @loonatick, I was finally able to come up with a GPU function that performs as expected.

    The trick, as I suspected but had failed to implement, is to restructure the operations to minimize CPU-GPU communication and to replace loops with array operations (mental note: use 3D arrays more often in my work).

    I'm sure there are more improvements to be made to the solution below, but at least this gets the job done.

    replications = 100
    n = 100_000
    p = 200
    μ = rand(replications)
    
    using CUDA
    β_gpu = CUDA.fill(0., (1, p, replications))
    μ_gpu = CuArray(reshape(repeat(μ, inner = n * p), (n, p, replications)))
    
    function gpusim2!(β, n)
      _, p, reps = size(β)  # β has shape (1, p, replications)
      X = CUDA.rand(Float16, (n, p, reps)) .* μ_gpu  # Sample data for all replications at once
      X² = X .^ 2
      β .= sum(X², dims = 1)  # Estimate parameters
    end
    

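    To run the whole batch and pull the estimates back, something like this should do (again just a sketch; n, p and replications are the constants defined just above, and Array() copies the results back to the CPU once the kernels have finished):

    CUDA.@sync gpusim2!(β_gpu, n)
    β_hat = Array(β_gpu)  # (1, p, replications) array of estimates on the CPU

    As an aside, broadcasting against reshape(CuArray(μ), 1, 1, :) should give the same per-replication scaling without materializing the full n × p × replications copy of μ that μ_gpu holds, though that variant isn't benchmarked here.
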
    Using this and the code in the OP, this is the performance I get (using 20 CPU threads):

    [benchmark plot]

    I'm really happy that even with a small example I'm able to showcase the power of GPU computing. I've tried applying the same tricks to my seqsim() function (not shown here; it runs the simulation sequentially) and managed to double its performance, but it was still at a couple hundred ms.