
The best way to write code in Julia that works on GPUs via ArrayFire


In Julia, I have mainly seen that to accelerate and optimize code that works on a matrix, it is better to, e.g.:

- work by columns instead of by rows, because Julia stores matrices in column-major order (see the sketch after this list);

- in loops, use the @inbounds and @simd macros;

- any other function, macro, or method you could recommend is welcome :D
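
For example, this is the kind of CPU-tuned loop I mean (a minimal sketch; colsum! is just an illustrative name):

    # Sums each column of M into out, walking down columns so the
    # inner loop touches contiguous memory (column-major layout).
    function colsum!(out, M)
        @inbounds for j in 1:size(M, 2)
            s = zero(eltype(M))
            @simd for i in 1:size(M, 1)
                s += M[i, j]
            end
            out[j] = s
        end
        return out
    end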

But it seems that the above tips do not help when I use the ArrayFire package with a matrix stored on the GPU: similar code on the CPU and on the GPU does not seem to favor the GPU, which in some cases runs much slower. I don't think it should be like that, so I suspect the problem is in the way the code is written. Any help will be welcome.


Solution

  • GPU computing should be done with optimized GPU kernels as much as possible. Indexing a GPU array launches a small kernel that copies one value back to the CPU. This is really, really bad for performance, so you should almost never index a GPUArray unless you have to (this is true for any implementation! It's just a hardware problem!).
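
    For instance (a hypothetical sketch, assuming ArrayFire.jl is installed; rand on an AFArray type and sum are part of its documented API):

    using ArrayFire

    A = rand(AFArray{Float32}, 10^6)

    # Bad: every scalar index launches a tiny kernel and copies
    # one value from the device back to the CPU
    s = 0f0
    for i in 1:length(A)
        s += A[i]
    end

    # Good: a single optimized reduction kernel runs on the device
    s = sum(A)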

    Thus, instead of writing looping code for GPUs, you should write broadcasting ("vectorized") code. With the v0.6 broadcast changes, broadcasted operations are nearly as efficient as loops anyway (unless you hit this bug), so there's no reason to avoid them in generic code. Moreover, there are cases where broadcasting is faster than looping, and GPUs are one big case.

    Let me explain a little why. When you write:

    @. A = B*C + D*E
    

    it lowers to

    A .= B.*C .+ D.*E
    

    which then lowers to:

    broadcast!((b, c, d, e) -> b*c + d*e, A, B, C, D, E)
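
    (You can check the first step yourself: @macroexpand @. A = B*C + D*E prints the dotted form above.)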
    

    Notice that you now have a single fused anonymous function for the entire broadcast. For GPUArrays, broadcast! is then overloaded so that a single GPU kernel is automatically created to perform this fused operation element-wise. Thus only one GPU kernel is required for the whole operation! This is even more efficient than the R/Python/MATLAB way of doing GPU computing, since those vectorized forms allocate temporaries and would require four kernels here, while this version has no temporary arrays and launches a single kernel, which is pretty much exactly how you'd write it if you were writing the kernel yourself. So if you exploit broadcast, your GPU calculations will be fast.
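
    For example (a minimal sketch: the constructors below follow ArrayFire.jl's documented API as I understand it, and whether the fused broadcast compiles to literally one device kernel depends on the GPU array package; GPUArrays does this):

    using ArrayFire

    n = 10^6
    B = rand(AFArray{Float32}, n)
    C = rand(AFArray{Float32}, n)
    D = rand(AFArray{Float32}, n)
    E = rand(AFArray{Float32}, n)
    A = AFArray(zeros(Float32, n))

    # Fused: lowers to a single broadcast! call, no temporary arrays
    @. A = B*C + D*E

    # Unfused "vectorized" style: the two products are materialized
    # as temporaries before the + runs
    A2 = (B .* C) + (D .* E)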