How to optimize the running speed of nested loops in matlab

I need to process elements of a four-dimensional matrix in matlab. Where N is around 100. The program below is very time consuming to calculate, I wonder if there is a suitable way to simplify it somewhat?

for i1 = 1 : N 
    for j1 = 1 : N 

        t1 = conj(PF(i1, j1));

        if abs(t1) ~= 0
            for i2 = 1 : N 
                for j2 = 1 : N 
                
                    t2 = PF(i2, j2);
                    mumual_intensity(i2, j2, i1, j1) = 
mumual_intensity(i2, j2, i1, j1) * t1 * t2;
                end
            end
        end
    end
end

I understand that it is possible to modify index gridding to vector calculations, but I have not been able to successfully implement this myself.

Solution

First of all, you should always order loops so that the first array dimension changes fastest (i.e. is the inner loop). This is how data is stored in memory, and accessing data in memory order makes best use of the cache. So reordering your loops as follows should significantly speed up your code:

for j1 = 1 : N  % last dimension is outer loop
    for i1 = 1 : N  % second to last dimension

        t1 = conj(PF(i1, j1));

        if abs(t1) ~= 0
            for j2 = 1 : N  % second dimension
                for i2 = 1 : N  % first dimension is inner loop
                
                    t2 = PF(i2, j2);
                    mumual_intensity(i2, j2, i1, j1) = mumual_intensity(i2, j2, i1, j1) * t1 * t2;
                end
            end
        end
    end
end

Assuming PF is an array, we can vectorize the multiplication easily. .* is the element-wise multiplication. mumual_intensity .* PF will automatically expand t2 to the same size as mumual_intensity by adding two dimensions at the end and replicating data along those dimensions. That is, each mumual_intensity(i2, j2, :, :) will be multiplied by PF(i2, j2).

For the other multiplication (by t1), we must first add two dimensions at the start, which we can do with shiftdim(PF,-2). Thus, the loop can be replaced by:

mumual_intensity = mumual_intensity .* conj(shiftdim(PF,-2)) .* PF;

On my M1 Mac with MATLAB R2023b, with N=100 and random double arrays, I see ~0.3s for the original code, ~0.2s with reordered loops, and ~0.08s for the vectorized code. The differences are not very big, but for sufficiently large N any gain is good.