arrays performance matlab parallel-processing vectorization

Matlab: Speeding up large array operations - vectorization?

I am looking to significantly speed up the below code that involves operations on large arrays (specifically - the array called "proj"). The calculation of "proj" itself is fast enough compared to the rest of the code - see the "variables" section. However, operations with "proj" seem to be taking a lot of time.

There are the three versions of the code below, all in their own sections titled "original code, nested for loops", "making proj 1d before for loops", and "vectorization attempt, making proj 4d". All three methods seem to offer similar speeds (very slow). Although there are parts of the code that are not optimal, a time profile shows that the bottlenecks for all three versions of the code have to do with operations on "proj", which take up about >90% of the computation time (calculation of temp and result, lines 38-39, 61-62, 80-81).

I have also attempted to use parfor on some of the outer loops, with minimal improvement. In the future, I will be increasing the size of the variables D, A, B, C, and xyz, so time savings are critical here.

%% variables

D = 5;
A = 360/D;
B = 0.2:0.5:5.2;
C = 100;
xyz = rand(500,3);

qx = cos(deg2rad(D).*(1:A));
qy = sin(deg2rad(D).*(1:A));

Rij = zeros(size(xyz,1),size(xyz,1),3);
for i1 = 1:size(Rij,3)
    for i2 = 1:size(Rij,2)
        for i3 = 1:size(Rij,1)
            Rij(i3,i2,i1) = xyz(i2,i1)-xyz(i3,i1);
        end
    end
end

proj = zeros(size(xyz,1),size(xyz,1),A);
Ccount = zeros(A,C);
for i1 = 1:A
    proj(:,:,i1) = Rij(:,:,1)*qx(i1) + Rij(:,:,2)*qy(i1);
    [Ccount(i1,1:C),~,~] = histcounts(proj(:,:,i1),C);
end

%% original code, nested for loops

tic;

result1 = zeros(A,length(B));

for i1 = 1:A
    for i2 = 1:length(B)
        temp = zeros(size(xyz,1),size(xyz,1));
        for i3 = 1:C
            temp = temp + (1./size(xyz,1)).*Ccount(i1,i3).*...
                cos(B(i2).*proj(:,:,i1));
        end
        temp = sum(temp,[1 2],'omitnan');
        result1(i1,i2) = temp;
    end
end
timer1 = toc;
disp(['time: ' num2str(round(timer1,1)) ' s']);

%% making proj 1D before for loops

tic;

proj1d = proj(:);
result2 = zeros(A,length(B));

for i1 = 1:A
    for i2 = 1:length(B)
        temp = zeros(size(xyz,1)*size(xyz,1),1);
        for i3 = 1:C
            i4 = i1*size(xyz,1).^2;
            i5 = i4 - size(xyz,1).^2 + 1;
            temp = temp + (1./size(xyz,1)).*Ccount(i1,i3).*...
                cos(B(i2).*proj1d(i5:i4));
        end
        temp = sum(temp,[1 2],'omitnan');
        result2(i1,i2) = temp;
    end
end
timer1 = toc;
disp(['time: ' num2str(round(timer1,1)) ' s']);

%% vectorization attempt (making proj 4d)

tic;

proj4d = repmat(proj,1,1,1,length(B));
B4d = repmat((permute(B,[1 3 4 2])),size(xyz,1),size(xyz,1),A);
temp = zeros(A,length(B));

for i3 = 1:C
    temp = temp + 1./size(xyz,1).*Ccount(:,i3).*...
        squeeze(sum(cos(proj4d.*B4d),[1 2],'omitnan'));
end
result3 = temp;
timer1 = toc;
disp(['time: ' num2str(round(timer1,1)) ' s']);

Solution

In your original "nested for loops" code, the inner loop recomputes over and over again the same things. Computations that do not depend on the inner loop variable, i3, should be moved out of the inner loop:

tic;
result1 = zeros(A,length(B));
for i1 = 1:A
    for i2 = 1:length(B)
        temp = zeros(size(xyz,1),size(xyz,1));
        for i3 = 1:C
            temp = temp + (1./size(xyz,1)) .* Ccount(i1,i3) .* ...
                cos(B(i2).*proj(:,:,i1));
        end
        temp = sum(temp,[1 2],'omitnan');
        result1(i1,i2) = temp;
    end
end
toc

turns into:

tic;
result2 = zeros(A,length(B));
for i1 = 1:A
    for i2 = 1:length(B)
        temp = zeros(size(xyz,1),size(xyz,1));
        for i3 = 1:C
            temp = temp + (1./size(xyz,1)) .* Ccount(i1,i3);
        end
        temp = temp .* cos(B(i2).*proj(:,:,i1));
        temp = sum(temp,[1 2],'omitnan');
        result2(i1,i2) = temp;
    end
end
toc

This reduces the execution time on my machine from 54.5 s to 3.2 s.

We now note that what we're accumulating in temp is a scalar, it only becomes a matrix when combining the result with proj(:,:,i1). So let's keep temp a scalar:

tic;
result2 = zeros(A,length(B));
for i1 = 1:A
    for i2 = 1:length(B)
        temp = 0;
        for i3 = 1:C
            temp = temp + (1./size(xyz,1)) .* Ccount(i1,i3);
        end
        temp = temp .* cos(B(i2).*proj(:,:,i1));
        temp = sum(temp,[1 2],'omitnan');
        result2(i1,i2) = temp;
    end
end
toc

This now takes 0.69 s.

But the division by size(xyz,1) is also independent of i3, we can factor that out as well. So the inner loop just sums over Ccount(i1,:), which we don't need a loop for:

tic;
result2 = zeros(A,length(B));
for i1 = 1:A
    for i2 = 1:length(B)
        temp = sum(Ccount(i1,:)) ./ size(xyz,1);
        temp = temp .* cos(B(i2).*proj(:,:,i1));
        temp = sum(temp,[1 2],'omitnan');
        result2(i1,i2) = temp;
    end
end
toc

This last change makes the code simpler, so it's good, but actually doesn't make a significant dent in the computation time, since other computations are more expensive.

Next, we note that the first line of the inner loop again doesn't depend on the inner loop variable, and can be done outside:

tic;
result2 = zeros(A,length(B));
for i1 = 1:A
    x = sum(Ccount(i1,:)) ./ size(xyz,1);
    for i2 = 1:length(B)
        temp = x .* cos(B(i2).*proj(:,:,i1));
        result2(i1,i2) = sum(temp,[1 2],'omitnan');
    end
end
toc

I'll leave it here. We're now at 0.63 s.

result2 is not identical to result1, because we changed the order of computation, we're getting different rounding errors. The differences should be near the precision of the double floating-point values, eps(result1):

max( abs( (result1-result2) ./ eps(result1) ), [], 'all' )

returns 2.