Tags: performance, matlab, gpu, vectorization, bsxfun

Vectorizing nested loops in MATLAB using bsxfun and the GPU


For-loops seem to be extremely slow in MATLAB, so I was wondering whether the nested loops in the code shown next could be vectorized using bsxfun, and whether the GPU could be brought in as well.

Code

%// Parameters
i = 1;
j = 3;
n1 = 1500;
n2 = 1500;

%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;

%// Nested Loops - I 
for x = 1:n1
    for y = 1:n1
        num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
        LInc(x, y) = L1(x, y) + (num/denom);
        LInc(y, x) = LInc(x, y);
    end
end

%// Nested Loops - II
for x = 1:n1
    for y = 1:n2
        num = (n1 * n * L1(x,i)) + (n2 * n * L2(y,j)) - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1)));
        LInc(x, n1+y) = num/denom;
        LInc(n1+y, x) = LInc(x, n1+y);
    end
end

Edit 1: n and denom can be assumed to be constants too.


Solution

  • Here are vectorized CPU and GPU versions of the code; I have tried to follow good practices for the GPU code and for the benchmarking later on.
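
    The core trick: the pairwise term L1(x,i) + L1(y,i) over all (x,y) is a
    column vector added to its own transpose, which bsxfun(@plus, ...) computes
    in one shot by expanding the singleton dimensions. A minimal illustration:

    a = (1:3).';           %// 3x1 column vector
    b = 10*(1:3);          %// 1x3 row vector
    bsxfun(@plus, a, b)    %// 3x3 result with element (x,y) = a(x) + b(y)
    %// ans =
    %//     11    21    31
    %//     12    22    32
    %//     13    23    33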

    CPU Code

    %// Pre-allocate for output
    LInc(n1+n2,n1+n2)=0;
    
    %// Calculate num/denom value for stage 1 and 2
    nd1 = L1 + (((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - n2*n*bsxfun(@plus,L1(:,i),L1(:,i).'))./denom;
    nd2 = (bsxfun(@plus,n1*n*L1(:,i),n2*n*L2(:,j).') - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1))))./denom;
    
    %// Plug in the values in the output matrix
    LInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).';
    LInc(n1+1:end,1:n1) = nd2.';
    LInc(1:n1,n1+1:end) = nd2;
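
    For the benchmarks further down, each variant is wrapped in a function:
    modp0 (the original loops), modp1 (vectorized CPU) and modp2 (vectorized
    GPU). The wrappers themselves aren't listed; a minimal sketch of modp1,
    assuming L1 is n1 x n1 and L2 has at least j rows and j columns as in the
    benchmark script:

    function LInc = modp1(i,j,n1,n2,L1,L2,n,denom)
    %// Vectorized CPU version wrapped as a function for timeit
    LInc(n1+n2,n1+n2) = 0; %// pre-allocate output
    nd1 = L1 + (((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - n2*n*bsxfun(@plus,L1(:,i),L1(:,i).'))./denom;
    nd2 = (bsxfun(@plus,n1*n*L1(:,i),n2*n*L2(:,j).') - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1))))./denom;
    LInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).';
    LInc(n1+1:end,1:n1) = nd2.';
    LInc(1:n1,n1+1:end) = nd2;
    end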
    

    GPU Code

    %// Pre-allocate for output
    gLInc = zeros(n1+n2,n1+n2,'gpuArray');
    
    %// Convert to gpu arrays
    gL1 = gpuArray(L1);
    gL2 = gpuArray(L2);
    
    %// Calculate num/denom value for stage 1 and 2
    nd1 = gL1 + (((n2 ^ 2) * (gL1(i, i) + gL2(j, j) + 1)) - n2*n*bsxfun(@plus,gL1(:,i),gL1(:,i).'))./denom;
    nd2 = (bsxfun(@plus,n1*n*gL1(:,i),n2*n*gL2(:,j).') - ((n1 * n2 * (gL1(i, i) + gL2(j, j) + 1))))./denom;
    
    %// Plug in the values in the output matrix
    gLInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).';
    gLInc(n1+1:end,1:n1) = nd2.';
    gLInc(1:n1,n1+1:end) = nd2;
    
    %// Gather data from GPU back to CPU
    LInc = gather(gLInc);
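
    A timing note: gputimeit (used below) synchronizes with the GPU for you,
    but if you ever time a GPU section manually with tic/toc, add an explicit
    wait so the asynchronous GPU calls finish before toc fires. A sketch:

    gd = gpuDevice;    %// handle to the currently selected GPU
    tic
    %// ... GPU computation goes here ...
    wait(gd)           %// block until all queued GPU work has completed
    toc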
    

    Benchmarking

    GPU benchmarking tips were taken from Measure and Improve GPU Performance.

    %// Warm up the GPU with a call on trivially small scalar inputs,
    %// in case gputimeit doesn't already do this
    temp1 = modp2(1,1,1,1,1,1,1,1); %// modp2 is the vectorized GPU code
    
    i = 1;
    j = 3;
    n = 1000; %// Assumed
    denom = 1e6;  %// Assumed
    
    N_arr = [50 100 200 500 1000 1500]; %// array elements for N (datasize)
    timeall = zeros(3,numel(N_arr));
    
    for k1 = 1:numel(N_arr)
        N = N_arr(k1);
        n1 = N;  %// n1 and n2 are kept equal to simplify benchmarking
        n2 = N;

        L1 = rand(n1,n1);
        L2 = rand(n2,j);

        f = @() modp0(i,j,n1,n2,L1,L2,n,denom);%// Original CPU w/ pre-allocation
        timeall(1,k1) = timeit(f);
        clear f

        f = @() modp1(i,j,n1,n2,L1,L2,n,denom);%// Vectorized CPU code
        timeall(2,k1) = timeit(f);
        clear f

        f = @() modp2(i,j,n1,n2,L1,L2,n,denom);%// Vectorized GPU (GTX 750 Ti) code
        timeall(3,k1) = gputimeit(f);
        clear f
    end
    
    %// Display benchmark results
    figure,hold on, grid on
    plot(N_arr,timeall(1,:),'-b.')
    plot(N_arr,timeall(2,:),'-ro')
    plot(N_arr,timeall(3,:),'-kx')
    legend('Original CPU','Vectorized CPU','Vectorized GPU (GTX 750 Ti)')
    xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')
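
    Before trusting the timings, it is worth verifying that the vectorized
    versions reproduce the loop output; a quick check, using the wrappers
    sketched above:

    ref = modp0(i,j,n1,n2,L1,L2,n,denom); %// original loop version
    out = modp1(i,j,n1,n2,L1,L2,n,denom); %// vectorized CPU version
    max(abs(out(:) - ref(:)))             %// should be at round-off level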
    

    Results

    [Benchmark plot: runtime vs. datasize N for the original CPU, vectorized CPU, and vectorized GPU versions]

    Conclusions

    Results show that the vectorized GPU code scales well with datasize: at small sizes it is slower than both the original and the vectorized CPU code, but at the largest sizes it becomes about twice as fast as the vectorized CPU code.