Improve efficiency matlab

I have a matlab code that is very inefficient and I need to run it several times. The code is basically a big parfor loop which I guess there is almost impossible to get around.

The code starts by loading several parameters and 4-D matrices, and then needs to make a few interpolations. All need to be done 5000 times (thus the parfor loop).

Here is how the code look like. I simplified the code the most I could without taking out the key ingredients.

load file

nsim = 5000
T = 12; 
N = 1000;

cumQx = cumsum(Qx);
cumQz = cumsum(Qz);
cumQs = cumsum(Qs);

for k=1:nsim
st(k).ksim    = kstar*ones(N, T);
st(k).Vsim  = zeros(N,T);  
st(k).Psim = zeros(N,T);   
end


parfor k = 1:nsim    

    sysrand  = rand(T, 1);
    idiorand = rand(N, T);
    sigmarand = rand(T,1);

    xid = zeros(T, 1);
    zid = zeros(N, T);
    sid = zeros(T,1);
    xid(1) = 8;
    zid(:, 1) = 5;
    sid(1) = 1;
    % Initializing the simulation

    simx    = zeros(T,1);
    zsim    = ones(N,T)*zbar;
    simsx    = zeros(T,1);

    % Construct 3-D grid using 'ndgrid'
    [ks,zs] = ndgrid(kgrid,z);

    for j = 2:T
            sid(j) = find(cumQs(:, sid(j-1)) >= sigmarand(j), 1); 
            simsx(j-1) = sigmax(sid(j));

             xid(j) = find(cumQx(:, xid(j-1)) >= sysrand(j), 1); 
             simx(j-1) = x(xid(j));

             for n = 1:N
                 zid(n, j)   = find(cumQz(:, zid(n, j-1)) >= idiorand(n, j), 1); 
                 zsim(n,j-1) = z(zid(n, j));
             end
            st(k).ksim(:,j)      = interpn(ks, zs , squeeze(kprime(:,xid(j),:,sid(j))),   st(k).ksim(:,j-1),zsim(:,j-1),'linear');       % K
            st(k).Vsim(:,j)      = interpn(ks, zs , squeeze(V(:,xid(j),:,sid(j))),        st(k).ksim(:,j-1),zsim(:,j-1),'linear');       % V
            st(k).Psim(:,j)      = interpn(ks, zs , squeeze(P(:,xid(j),:,sid(j))),        st(k).ksim(:,j-1),zsim(:,j-1),'linear');       % P


    end; 

end

Here is a link for the matrices needed to run the code: http://www.filedropper.com/file_406

Is there any better way of doing this that would significantly reduce the computation time? My guess is no ... Ideally there woul be a way of vectorizing the k = 1:nsim loop.

Solution

In order to time and test your code, I removed the parfor loop and replaced it with a for loop, then used the MATLAB profiler. I used nsims = 500 for the purpose of my tests.

Using the profiler, I found two key bottlenecks in your code. The first is the find() function in the most nested for loop (n-loop). The second is the three calls to the interpn() function. Those 4 lines use +88% of the computation time.

The find function is slower than desirable in this circumstance due to the overhead of the function call (especially considering the number of calls it receives in the nested loops), as well as the built-in error checking and management. Replacing the find function with a hard-coded binary search (below) greatly improves performance, and this was only replacing the find in the n-loop. Using find for nsims = 500 resulted in a run time of 29.8 seconds. Using the binary search, the run time was 12.1 seconds.Note: this only works because your data is sorted, this code cannot replace find in every instance. EDIT: Using the alternate approach in the other answer by @EBH is a cleaner way to do this.

%perform binary search (same as your find function)
searchVal=idiorand(n, j);
il = 1;
iu = sizeCumQZ; % should be defined outside the loop as size(cumQz,1)
while (il+1<iu)
    lw=floor((il+iu)/2); % split the upper index
    if cumQz(lw,zid(n, j-1)) >= searchVal
        iu=lw; % decrease lower_index_b (whose x value remains \geq to lower bound)
    else
        il=lw; % increase lower_index_a (whose x value remains less than lower bound)
    end
end
if cumQz(il,zid(n, j-1))>=searchVal
    zid(n,j) = il;
else
    zid(n,j) = iu;
end

The interpn function is considerably slowed by method checking, error management and data format management. The roughly 100 lines of code used in the standard interpn can be cut down to 2 lines per call and significantly increase performance by knowing we only need one type of interpolation, and that our data conforms to a specific format. To do this we use the griddedInterpolant function directly (see below). Again with nsims = 500, using the interpn function (still using the un-changed find code) has a run time of 29.8 seconds. Using the refined method, below, decreased run time to 20.4 seconds.

Refined method replaces the calls to interp, shown here

st(k).ksim(:,j)      = interpn(ks, zs , squeeze(kprime(:,xid(j),:,sid(j))),   st(k).ksim(:,j-1),zsim(:,j-1),'linear');       % K
st(k).Vsim(:,j)      = interpn(ks, zs , squeeze(V(:,xid(j),:,sid(j))),        st(k).ksim(:,j-1),zsim(:,j-1),'linear');       % V
st(k).Psim(:,j)      = interpn(ks, zs , squeeze(P(:,xid(j),:,sid(j))),        st(k).ksim(:,j-1),zsim(:,j-1),'linear');       % P

with more direct calls to griddedInterpolant, shown here:

F = griddedInterpolant(ks,zs,squeeze(kprime(:,xid(j),:,sid(j))), 'linear','none');
st(k).ksim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));

F = griddedInterpolant(ks,zs,squeeze(V(:,xid(j),:,sid(j))), 'linear','none');
st(k).Vsim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));

F = griddedInterpolant(ks,zs,squeeze(P(:,xid(j),:,sid(j))), 'linear','none');
st(k).Psim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));

Combining the binary search instead of find with the calls to griddedInterpolant instead of interpn reduces the total run time to 3.8 seconds, a nearly 8x improvement over initial run time.

Three further points of note:

1) I would recommend following the advice outlined in previous answers, specifically staying away from structures if possible and moving whatever you can out of the loops (specifically the ndgrid, as this most certainly only needs to be run one time).

2) I noticed that, when using nsims=5000, the total memory used during this script is near 2.5gigs. If this is near your system's total available memory, this could result in significant slowdown. In this case I would recommend performing smaller batches of calculations, saving out the results, and then proceeding with further calculations.

3)Finally, during my tests, using parfor was actually slower than using the standard for loop. I beleive this is because each individual loop is a relatively short operation in this instance, and the overhead associated with designating parallel jobs and managing the cluster workers outweighs the gains from parallel processing. This may not be the case if you are on a large computer cluster, but on my single (4-core) machine, parfor only resulted in slow-down. I would recommend testing each case for yourself,with your actual working code, after you have made the recommended changes above if you care to.

The full code changes I made are shown below, these do not include any changes in the use of structs or other small optimizations that would still be recommended.

load file

tic;

nsim = 500
T = 12; 
N = 1000;

searchVal=1;
il = 1;
iu = 1;

cumQx = cumsum(Qx);
cumQz = cumsum(Qz);
cumQs = cumsum(Qs);

sizeCumQZ = size(cumQz,1);

for k=1:nsim
st(k).ksim    = kstar*ones(N, T);
st(k).Vsim  = zeros(N,T);  
st(k).Psim = zeros(N,T);   
end

%was parfor
for k = 1:nsim    

    sysrand  = rand(T, 1);
    idiorand = rand(N, T);
    sigmarand = rand(T,1);

    xid = zeros(T, 1);
    zid = zeros(N, T);
    sid = zeros(T,1);
    xid(1) = 8;
    zid(:, 1) = 5;
    sid(1) = 1;
    % Initializing the simulation

    simx    = zeros(T,1);
    zsim    = ones(N,T)*zbar;
    simsx    = zeros(T,1);

    % Construct 3-D grid using 'ndgrid'
    [ks,zs] = ndgrid(kgrid,z);

    for j = 2:T
            sid(j) = find(cumQs(:, sid(j-1)) >= sigmarand(j), 1); 
            simsx(j-1) = sigmax(sid(j));

             xid(j) = find(cumQx(:, xid(j-1)) >= sysrand(j), 1); 
             simx(j-1) = x(xid(j));

             for n = 1:N
                 %perform binary search (same as your find function)
                 searchVal=idiorand(n, j);
                 il = 1;
                 iu = sizeCumQZ;
                 while (il+1<iu)
                     lw=floor((il+iu)/2); % split the upper index
                     if cumQz(lw,zid(n, j-1)) >= searchVal
                         iu=lw; % decrease lower_index_b (whose x value remains \geq to lower bound)
                     else
                         il=lw; % increase lower_index_a (whose x value remains less than lower bound)
                     end
                 end
                 if cumQz(il,zid(n, j-1))>=searchVal
                     zid(n,j) = il;
                 else
                     zid(n,j) = iu;
                 end
                 zsim(n,j-1) = z(zid(n,j));
             end

             F = griddedInterpolant(ks,zs,squeeze(kprime(:,xid(j),:,sid(j))), 'linear','none');
             st(k).ksim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));

             F = griddedInterpolant(ks,zs,squeeze(V(:,xid(j),:,sid(j))), 'linear','none');
             st(k).Vsim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));

             F = griddedInterpolant(ks,zs,squeeze(P(:,xid(j),:,sid(j))), 'linear','none');
             st(k).Psim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));

    end;

end
toc;

EDIT: Messing around with the griddedInterpolant a bit more, I was able to eek out an additional 15% speed increase by combining the three interpolants into one through concatenation of the grids and values for the K, V, and P interpolants. At the top of the code, best done outside the loop, I replaced the original grid creation with:

zRange = max(z(:))-min(z(:))+1;
zzzGrid = [z;z+1*zRange;z+2*zRange];% for K, V, and P 
[ksBig,zsBig] = ndgrid(kgrid,zzzGrid);
nZ = numel(z); %used below
valGrid = zeros(size(ksBig)); %used below

And replace the 3 calls to griddedInterpolant with:

valGrid(:,1:nZ) = squeeze(kprime(:,xid(j),:,sid(j)));%K
valGrid(:,nZ+1:2*nZ) = squeeze(V(:,xid(j),:,sid(j)));%V
valGrid(:,2*nZ+1:3*nZ) = squeeze(P(:,xid(j),:,sid(j)));%P

F = griddedInterpolant(ksBig,zsBig,valGrid, 'linear','none');

st(k).ksim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));
st(k).Vsim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1)+zRange);
st(k).Psim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1)+2*zRange);

Or if we are OK trading a more convoluted setup for a 19% increase instead of a 15% increase, we can move the griddedInterpolant outside of the j loop altogether. At the beginning of the code, setup the grid as such:

zRange = max(z(:))-min(z(:))+1;
zzzGrid = [z;z+1*zRange;z+2*zRange];
zzzRange =  max(zzzGrid(:))-min(zzzGrid(:))+1;
zzzTGrid = [];
for j = 2:T
   zzzTGrid(end+1:end+numel(zzzGrid)) = zzzGrid+(j-2)*zzzRange;
end
[ksBig,zsBig] = ndgrid(kgrid,zzzTGrid);
nZ = numel(z); %used below
valGrid = zeros(size(ksBig)); %used below

And replace the previous griddedInterpolant as shown below:

for j = 2:T
    %%%%%
    %...
    %Other code in the j loop
    %...
    %%%%%
    valGrid(:,(1:nZ)+3*nZ*(j-2)) = squeeze(kprime(:,xid(j),:,sid(j)));%K
    valGrid(:,(nZ+1:2*nZ)+3*nZ*(j-2)) = squeeze(V(:,xid(j),:,sid(j)));%V
    valGrid(:,(2*nZ+1:3*nZ)+3*nZ*(j-2)) = squeeze(P(:,xid(j),:,sid(j)));%P
end;

F = griddedInterpolant(ksBig,zsBig,valGrid, 'linear','none');

for j = 2:T
    st(k).ksim(:,j) = F(stTest(k).ksim(:,j-1),zsim(:,j-1)+3*zRange*(j-2));
    st(k).Vsim(:,j) = F(stTest(k).ksim(:,j-1),zsim(:,j-1)+zRange+3*zRange*(j-2));
    st(k).Psim(:,j) = F(stTest(k).ksim(:,j-1),zsim(:,j-1)+2*zRange+3*zRange*(j-2));
end

And if we want to get even pickier (2% speed increase), we can replace all calls to squeeze with a version that does no error checking, mySqueeze instead:

function b = mySqueeze(a)
%Trimmed down version of squeeze, a built-in MATLAB function, has no error-managment or case optimization
  siz = size(a);
  siz(siz==1) = []; % Remove singleton dimensions.
  siz = [siz ones(1,2-length(siz))]; % Make sure siz is at least 2-D
  b = reshape(a,siz);