Why is batch mode so much faster than parfor?

I am writing MATLAB code to evaluate a three-dimensional integral:

function [ fint ] = int3d_ser(R0, Rf, N)
Nr = N;
Nt = round(pi*N);
Np = round(2*pi*N);

rs = linspace(R0, Rf, Nr);
ts = linspace(0, pi, Nt);
ps = linspace(0, 2*pi, Np);

dr = rs(2)-rs(1);
dt = ts(2)-ts(1);
dp = ps(2)-ps(1);

C = 1/((4/3)*pi);
fint = 0.0;
for ir = 2:Nr
  r = rs(ir);
  r2dr = r*r*dr;            % radial volume factor r^2 dr
  for it = 1:Nt-1
    t = ts(it);
    sintdt = sin(t)*dt;     % polar factor sin(theta) dtheta
    for ip = 1:Np-1
      p = ps(ip);
      fint = fint + C*r2dr*sintdt*dp;
    end
  end
end


For the associated int3d_par (parfor) version, I open a MATLAB pool and simply replace the for with a parfor. I get a pretty decent speedup when I run it on more cores (my tests range from 2 to 8 cores).
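
For readers who want to reproduce the comparison, here is a minimal sketch of what such a parfor version might look like. It assumes the outer (radial) loop is the one converted, which the question does not state, and it collects each radial node's contribution in a temporary (acc) before adding it to the reduction variable fint. Without an open pool or the Parallel Computing Toolbox, the parfor simply runs serially.

```matlab
function fint = int3d_par(R0, Rf, N)
% Sketch only: assumes the OUTER loop of int3d_ser is the one made parfor.
Nr = N;
Nt = round(pi*N);
Np = round(2*pi*N);

rs = linspace(R0, Rf, Nr);
ts = linspace(0, pi, Nt);
ps = linspace(0, 2*pi, Np);

dr = rs(2)-rs(1);
dt = ts(2)-ts(1);
dp = ps(2)-ps(1);

C = 1/((4/3)*pi);
fint = 0.0;
parfor ir = 2:Nr
  r = rs(ir);
  r2dr = r*r*dr;
  acc = 0.0;                      % temporary, private to this iteration
  for it = 1:Nt-1
    sintdt = sin(ts(it))*dt;
    for ip = 1:Np-1
      acc = acc + C*r2dr*sintdt*dp;
    end
  end
  fint = fint + acc;              % fint acts as a parfor reduction variable
end
end
```

As a sanity check, the integrand here is constant in phi and the exact integral is Rf^3 - R0^3, so int3d_par(0, 1, N) should approach 1 as N grows.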

However, when I run the same integration in batch mode with:

function [fint] = int3d_batch_cluster(R0, Rf, N, cluster, ncores)

%%% note: This will not give back the same value as the serial or parpool version.
%%%       If this was a legit integration, I would worry more about even dispersion
%%%       of integration nodes per core, but I just want to benchmark right now so ... meh

Nr = N;
Nt = round(pi*N);
Np = round(2*pi*N);

rs = linspace(R0, Rf, Nr);
ts = linspace(0, pi, Nt);
ps = linspace(0, 2*pi, Np);

dr = rs(2)-rs(1);
dt = ts(2)-ts(1);
dp = ps(2)-ps(1);

C = 1/((4/3)*pi);

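rns = floor( Nr/ncores )*ones(ncores,1);
RNS = zeros(ncores,1);
% hand out the leftover radial nodes one at a time until the counts sum to Nr
for icore = 1:ncores
  if(sum(rns) ~= Nr)
    rns(icore) = rns(icore)+1;
  end
end
% cumulative counts: RNS(i) is the last radial index belonging to chunk i
RNS(1) = rns(1);
for icore = 2:ncores
  RNS(icore) = RNS(icore-1)+rns(icore);
end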

rfs = rs(RNS);
r0s = zeros(ncores,1);
r0s(2:end) = rfs(1:end-1);

j = createJob(cluster);

for icore = 1:ncores
  r0 = r0s(icore);
  rf = rfs(icore);
  rn = rns(icore);
  trs = linspace(r0, rf, rn);
  t{icore} = createTask(j, @int3d_ser, 1, {r0, rf, rn});
end

% the job has to be submitted and finished before its outputs can be fetched
submit(j);
wait(j);
fints = fetchOutputs(j);

fint = 0.0;
for ifint = 1:length(fints)
  fint = fint + fints{ifint};
end


I notice that it is much, much faster. Why would doing this integration in batch mode be different than doing it in parfor?

For reference, I test the code with N ranging from small values like 10 and 20 (to get the constant in the polynomial approximation of the runtime) to larger values like 1000 and 2000. This algorithm scales cubically, since I set the number of integration nodes in the theta and phi directions to a constant multiple of the given N.
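
To make the cubic scaling concrete: the innermost statement of int3d_ser runs once per (ir, it, ip) triple, so the iteration count can be read off the loop bounds. This is a back-of-the-envelope estimate that ignores the rounding in Nt and Np:

```latex
\begin{aligned}
\text{iterations}(N) &= (N_r - 1)(N_t - 1)(N_p - 1) \\
                     &\approx N \cdot \pi N \cdot 2\pi N = 2\pi^2 N^3 \approx 19.7\,N^3
\end{aligned}
```

For N = 2000 this comes to roughly 1.6e11 inner iterations. By the same count, a call int3d_ser(r0, rf, rn) with rn of about N/ncores, as each task in int3d_batch_cluster issues, would perform roughly 2*pi^2*(N/ncores)^3 iterations, which may be worth keeping in mind when comparing the two timings.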

For 2000 nodes, the parfor version takes about 630 seconds, while the same number of nodes in batch mode takes about 19 seconds (of which around 12 seconds is simply communication overhead that we also see with 10 integration nodes).


  • After speaking with MathWorks support, it appears I had a fundamental misunderstanding of how parfor works. I was under the impression that parfor acted like OpenMP, whereas batch mode acted like MPI, in terms of shared vs. distributed memory.

    It turns out that parfor actually uses distributed memory as well. When I am creating, say, 4 batch functions, the overhead for creating a new process is happening 4 times. I thought that using a parfor would cause that overhead to happen just 1 time and that the parfor would then take place in the same memory space. This is not the case.

    In my example code, it turns out that for each iteration of the parfor, I am actually incurring the overhead of creating a new thread. To compare 'apples to apples', I should really be creating the same number of batch calls as there are iterations in the parfor loop. This is why the parfor function was taking so much longer: I was incurring much more multiprocessing overhead.
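
One way to act on this is to give parfor the same coarse granularity as the batch job: one loop iteration per chunk of radial nodes rather than one per node. The sketch below is an illustration under that assumption, not the code from the question. The names int3d_par_chunked, radial_block and nchunks are hypothetical. Unlike int3d_batch_cluster, it keeps the full angular grid for every chunk, so it should reproduce the serial result up to floating-point summation order.

```matlab
function fint = int3d_par_chunked(R0, Rf, N, nchunks)
% Hypothetical variant: parfor over nchunks blocks of radial nodes.
Nr = N;
Nt = round(pi*N);
Np = round(2*pi*N);

rs = linspace(R0, Rf, Nr);
ts = linspace(0, pi, Nt);
ps = linspace(0, 2*pi, Np);

dr = rs(2)-rs(1);
dt = ts(2)-ts(1);
dp = ps(2)-ps(1);

C = 1/((4/3)*pi);

% Block ic covers radial indices edges(ic)+1 .. edges(ic+1), i.e. 2..Nr overall.
edges = round(linspace(1, Nr, nchunks+1));

fint = 0.0;
parfor ic = 1:nchunks
  idx = (edges(ic)+1):edges(ic+1);   % may be empty if nchunks > Nr-1
  fint = fint + radial_block(rs(idx), ts, ps, dr, dt, dp, C);
end
end

function s = radial_block(rblock, ts, ps, dr, dt, dp, C)
% Same triple loop as int3d_ser, restricted to the radii in rblock.
s = 0.0;
for ir = 1:numel(rblock)
  r2dr = rblock(ir)^2*dr;
  for it = 1:numel(ts)-1
    sintdt = sin(ts(it))*dt;
    for ip = 1:numel(ps)-1
      s = s + C*r2dr*sintdt*dp;
    end
  end
end
end
```

Calling int3d_par_chunked(R0, Rf, N, ncores) then issues as many parfor iterations as the batch version creates tasks, which makes the overhead of the two approaches easier to compare.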