Tags: performance, parallel-processing, low-latency, chapel, parallelism-amdahl

CyclicDist goes slower on multiple locales


I tried implementing matrix multiplication using the CyclicDist module.

When I test with one locale versus two locales, the one-locale run is much faster. Is that because the communication time between the two Jetson Nano boards is so large, or is my implementation simply not taking advantage of how CyclicDist works?

Here is my code:

use Random, Time, CyclicDist;

var t: Timer;
t.start();

config const size = 10;
const Space = {1..size, 1..size};

const gridSpace = Space dmapped Cyclic(startIdx=Space.low);
var grid: [gridSpace] real;
fillRandom(grid);

const gridSpace2 = Space dmapped Cyclic(startIdx=Space.low);
var grid2: [gridSpace2] real;
fillRandom(grid2);

const gridSpace3 = Space dmapped Cyclic(startIdx=Space.low);
var grid3: [gridSpace3] real;

forall i in 1..size do {
    forall j in 1..size do {
        forall k in 1..size do {
            grid3[i,j] += grid[i,k] * grid2[k,j];
        }
    }
}

t.stop();
writeln("Done!");
writeln(t.elapsed(), " seconds");
writeln("Size of matrix was: ", size);
t.clear();

I know my implementation is not optimal for distributed memory systems.


Solution

  • Probably the main reason that this program is not scaling is that the computation never uses any locales other than the initial one. Specifically, forall loops over ranges, like the ones in your code:

    forall i in 1..size do
    

    always run all of their iterations using tasks executing on the current locale. This is because ranges are not distributed values in Chapel, so their parallel iterators do not distribute work across locales. Consequently, all size**3 executions of the loop body:

    grid3[i,j] += grid[i,k] * grid2[k,j];
    

    will run on locale 0 and none of them will run on locale 1. You can see that this is the case by putting the following into the innermost loop's body:

    writeln("locale ", here.id, " running ", (i,j,k));
    

    (where here.id is the ID of the locale on which the current task is running). This will show that locale 0 runs all of the iterations:

    locale 0 running (9, 1, 1)
    locale 0 running (1, 1, 1)
    locale 0 running (1, 1, 2)
    locale 0 running (9, 1, 2)
    locale 0 running (1, 1, 3)
    locale 0 running (9, 1, 3)
    locale 0 running (1, 1, 4)
    locale 0 running (1, 1, 5)
    locale 0 running (1, 1, 6)
    locale 0 running (1, 1, 7)
    locale 0 running (1, 1, 8)
    locale 0 running (1, 1, 9)
    locale 0 running (6, 1, 1)
    ...
    

    Contrast this with running a forall loop over a distributed domain like gridSpace:

    forall (i,j) in gridSpace do
      writeln("locale ", here.id, " running ", (i,j));
    

    where the iterations will be distributed between the locales:

    locale 0 running (1, 1)
    locale 0 running (9, 1)
    locale 0 running (1, 2)
    locale 0 running (9, 2)
    locale 0 running (1, 3)
    locale 0 running (9, 3)
    locale 0 running (1, 4)
    locale 1 running (8, 1)
    locale 1 running (10, 1)
    locale 1 running (8, 2)
    locale 1 running (2, 1)
    locale 1 running (8, 3)
    locale 1 running (10, 2)
    ...
    

    Since all of the computation runs on locale 0 while half of the data lives on locale 1 (because the arrays are distributed), a great deal of communication is generated to fetch remote values from locale 1's memory into locale 0's memory in order to compute on them.
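
    One way to get the work itself distributed is to drive the outer parallelism from the distributed domain rather than from ranges, so each locale computes the elements of grid3 that it owns while the reduction over k stays serial. This is a minimal sketch (assuming grid3 is declared over gridSpace3 as above), not a tuned implementation:

    forall (i,j) in gridSpace3 {
      var sum = 0.0;
      for k in 1..size do            // serial dot product for one output element
        sum += grid[i,k] * grid2[k,j];
      grid3[i,j] = sum;
    }

    Note that this only fixes where the computation runs; the reads of grid[i,k] and grid2[k,j] will still generate communication for whichever elements live on the other locale, so further restructuring (or a distribution better suited to matrix multiplication) would be needed for good distributed performance. If you want to measure how much communication a given version generates, the CommDiagnostics module can count remote operations (the exact routines available may vary with your Chapel version):

    use CommDiagnostics;

    startCommDiagnostics();
    // ... run the multiplication loop here ...
    stopCommDiagnostics();
    writeln(getCommDiagnostics());   // per-locale counts of remote gets/puts/executions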