
Filling an array using forall in Chapel


This is working fine on my laptop, but I'm wondering if this is going to cause problems at scale. Suppose I want to fill an array that is going to be very large, but each entry requires an intense matrix operation on a large, sparse, distributed matrix. Should I expect the following design to hold up?

var x: [1..N] real;

forall i in 1..N {
  x[i] = reallyHeavyMatrixComputation(i);
}

Are there tips for keeping this sane? Should I use a dmapped domain for x, or something along those lines?


Solution

  • Chapel's forall-loops can scale to multiple locales, but whether they will do so depends on what they iterate over as well as on the loop's body.

    In more detail, key parallel loop policies like "How many tasks should be used?" and "Where should those tasks run?" are controlled by the loop's iterand. For example, in the following loop:

    var x: [1..N] real;                        // declare a local array
    
    forall i in 1..N do                        // iterate over its indices in parallel
      x[i] = reallyHeavyMatrixComputation(i);
    

    the loop's iterand is the range 1..N. By default a range's iterator will create a number of local tasks equal to the number of processor units/cores on the current locale. As a result, the loop above would not get faster as more locales were utilized unless reallyHeavyMatrixComputation() itself contained on-clauses that distributed the computations across the locales in an intelligent way. If it did not contain any on-clauses at all, the computation would never leave locale #0 and would be shared-memory only.
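    As a sketch of what such an on-clause might look like (not necessarily the best design, and reallyHeavyMatrixComputation() is the question's placeholder), the loop body itself could migrate each iteration's work to a locale chosen round-robin:

    var x: [1..N] real;
    
    forall i in 1..N do                        // tasks are still created on locale #0...
      on Locales[(i-1) % numLocales] do        // ...but each iteration's body migrates
        x[i] = reallyHeavyMatrixComputation(i);
    

    Note that the tasks themselves still originate on locale #0 and the writes to x[i] become remote, so this is typically less efficient than distributing the data or the iterations themselves.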

    In contrast, if using a forall loop to iterate over a distributed domain or array, the default policy is typically to run a number of tasks on each target locale equal to the number of processor cores on that locale in an "owner-computes" fashion. That is, each locale will execute the subset of iterations it owns as determined by the distribution. For example, given the loop:

    use CyclicDist;                            // make use of the cyclic distribution module
    
    var D = {1..N} dmapped Cyclic(startIdx=1); // declare a cyclically distributed domain
    var x: [D] real;                           // declare an array over that domain
    
    forall i in D do                           // iterate over the domain in parallel
      x[i] = reallyHeavyMatrixComputation(i);
    

    D's indices will be distributed cyclically across the locales, and the parallel loop will create tasks on each locale that divide up and execute that locale's indices. As a result, this loop should scale across multiple locales, even if reallyHeavyMatrixComputation() is a completely local computation.
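    One way to convince yourself of this owner-computes behavior is to print here.id, the ID of the locale executing each iteration, from the loop body (a small self-contained sketch; output order across tasks is nondeterministic):

    use CyclicDist;                             // cyclic distribution module
    
    config const N = 8;
    var D = {1..N} dmapped Cyclic(startIdx=1);  // cyclically distributed domain
    
    forall i in D do
      writeln("iteration ", i, " runs on locale ", here.id);
    

    Run this with multiple locales (e.g., ./a.out -nl 4) and each index should be reported by the locale that owns it under the cyclic distribution.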

    Another way to write scalable parallel loops in Chapel is to invoke an explicit parallel iterator that will distribute the work across the locales itself. For example, Chapel version 1.16 added a distributed iterators package module that provides iterators that will distribute work for you. Returning to the first example, if it were re-written as:

    use DistributedIters;                  // make use of the distributed iterators module
    
    var x: [1..N] real;                    // declare a local array
    
    forall i in distributedDynamic(1..N) do    // distribute iterations across locales
      x[i] = reallyHeavyMatrixComputation(i);
    

    then the invocation of the distributedDynamic iterator, rather than the range 1..N, becomes the loop's iterand and controls task creation. This iterator dynamically deals iterations out to the locales using a specified chunk size (1 by default), so it can be used to make use of multiple locales in a scalable manner. See its documentation for further details.
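    For example, iterations could be dealt out in larger chunks to reduce coordination overhead, assuming the iterator's chunkSize argument as described in the module's documentation (check the exact name against your Chapel version), and again treating reallyHeavyMatrixComputation() as the question's placeholder:

    use DistributedIters;                  // distributed iterators module
    
    var x: [1..N] real;                    // declare a local array
    
    // deal iterations out to the locales in chunks of 4 rather than the default 1
    forall i in distributedDynamic(1..N, chunkSize=4) do
      x[i] = reallyHeavyMatrixComputation(i);
    

    Larger chunks amortize the cost of each remote hand-off; smaller chunks give better load balance when iteration costs vary widely, which is the scenario the question describes.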