Designing Parallel For Loops to Enforce Code Execution Order

I have a general question about how to structure parallel for loops when designing programs for a parallel system.

Say I have two commands, foo1() and foo2(), that I want to execute for each element of an array in parallel. It is important that foo2() does NOT execute at all until EVERY element been operated on by foo1().

Step 1: call foo1() on all elements

Step 2: When all step 1 complete, call foo2() on all elements

My assumption is that, to ensure foo1() completes on every single element before foo2() begins, I must place the two functions in separate parfor loops:

parfor(each element n in array){      //step 1
   foo1(n);
}
parfor(each element n in array){      //step 2
   foo2(n);
}

However this assumption may be wrong. It may be possible to achieve the same effect with this:

parfor(each element n in array){      
   foo1(n);                          //step 1
   foo2(n);                          //step 2
}

My concern about this second implementation is that individual processors may move on to foo2() before all of foo1() ops have completed.

Of the two implementations above, will both achieve the same affect? Or am I correct in assuming that only the first one with two separate parfor loops will?

Thank you very much for your input.

Solution

You are correct. The second implementation does not ensure that foo1() completes on every single element before foo2() begins.

if the numbers of cores (or threads) is smaller than n, you can be sure that your desideratum won't hold. Even if the number of cores is equal-or-larger than n it is very likely that one of the cores will start foo2() before all other cores completed foo1().

BTW, you may try to do something like

foo1(n) { write(f, "foo1 begin", n); do_some_math; write(f, "foo1 end", n); }
foo2(n) { write(f, "foo2 begin", n); do_some_math; write(f, "foo2 end", n); }

("correct" output would guarantee nothing, but "incorrect" output would prove that an implementation is "wrong").