I wish to OpenMP reduce a large array into a smaller dynamic array. For example, where
large[] = {1, 1, 1, 2, 2, 2, 3, 3, 3};
// OpenMP reduce length-3 sublists of large, to produce:
small[] = {3, 6, 9};
My requirements are similar to this question, but I have important additional constraints:
small
(like this answer), since small
can be just as large as large
, and hence thread-private copies may cause a stack overflowlarge
array must be iterated in order, for good caching (since it may be very large)Some other details:
large
are not necessarily adjacentsmall
(output) array is pre-allocated by the user, and passed by reference. Ideally, the function should not allocate a new local copy of small
small
and large
have lengths of a power of 2
(e.g. 2
, 4
, 8
, 16
...). Every element of small
is reduced from the same number of large
elements (another power of 2).Here is example serial psuedocode, for clarity:
void myfunc(double* small, double* large, int lLen, // other args) {
for (int i=0; i<lLen; i++) {
int sInd = // determined by other args
small[sInd] += large[i];
}
}
Here is an example implementation, disqualified since it makes use of OpenMP 4.5's array reduction. It furthermore undesirably makes use of a local copy of small
(as required to use reduction
).
void myfunc(double* small, int sLen, double* large, int lLen, // other args) {
double smallLocal[sLen];
int i, sInd;
#pragma omp parallel private (i) for
for (i=0; i<sLen; s++)
smallLocal[i] = 0;
#pragma omp parallel private (i, sInd) reduction (+:smallLocal) for
for (i=0; i<largeLen; i++) {
sInd = // determined by other args
smallLocal[sInd] += large[i];
}
#pragma omp parallel private (i) for
for (i=0; i<sLen; s++)
small[i] = smallLocal[i];
}
I'm hoping I can achieve the same thing in OpenMP 3.1, possibly avoiding even the local copy of smallLocal
, by managing the array element reductions myself. How can I do this?
You could use OpenMPs atomic update
construct, which is already present in OpenMP 3.1:
void myfunc(double* small, double* large, int lLen, // other args) {
#pragma omp parallel for
for (int i=0; i<lLen; i++) {
int sInd = // determined by other args
#pragma omp atomic update
small[sInd] += large[i];
}
}
This should be faster than using locks.