OpenMP reduce array without thread-local copies

I wish to OpenMP reduce a large array into a smaller dynamic array. For example, where

large[] = {1, 1, 1, 2, 2, 2, 3, 3, 3};

// OpenMP reduce length-3 sublists of large, to produce:

small[] = {3, 6, 9};

My requirements are similar to this question, but I have important additional constraints:

must support OpenMP 3.1 (so I cannot use OpenMP 4.5's array reductions, as this answer does)
cannot have a thread-private copy of small (like this answer), since small can be just as large as large, and hence thread-private copies may cause a stack overflow
large array must be iterated in order, for good caching (since it may be very large)

Some other details:

the summed elements of large are not necessarily adjacent
this reduction happens within a function. The small (output) array is pre-allocated by the user, and passed by reference. Ideally, the function should not allocate a new local copy of small
both small and large have lengths of a power of 2 (e.g. 2, 4, 8, 16 ...). Every element of small is reduced from the same number of large elements (another power of 2).

Here is example serial psuedocode, for clarity:

void myfunc(double* small, double* large, int lLen, // other args) {

    for (int i=0; i<lLen; i++) {
    
        int sInd = // determined by other args
        
        small[sInd] += large[i];
    }
}

Here is an example implementation, disqualified since it makes use of OpenMP 4.5's array reduction. It furthermore undesirably makes use of a local copy of small (as required to use reduction).

void myfunc(double* small, int sLen, double* large, int lLen, // other args) {

    double smallLocal[sLen];
    int i, sInd;

    #pragma omp parallel private (i) for
    for (i=0; i<sLen; s++)
        smallLocal[i] = 0;
    
    #pragma omp parallel private (i, sInd) reduction (+:smallLocal) for
    for (i=0; i<largeLen; i++) {

        sInd = // determined by other args
        
        smallLocal[sInd] += large[i];
    }

    #pragma omp parallel private (i) for
    for (i=0; i<sLen; s++)
        small[i] = smallLocal[i];
}

I'm hoping I can achieve the same thing in OpenMP 3.1, possibly avoiding even the local copy of smallLocal, by managing the array element reductions myself. How can I do this?

Solution

You could use OpenMPs atomic update construct, which is already present in OpenMP 3.1:

void myfunc(double* small, double* large, int lLen, // other args) {

    #pragma omp parallel for
    for (int i=0; i<lLen; i++) {
    
        int sInd = // determined by other args
        #pragma omp atomic update
        small[sInd] += large[i];
    }
}

This should be faster than using locks.