Search code examples
copenmpscalarreduction

How to use reduction vector variable in OpenMP?


I want to use the reduction directive in openmp, but it's not working. Compiler error says:

"reduction: OpenMP 'parallel for' empty factor in directive" (Visual studio 2015 community)

or

"reduction:^ needed scalar variable"

This is my code (Byte is unsigned char)

void RotXOR (const Byte *s, int n, Byte *t)
{
    int i = 0, q;
    q = n / 8; n %= 8;

    #pragma omp parallel for private(i) reduction()
    for (i = 0; i < 16; i++) {
        t[(q + i) % 16] ^= (s[i] >> n);

        if (n != 0) {
            t[(q + i + 1) % 16] ^= (s[i] << (8 - n));
        }
    }
}

Solution

  • From OpenMP 4.0 standard p171 for C/C++:

    Arrays may not appear in a reduction clause.

    So the only way of doing that would be to create a local per-thread "tt" array initialised to 0, to compute on top of it and to update t with tt atomically upon exit of the parallel section.

    But irrespective of that, since your loop trip count is only 16, the parallelisation overhead would be far greater than any potential gain, so from my standpoint, this is just a dead end.


    EDIT: this is what I had in mind:

    void RotXOR( const Byte *s, int n, Byte *t ) {
        int q = n / 8;
        n %= 8;
    
        #pragma omp parallel
        {
            int tt[] = { 0, 0, 0, 0,
                         0, 0, 0, 0,
                         0, 0, 0, 0,
                         0, 0, 0, 0 };
            #pragma omp for
            for ( int i = 0; i < 16; i++ ) {
                tt[( q + i ) % 16] ^= ( s[i] >> n );
                if ( n != 0 ) {
                    tt[( q + i + 1 ) % 16] ^= ( s[i] << ( 8 - n ) );
                }
            }
            #pragma omp critical
            for ( int i = 0; i < 16; i++ ) {
                t[i] ^= tt[i];
            }
        }
    }
    

    And I said I don't expect much performance improvement (if any) because the trip count being very small, not much work can be distributed amongst threads to hide the overhead of the thread management, and the sequential final reduction.

    While writing this solution, another one came to my mind, but I don't know which of the two would perform best... I suspect that the second version will be even worse than the first due to heavy synchronisation overhead and false sharing of t, but I'm not sure...

    void RotXOR( const Byte *s, int n, Byte *t ) {
        int q = n / 8;
        n %= 8;
    
        #pragma omp parallel for
        {
            for ( int i = 0; i < 16; i++ ) {
                int idx = ( q + i ) % 16;
                int val = s[i] >> n;
                #pragma omp atomic
                t[idx] ^= val;
                if ( n != 0 ) {
                    idx = ( q + i + 1 ) % 16;
                    val = s[i] << ( 8 - n );
                    #pragma omp atomic
                    t[idx] ^= val;
                }
            }
        }
    }
    

    And finally, since the value of n is known upon entry, I guess that removing the if statement from the loop would be a good idea, even if it implies writing a bit more code.