Search code examples
c++coptimizationcompiler-optimizationmicro-optimization

How can I accelerate this code (MWE!), e.g. using restrict


Is there any way I can accelerate this function:

void task(int I, int J, int K, int *L, int **ids, double *bar){ 
    double *foo[K];
    for (int k=0;k<K;k++)
        foo[k] = new double[I*L[k]];
        // I am filling these arrays somehow
        // This is not a bottleneck, hence omitted here        
    for (int i=0;i<I;i++)
        for (int j=0;j<J;j++){
            double tmp = 1.;
            for (int k=0;k<K;k++)
                tmp *= foo[k][i*L[k]+ids[j][k]]; //ids[j][k]<L[k]
            bar[i*J+j] = tmp;
        }
}

Typical values are: I = 100,000, J = 10,000, K=3, L=[50,20,60].

I read that the __restrict__ keyword/extension could help, but am not sure how to apply it here. For example, trying to put it into the definition of foo[k] = new double[...] I get error: '__restrict_ qualifiers cannot be applied to double. Furthermore, I don't know whether I should / how I could declare ids and ids[j], 1<= j<= J as restricted.

As a note, in my actual code, I execute as such tasks in parallel in as many threads as my CPU has cores.

I am writing mostly C-compatible C++, so solutions in both languages are welcome.


Solution

  • https://en.cppreference.com/w/c/language/restrict claims you can declare an array of restrict pointers to double like so in C99/C11:

    typedef double *array_t[10];
    restrict array_t foo;        // the type of a is double *restrict[10]
    

    But only gcc accepts that. I think this is a GCC-ism, not valid ISO C11. (gcc also accepts
    array_t restrict foo_r; but no other compilers accept that either.)

    ICC warns "restrict" is not allowed, clang rejects it with

    <source>:16:5: error: restrict requires a pointer or reference ('array_t' (aka 'double *[10]') is invalid)
        restrict array_t foo_r;
        ^
    

    MSVC rejects it with error C2219: syntax error: type qualifier must be after '*'

    We get essentially the same behaviour in C++ from these compilers with __restrict, which they accept as a C++ extension with the same semantics as C99 restrict.


    As a workaround, you can instead use a qualified temporary pointer every time you read from foo, instead of f[k][stuff]. I think this promises that the memory you reference through fk isn't the same memory you access through any other pointers within the block where fk is declared.

    double *__restrict fk = foo[k];
    tmp *= fk[ stuff ];
    

    I don't know how to promise the compiler that none of the f[0..K-1] pointers alias each other. I don't think this accomplishes that.


    You don't need __restrict here.

    I added __restrict to all the pointer declarations, like int *__restrict *__restrict ids and it doesn't change the asm at all, according to a diff pane on the Godbolt compiler explorer: https://godbolt.org/z/4YjlDA. As we'd expect because type-based aliasing lets the compiler assume that a double store into bar[] doesn't modify any of the int * elements of int *ids[]. As people said in comments, there's no aliasing here that the compiler can't already sort out. And in practice it appears that it does sort it out, without any extra reloads of pointers.

    It also can't alias *foo[k], because we got those pointers with new inside this function. They can't be pointing inside bar[].

    (All the major x86 C++ compilers (GCC,clang,ICC,MSVC) support __restrict in C++ with the same behaviour as C99 restrict: a promise to the compiler that stores through this pointer don't modify objects that are pointed to by another pointer. I'd recommend __restrict over __restrict__, at least if you mostly want portability across x86 compilers. I'm not sure about outside of that.)

    It looks like you're saying you tried to put __restrict__ into an assignment, not a declaration. That won't work, it's the pointer variable itself that __restrict applies to, not a single assignment to it.


    The first version of the question had a bug in the inner loop: it had K++ instead of k++, so it was pure undefined behaviour and the compilers got weird. The asm didn't make any sense (e.g. no FP multiply instruction, even when foo[] was a function arg). This is why it's a good idea to use a name like klen instead of K for an array dimension.

    After fixing that on the Godbolt link, there's still no difference in the asm with / without __restrict on everything, but it's a lot more sane.

    BTW, making double *foo[] a function arg would let us look at the asm for just the main loop. And you would actually need __restrict because a store to bar[] could modify an element of foo[][]. This doesn't happen in your function because the compiler knows that new memory isn't pointed-to by any existing pointers, but it wouldn't know that if foo was a function arg.


    There's a small amount of the work inside the loop is sign-extending 32-bit int results before using them as array indices with 64-bit pointers. This adds a cycle of latency in there somewhere, but not the loop-carried FP multiply dependency chain so it may not matter. You can get rid of one instruction inside the inner loop on x86-64 by using size_t k=0; as the inner-most loop counter. L[] is a 32-bit array, so i*L[k] needs to be sign-extended inside the loop. Zero-extension from 32 to 64-bit happens for free on x86-64, so i * (unsigned)L[k] saves a movsx instruction in the pointer-chasing dep chain. Then the inner loop that gcc8.2 makes is all necessary work, required by your nasty data structures / layout. https://godbolt.org/z/bzVSZ7

    I don't know whether that's going to make a difference or not. I think more likely the memory access pattern causing cache misses will be your bottleneck with real data.

    It also can't auto-vectorize because the data isn't contiguous. You can't get contiguous source data from looping over j or i, though. At least i would be a simple stride without having to redo ids[j][k].

    If you generate foo[k][...] and bar[...] transposed, so you index with foo[k][ i + L[k] * ids[j][k] ], then you'd have contiguous memory in src and dst so you (or the compiler) could use SIMD multiplies.