Tags: c, multithreading, performance, openmp, grand-central-dispatch

Apple's Dispatch vs OpenMP to parallelize a for loop on an Apple MacBook Pro with M3 Pro


I am writing a program in C that takes in an array of size 2^N and swaps entries whose indices differ by one bit at a specified position in the indices' binary representations. Profiling the code, I noticed that by default it uses only one thread, which makes it very slow. I tried OpenMP and also Apple's Dispatch (formerly Grand Central Dispatch) to parallelize my code, but I am a bit disappointed with the results. Is there someone more experienced with these frameworks who can help me out? Is Metal something I should look into?

My idea for the parallelized version of my code is to iterate through all integers up to 2^(N-1) - 1, insert a zero at the specified position (counted from the right) to get the first index, and swap its entry with the one whose index is 2^pos greater. For example, with pos = 1, i = 0b101 becomes 0b1001, whose entry is swapped with the entry at 0b1011. The version I implemented using OpenMP looks like this:

#include <complex.h>
#include <stdint.h>

void swapEntries(double complex* array, uint64_t N, uint64_t pos) {
    uint64_t swapDistance = UINT64_C(1) << pos;  // 2^pos is the distance between entries to swap
    uint64_t indices = UINT64_C(1) << (N - 1);   // 2^(N-1) index pairs need to be swapped

#pragma omp parallel for default(none) shared(array, indices, pos, swapDistance)
    for (uint64_t i = 0; i < indices; ++i) {
        uint64_t left = (i & (~UINT64_C(0) << pos)) << 1;   // Copy the bits left of pos and shift
                                                            // them one position to the left
        uint64_t right = i & ~(~UINT64_C(0) << pos);        // Copy the bits right of pos
        uint64_t j = left | right;                          // Merge the two parts (now with a
                                                            // zero inserted at pos)

        double complex tmp = array[j];                  // Swap the entry with the one
        array[j] = array[j + swapDistance];             // swapDistance away
        array[j + swapDistance] = tmp;
    }
}

On an instance with N = 30, compiled with Apple's clang at -O2, it runs nearly twice as fast as the serial version. I find this quite sobering, given that my MacBook Pro with an M3 Pro has 11 cores (even if 6 of them are efficiency cores). As a next step I tried Apple's Dispatch library, using the simple loop conversion given on Wikipedia and also mentioned in the docs.

#include <complex.h>
#include <stdint.h>
#include <dispatch/dispatch.h>

void swapEntries(double complex* array, uint64_t N, uint64_t pos) {
    uint64_t swapDistance = UINT64_C(1) << pos;  // 2^pos is the distance between entries to swap
    uint64_t indices = UINT64_C(1) << (N - 1);   // 2^(N-1) index pairs need to be swapped

    /* Get a concurrent dispatch queue */
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    dispatch_apply(indices, queue, ^(size_t i) {
        uint64_t left = (i & (~UINT64_C(0) << pos)) << 1;   // Copy the bits left of pos and shift
                                                            // them one position to the left
        uint64_t right = i & ~(~UINT64_C(0) << pos);        // Copy the bits right of pos
        uint64_t j = left | right;                          // Merge the two parts (now with a
                                                            // zero inserted at pos)

        double complex tmp = array[j];                  // Swap the entry with the one
        array[j] = array[j + swapDistance];             // swapDistance away
        array[j + swapDistance] = tmp;
    });
}

Both of these functions do exactly what they are supposed to; I have unit tests to check this. However, the Dispatch implementation is significantly slower than the serial version, and I am pretty sure that this is my fault. Can you please help me?


Solution

  • OpenMP’s #pragma omp parallel for takes the iterations of the for loop and divides them into “contiguous non-empty subsets, called chunks”. (See section 11.5.3 of the OpenMP Application Programming Interface. This is also discussed in the ninth video of the Introduction to OpenMP training series.) Chunking maximizes the amount of work done on each thread and minimizes the modest overhead introduced by parallelism.

    GCD does not do this for you; each iteration of dispatch_apply is dispatched separately. If you want to chunk the work in GCD, you have to do it yourself: stride through the indices of the for loop and then, inside the dispatch_apply block, iterate through that chunk’s subset of indices yourself.

    The net effect is that OpenMP will group the work into the minimum number of chunks for the desired number of threads, whereas in your GCD example, in the absence of striding, there is an extraordinary number of separate dispatches, with the dispatch overhead quickly offsetting any benefits offered by parallelism.


    In GCD, we remedy this by “striding”, increasing the amount of work performed on each thread. Right now, each iteration of dispatch_apply (known as concurrentPerform in Swift) performs a negligible amount of work, and in GCD especially, if each iteration does very little work, the modest threading overhead can outweigh any performance benefits from multi-threading.

    For example, consider an N of 30. That will result in 536 million (2^(N-1)) dispatches in the dispatch_apply example. Yes, dispatch_apply will mitigate the thread-explosion risk (limiting the number of active threads to the number of cores on the device), but it will not “chunk” the work for you. You have to do this yourself; a sketch follows at the end of this answer.

    For more information about striding, see the “Improving on Loop Code” section of Apple’s legacy Concurrency Programming Guide, which says:

    Listing 5-3 shows how to replace a for loop with its dispatch-based equivalent. The block or function you pass to dispatch_apply or dispatch_apply_f must take an integer value indicating the current loop iteration. In this example, the code simply prints the current loop number to the console.

    Listing 5-3 Replacing a for loop without striding

    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply(count, queue, ^(size_t i) {
        printf("%u\n", i);
    });
    

    Although the preceding example is a simplistic one, it demonstrates the basic techniques for replacing a loop using dispatch queues. And although this can be a good way to improve performance in loop-based code, you must still use this technique discerningly. Although dispatch queues have very low overhead, there are still costs to scheduling each loop iteration on a thread. Therefore, you should make sure your loop code does enough work to warrant the costs. Exactly how much work you need to do is something you have to measure using the performance tools.

    A simple way to increase the amount of work in each loop iteration is to use striding. With striding, you rewrite your block code to perform more than one iteration of the original loop. You then reduce the count value you specify to the dispatch_apply function by a proportional amount. Listing 5-4 shows how you might implement striding for the loop code shown in Listing 5-3. In Listing 5-4, the block calls the printf statement the same number of times as the stride value, which in this case is 137. (The actual stride value is something you should configure based on the work being done by your code.) Because there is a remainder left over when dividing the total number of iterations by a stride value, any remaining iterations are performed inline.

    Listing 5-4 Adding a stride to a dispatched for loop

    int stride = 137;
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    
    dispatch_apply(count / stride, queue, ^(size_t idx){
        size_t j = idx * stride;
        size_t j_stop = j + stride;
        do {
           printf("%u\n", (unsigned int)j++);
        } while (j < j_stop);
    });
    
    size_t i;
    for (i = count - (count % stride); i < count; i++)
       printf("%u\n", (unsigned int)i);
    

    There are some definite performance advantages to using strides. In particular, strides offer benefits when the original number of loop iterations is high, relative to the stride. Dispatching fewer blocks concurrently means that more time is spent executing the code of those blocks than dispatching them. As with any performance metric though, you may have to play with the striding value to find the most efficient value for your code.

    Their example is not perfect, but it illustrates the idea.
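
    Applied to the swapEntries function from the question, a chunked dispatch_apply version might look like the sketch below. The chunk count of 64 is an assumption to be tuned by measurement, not a value taken from Apple's guide; because the iteration count 2^(N-1) is a power of two, any power-of-two chunk count divides it evenly, so no remainder loop is needed (unlike Listing 5-4).

    #include <complex.h>
    #include <stdint.h>
    #include <dispatch/dispatch.h>

    void swapEntriesChunked(double complex* array, uint64_t N, uint64_t pos) {
        uint64_t swapDistance = UINT64_C(1) << pos;  // 2^pos is the distance between entries to swap
        uint64_t indices = UINT64_C(1) << (N - 1);   // 2^(N-1) index pairs need to be swapped

        // Assumed tuning parameter: a power of two, so it divides `indices` evenly.
        size_t chunkCount = (indices >= 64) ? 64 : 1;
        uint64_t chunkSize = indices / chunkCount;

        dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        // One dispatch per chunk rather than one dispatch per swap.
        dispatch_apply(chunkCount, queue, ^(size_t chunk) {
            uint64_t start = (uint64_t)chunk * chunkSize;
            uint64_t end = start + chunkSize;
            for (uint64_t i = start; i < end; ++i) {
                uint64_t left = (i & (~UINT64_C(0) << pos)) << 1;   // Bits left of pos, shifted left
                uint64_t right = i & ~(~UINT64_C(0) << pos);        // Bits right of pos
                uint64_t j = left | right;                          // Index with a zero inserted at pos

                double complex tmp = array[j];                  // Swap the entry with the one
                array[j] = array[j + swapDistance];             // swapDistance away
                array[j + swapDistance] = tmp;
            }
        });
    }

    With each block now performing indices / 64 swaps instead of one, the cost of each dispatch is amortized over millions of iterations (for N = 30, roughly 8.4 million swaps per chunk). The best chunk count is something to measure rather than guess; a small multiple of the machine's core count is a common starting point.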