Tags: c, algorithm, performance, cpu-cache

Cache-friendly copying of an array with readjustment by known index, gather, scatter


Suppose we have an array of data and another array with indexes.

data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]

We want to create a new array by placing each element of data at the position given by the corresponding entry of index (i.e. result[index[i]] = data[i]). The result should be

[4, 2, 5, 7, 3, 1]

The naive algorithm runs in O(N) but performs random memory accesses.
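
For concreteness, this is a minimal sketch of the naive loop being referred to (the scatter form result[index[i]] = data[i] matches the expected output above; the names are mine):

/* Naive O(N) scatter: place data[i] at position index[i] in result.
   Reads are sequential; each write goes to a random cache line. */
void reorder_naive(const int *data, const unsigned *index, int *result, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        result[index[i]] = data[i];
}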

Can you suggest a CPU-cache-friendly algorithm with the same complexity?

PS In my particular case, all elements in the data array are integers.

PPS Arrays might contain millions of elements.

PPPS I'm OK with SSE/AVX or any other x64-specific optimizations.


Solution

  • Combine index and data into a single array of (index, value) pairs. Then sort these pairs by index with a cache-friendly sorting algorithm, and finally strip the indexes. (You could fold the pairing/stripping of indexes into the first/last pass of the sorting algorithm to optimize this a little.) With the question's arrays, the pairs (5,1), (1,2), (4,3), (0,4), (2,5), (3,7) sorted by index become (0,4), (1,2), (2,5), (3,7), (4,3), (5,1), and their values 4, 2, 5, 7, 3, 1 are exactly the expected result.

    For cache-friendly O(N) sorting, use radix sort with a small enough radix (at most half the number of cache lines in the CPU cache).
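
    As a back-of-the-envelope illustration of that bound (using the 6 MB cache and 64-byte lines mentioned below; the concrete numbers here are mine, not from the benchmark):

    /* Illustrative only: upper bound on the radix for a 6 MB cache with
       64-byte lines, i.e. half the number of cache lines. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned cache_bytes = 6 * 1024 * 1024;           /* 6 MB L3      */
        const unsigned line_bytes  = 64;                        /* cache line   */
        const unsigned cache_lines = cache_bytes / line_bytes;  /* 98304 lines  */
        printf("radix upper bound: %u\n", cache_lines / 2);     /* 49152        */
        return 0;
    }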

    Here is a C implementation of the radix-sort-like algorithm:

    /* Reorders g_input into g_output so that g_output[g_index[i]] == g_input[i],
       using two cache-friendly distribution passes over (index, value) pairs.
       g_counters, g_index, g_input, g_sort and g_output are globals defined in
       the benchmark code (see the sketch below). */
    void reorder2(const unsigned size)
    {
        const unsigned min_bucket = size / kRadix;
        const unsigned large_buckets = size % kRadix;
        g_counters[0] = 0;
    
        /* First-pass bucket sizes are known in advance because index is a
           permutation of 0..size-1: the first (size % kRadix) buckets hold one
           element more than the rest. */
        for (unsigned i = 1; i <= large_buckets; ++i)
            g_counters[i] = g_counters[i - 1] + min_bucket + 1;
    
        for (unsigned i = large_buckets + 1; i < kRadix; ++i)
            g_counters[i] = g_counters[i - 1] + min_bucket;
    
        /* First pass: distribute pairs by the low digit (index % kRadix),
           keeping only the high digit (index / kRadix) for the second pass. */
        for (unsigned i = 0; i < size; ++i)
        {
            const unsigned dst = g_counters[g_index[i] % kRadix]++;
            g_sort[dst].index = g_index[i] / kRadix;
            g_sort[dst].value = g_input[i];
            __builtin_prefetch(&g_sort[dst + 1].value, 1);
        }
    
        g_counters[0] = 0;
    
        /* Second-pass buckets (one per value of index / kRadix) hold exactly
           kRadix elements each, except possibly the last one. */
        for (unsigned i = 1; i < (size + kRadix - 1) / kRadix; ++i)
            g_counters[i] = g_counters[i - 1] + kRadix;
    
        /* Second pass: distribute by the high digit; elements arrive already
           ordered by the low digit, so each value lands at its original index. */
        for (unsigned i = 0; i < size; ++i)
        {
            const unsigned dst = g_counters[g_sort[i].index]++;
            g_output[dst] = g_sort[i].value;
            __builtin_prefetch(&g_output[dst + 1], 1);
        }
    }
    

    It differs from an ordinary radix sort in two respects: (1) it does not do counting passes, because all counters are known in advance; (2) it avoids power-of-2 values for the radix.

    This C++ code was used for benchmarking (if you want to run it on a 32-bit system, slightly decrease the kMaxSize constant).
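
    Since the benchmark code itself is not reproduced here, below is a minimal sketch of the supporting declarations and a driver for reorder2. The kRadix and kMaxSize values, the Pair type name, and the test data are placeholders of my own, not taken from the benchmark; the declarations should be placed above the reorder2 definition and the whole thing compiled as one unit.

    #include <stdio.h>
    #include <stdlib.h>

    /* Placeholder constants: the real benchmark picks kRadix from the cache
       size and uses a non-power-of-2 value. */
    enum { kRadix = 1000, kMaxSize = 1 << 20 };

    typedef struct { unsigned index; unsigned value; } Pair;

    /* Globals used by reorder2; sized for up to kMaxSize elements. */
    static unsigned g_counters[kMaxSize / kRadix + kRadix];
    static unsigned g_index[kMaxSize];
    static unsigned g_input[kMaxSize];
    static Pair     g_sort[kMaxSize];
    static unsigned g_output[kMaxSize];

    /* reorder2() from above goes here. */

    int main(void)
    {
        const unsigned size = 1000000;

        /* Sequential data and a pseudo-random permutation as the index array. */
        for (unsigned i = 0; i < size; ++i) {
            g_input[i] = i + 1;
            g_index[i] = i;
        }
        srand(42);
        for (unsigned i = size - 1; i > 0; --i) {   /* Fisher-Yates shuffle */
            const unsigned j = (unsigned)rand() % (i + 1);
            const unsigned t = g_index[i];
            g_index[i] = g_index[j];
            g_index[j] = t;
        }

        reorder2(size);

        /* The result must match the naive scatter: g_output[g_index[i]] == g_input[i]. */
        for (unsigned i = 0; i < size; ++i)
            if (g_output[g_index[i]] != g_input[i]) {
                printf("mismatch at %u\n", i);
                return 1;
            }
        printf("ok\n");
        return 0;
    }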

    Here are the benchmark results (on a Haswell CPU with a 6 MB cache):

    [Figure: benchmark results]

    It is easy to see that small arrays (below ~2 000 000 elements) are cache-friendly even for the naive algorithm. You may also notice that the sorting approach starts to become cache-unfriendly at the last point on the diagram (where size/radix is near 0.75 of the number of cache lines in the L3 cache). Between these limits the sorting approach is more efficient than the naive algorithm.

    In theory (comparing only the memory bandwidth needed by these algorithms, with 64-byte cache lines and 4-byte values) the sorting algorithm should be about 3 times faster. In practice the difference is much smaller, about 20%. This could be improved by using smaller 16-bit values for the data array (in which case the sorting algorithm is about 1.5 times faster).

    One more problem with the sorting approach is its worst-case behavior when size/radix is close to a power of 2 (the bucket write positions are then spaced by a near power-of-2 stride, so they compete for the same cache sets). This may either be ignored (because there are not many such "bad" sizes) or fixed by making the algorithm slightly more complicated.

    If we increase the number of passes to 3, all 3 passes use mostly the L1 cache, but memory bandwidth is increased by 60%. I used this code to get experimental results: TL; DR. After determining (experimentally) the best radix value, I got somewhat better results for sizes greater than 4 000 000 (where the 2-pass algorithm uses the L3 cache for one pass) but somewhat worse results for smaller arrays (where the 2-pass algorithm uses the L2 cache for both passes). As might be expected, performance is better for 16-bit data.

    Conclusion: the performance difference is much smaller than the difference in complexity of the algorithms, so the naive approach is almost always better; if performance is very important and only 2- or 4-byte values are used, the sorting approach is preferable.