Consider this function in C++:
void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[*b1++];
    }
}
Its purpose should be clear enough. Unfortunately, b1 contains random data and trashes the cache, making foo the bottleneck of my program. Is there any way I can optimize it?
This is an SSCCE that should resemble my actual code:
#include <cstdint>
#include <iostream>
#include <chrono>
#include <algorithm>
#include <numeric>

namespace {

void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[*b1++];
    }
}

constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;
uint32_t data[max_n], index[max_q], result[max_q];

}

int main() {
    uint32_t seed = 0;
    auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
    std::generate_n(data, max_n, rng);
    std::generate_n(index, max_q, [rng]() { return rng() % max_n; });
    auto t1 = std::chrono::high_resolution_clock::now();
    foo(data, data + max_n, index, index + max_q, result);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
    uint32_t hash = 0;
    for (unsigned i = 0; i < max_q; i++)
        hash += result[i] ^ (i << 8) ^ i;
    std::cout << hash << std::endl;
}
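For completeness, I compile with optimizations on, along these lines (the exact flags shouldn't matter much for the question):

g++ -O3 -march=native -o offline-read offline-read.cpp
./offline-read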
This is not a duplicate of Cache-friendly copying of an array with readjustment by known index, gather, scatter: that question asks about random writes and assumes b is a permutation.
First, let's take a look at the actual performance of the code above:
$ sudo perf stat ./offline-read
0.123023
1451229184
Performance counter stats for './offline-read':
184.661547 task-clock (msec) # 0.997 CPUs utilized
3 context-switches # 0.016 K/sec
0 cpu-migrations # 0.000 K/sec
717 page-faults # 0.004 M/sec
623,638,834 cycles # 3.377 GHz
419,309,952 instructions # 0.67 insn per cycle
70,803,672 branches # 383.424 M/sec
16,895 branch-misses # 0.02% of all branches
0.185129552 seconds time elapsed
We are getting a low IPC of 0.67, probably caused almost entirely by load-misses to DRAM [5]. Let's confirm:
sudo ../pmu-tools/ocperf.py stat -e cycles,LLC-load-misses,cycle_activity.stalls_l3_miss ./offline-read
perf stat -e cycles,LLC-load-misses,cpu/event=0xa3,umask=0x6,cmask=6,name=cycle_activity_stalls_l3_miss/ ./offline-read
0.123979
1451229184
Performance counter stats for './offline-read':
622,661,371 cycles
16,114,063 LLC-load-misses
368,395,404 cycle_activity_stalls_l3_miss
0.184045411 seconds time elapsed
So ~370 million cycles out of ~620 million are straight-up stalled on outstanding misses. In fact, the portion of cycles stalled this way in foo() is much higher, close to 90%, since perf is also measuring the init and accumulation code, which takes about a third of the runtime (but doesn't have significant L3 misses).
This is nothing unexpected, since we knew the random-read pattern a1[*b1++] was going to have essentially zero locality. In fact, the number of LLC-load-misses is 16 million [1], corresponding almost exactly to the 16 million random reads of a1 [2].
If we just assume 100% of foo() is spent waiting on memory access, we can get an idea of the total cost of each miss: 0.123 sec / 16,114,063 misses == 7.63 ns/miss. On my box, the memory latency is around 60 ns in the best case, so less than 8 ns per miss means we are already extracting a lot of memory-level parallelism (MLP): about 8 misses would have to be overlapped and in-flight on average to achieve this (even totally ignoring the additional traffic from the streaming load of b1 and streaming write of o).
So I don't think there are many tweaks you can apply to the simple loop to do much better. Still, two possibilities are:

1. Non-temporal (NT) stores for the writes to o, if your platform supports them. This would cut out the reads implied by RFO for normal stores. It should be a straight win since o is never read again (inside the timed portion!).

2. Software prefetching. Carefully tuned prefetching of a1 or b1 could potentially help a bit. The impact is going to be fairly limited, however, since we are already approaching the limits of MLP as described above. Also, we expect the linear reads of b1 to be almost perfectly prefetched by the hardware prefetchers. The random reads of a1 seem like they could be amenable to prefetching, but in practice the ILP in the loop leads to enough MLP through out-of-order processing (at least on big OoO processors like recent x86).
In the comments user harold already mentioned that he tried prefetching with only a small effect.
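To make that concrete, here's a rough sketch of what those two tweaks could look like on x86. This is illustrative only: the name foo_nt_prefetch, the prefetch distance of 16, and the specific intrinsics are my choices and would need tuning and verification on your machine.

#include <cstdint>
#include <immintrin.h>  // _mm_prefetch, _mm_stream_si32, _mm_sfence

// Sketch only: NT stores for o (avoids the RFO read of the output lines) plus a
// software prefetch of a1 a little ahead of use. The distance of 16 is a made-up
// starting point, not a tuned value.
void foo_nt_prefetch(const uint32_t *a1, const uint32_t *a2,
                     const uint32_t *b1, const uint32_t *b2, uint32_t *o) {
    (void)a2;  // bounds assumed valid, as in the original
    while (b1 != b2) {
        if (b2 - b1 > 16)
            _mm_prefetch(reinterpret_cast<const char *>(&a1[b1[16]]), _MM_HINT_T0);
        _mm_stream_si32(reinterpret_cast<int *>(o++), static_cast<int>(a1[*b1++]));
    }
    _mm_sfence();  // order the NT stores before o is read elsewhere
}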
So since the simple tweaks aren't likely to bear much fruit, you are left with transforming the loop. One "obvious" transformation is to sort the indexes b1 (along with the index element's original position) and then do the reads from a1 in sorted order. This transforms the reads of a1 from completely random to almost [3] linear, but now the writes are all random, which is no better.
The key problem is that the reads of a1 under control of b1 are random, and since a1 is large you get a miss to DRAM for essentially every read. We can fix that by sorting b1, and then reading a1 in order to get a permuted result. Now you need to "un-permute" that result to get it in the final order, which is simply another sort, this time on the "output index".
Here's a worked example with the given input array a
, index array b
and output array o
, and i
which is the (implicit) position of each element:
i = 0 1 2 3
a = [00, 10, 20, 30]
b = [ 3, 1, 0, 1]
o = [30, 10, 00, 10] (desired result)
First, sort array b, with the original array position i as secondary data (alternatively, you may see this as sorting tuples (b[0], 0), (b[1], 1), ...); this gives you the sorted b array b' and the sorted index list i' as shown:
i' = [ 2, 1, 3, 0]
b' = [ 0, 1, 1, 3]
Now you can read the permuted result array o' from a under the control of b'. This read is strictly increasing in order, and should be able to operate at close to memcpy speeds. In fact, you may be able to take advantage of wide contiguous SIMD reads and some shuffles to do several reads at once and move the 4-byte elements into the right place, duplicating some elements and skipping others (a rough sketch of this follows the example arrays just below):
a = [00, 10, 20, 30]
b' = [ 0, 1, 1, 3]
o' = [00, 10, 10, 30]
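As an aside, here is roughly what that wide-read-plus-shuffle could look like with AVX2. This is my own sketch (names and the window heuristic are mine): it only takes the vector path when the next 8 sorted indexes fall within a single 8-element window of a, and falls back to scalar otherwise.

#include <cstdint>
#include <immintrin.h>  // AVX2: _mm256_loadu_si256, _mm256_permutevar8x32_epi32

// a  = source array of n elements, bp = sorted indexes (b'),
// op = permuted output (o'), q = number of indexes.
void permuted_read_avx2(const uint32_t *a, uint32_t n,
                        const uint32_t *bp, uint32_t q, uint32_t *op) {
    uint32_t j = 0;
    while (j + 8 <= q) {
        uint32_t base = bp[j];             // smallest of the next 8 (bp is sorted)
        uint32_t span = bp[j + 7] - base;  // how far apart those 8 indexes are
        if (span < 8 && base + 8 <= n) {
            __m256i idx  = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(&bp[j]));
            __m256i offs = _mm256_sub_epi32(idx, _mm256_set1_epi32(static_cast<int>(base)));
            __m256i src  = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(&a[base]));
            // vpermd picks (with duplicates) the right 4-byte elements from the 8 loaded
            _mm256_storeu_si256(reinterpret_cast<__m256i *>(&op[j]),
                                _mm256_permutevar8x32_epi32(src, offs));
            j += 8;
        } else {
            op[j] = a[bp[j]];              // scalar fallback
            ++j;
        }
    }
    for (; j < q; ++j) op[j] = a[bp[j]];   // tail
}

Whether the vector path is taken often enough to pay off depends on how dense b' is; with max_q == max_n as in the question, consecutive sorted indexes differ by about 1 on average, so a fair fraction of windows should qualify.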
Finally, you de-permute o'
to get o
, conceptually simply by sorting o'
on the permuted indexes i'
:
i' = [ 2, 1, 3, 0]
o' = [00, 10, 10, 30]
i = [ 0, 1, 2, 3]
o = [30, 10, 00, 10]
Finished!
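To make the data flow concrete, here is a minimal sketch of the whole sort → linear read → unsort pipeline, using std::sort as a stand-in for the cache-aware radix sort discussed below (the function and type names are mine):

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of the two-sort idea: sort the indexes, gather linearly, then sort the
// gathered values back into output order.
void foo_two_sorts(const uint32_t *a, const uint32_t *b, uint32_t q, uint32_t *o) {
    struct IdxPos { uint32_t idx, pos; };              // (b[i], i)
    std::vector<IdxPos> s(q);
    for (uint32_t i = 0; i < q; i++) s[i] = { b[i], i };

    // Sort pass 1: order by index, giving b' (s[k].idx) and i' (s[k].pos).
    std::sort(s.begin(), s.end(),
              [](const IdxPos &x, const IdxPos &y) { return x.idx < y.idx; });

    // Permuted read: o'[k] = a[b'[k]]; both the reads of a and the writes of o'
    // move forward through memory.
    struct ValPos { uint32_t val, pos; };
    std::vector<ValPos> op(q);
    for (uint32_t k = 0; k < q; k++) op[k] = { a[s[k].idx], s[k].pos };

    // Sort pass 2 ("unsort"): order by original position to de-permute o' into o.
    std::sort(op.begin(), op.end(),
              [](const ValPos &x, const ValPos &y) { return x.pos < y.pos; });
    for (uint32_t k = 0; k < q; k++) o[k] = op[k].val;
}

With the SSCCE above you'd call it as foo_two_sorts(data, index, max_q, result). As written (two comparison sorts of 16 million 8-byte records) it probably won't beat the naive loop; it is only meant to show the structure that a radix-based version would follow.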
Now this is the simplest idea of the technique and isn't particularly cache-friendly (each pass conceptually iterates over one or more 2^26-byte arrays), but it at least fully uses every cache line it reads (unlike the original loop which only reads a single element from a cache line, which is why you have 16 million misses even though the data only occupies 1 million cache lines!). All of the reads are more or less linear, so hardware prefetching will help a lot.
How much speedup you get probably depends largely on how you implement the sorts: they need to be fast and cache-sensitive. Almost certainly some type of cache-aware radix sort will work best.
Here are some notes on ways to improve this further:
You don't actually need to fully sort b. You just want to sort it "enough" such that the subsequent reads of a under the control of b' are more or less linear. For example, 16 elements fit in a cache line, so you don't need to sort based on the last 4 bits at all: the same linear sequence of cache lines will be read anyways. You could also sort on even fewer bits: e.g., if you ignored the 5 least-significant bits, you'd read cache lines in an "almost linear" way, sometimes swapping two cache lines from the perfectly linear pattern, like: 0, 1, 3, 2, 5, 4, 6, 7. Here, you'll still get the full benefit of the L1 cache (subsequent reads to a cache line will always hit), and I suspect such a pattern would still be prefetched well, and if not, you can always help it with software prefetching.
You can test on your system what the optimal number of ignored bits is. Ignoring bits has two benefits: the sort itself has less work to do (fewer key bits means fewer or smaller radix passes), and, to the extent the "unsort" is driven by the same partially-sorted b, ignoring bits means that you get the same savings when undoing the sort.

The above description lays out everything in several sequential, disjoint passes that each work on the entire data set. In practice, you'd probably want to interleave them to get better caching behavior. For example, assuming you use an MSD radix-256 sort, you might do the first pass, sorting the data into 256 buckets of approximately 64K elements each.
Then, rather than doing the full second pass, you might finish sorting only the first (or first few) buckets, and proceed to do the read of a based on the resulting block of b'. You are guaranteed that this block is contiguous (i.e., a prefix of the final sorted sequence), so you don't give up any locality in the read, and your reads will generally be cached. You may also do the first pass of de-permuting o', since the block of o' is also hot in the cache (and perhaps you can combine the latter two phases into one loop).
One area for optimization is how exactly the de-permutation of o' is implemented. In the description above, we assume some index array i initially with values [0, 1, 2, ..., max_q] which is sorted along with b. That's conceptually how it works, but you may not need to actually materialize i right away and sort it as auxiliary data. In the first pass of the radix sort, for example, the value of i is implicitly known (since you are iterating through the data), so it could be calculated for free [4] and written out during the first pass without ever having appeared in sorted order.
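To illustrate the last two points together (interleaving the passes, and never materializing i up front), here is a rough single-threaded sketch. The structure, names, bucket count and kIgnoredBits are all mine, and std::sort stands in for finishing each bucket:

#include <algorithm>
#include <cstdint>
#include <vector>

// One MSD radix-256 pass over (index, position) pairs, where the position i is
// generated on the fly rather than materialized as a [0, 1, 2, ...] array. Each
// index bucket is finished and consumed while it is still hot, and the gathered
// values are immediately pushed into the first pass of the "unsort".
void foo_radix_interleaved(const uint32_t *a, uint32_t n,
                           const uint32_t *b, uint32_t q, uint32_t *o) {
    if (q == 0 || n == 0) return;
    struct IdxPos { uint32_t idx, pos; };
    struct ValPos { uint32_t val, pos; };

    // Shifts so that indexes [0, n) and positions [0, q) each map onto 256 buckets.
    unsigned idx_shift = 0; while ((n - 1) >> (idx_shift + 8)) idx_shift++;
    unsigned pos_shift = 0; while ((q - 1) >> (pos_shift + 8)) pos_shift++;

    std::vector<std::vector<IdxPos>> in_bucket(256);   // radix pass over b
    std::vector<std::vector<ValPos>> out_bucket(256);  // first radix pass of the unsort
    for (uint32_t i = 0; i < q; i++)
        in_bucket[b[i] >> idx_shift].push_back({ b[i], i });  // i comes "for free"

    constexpr unsigned kIgnoredBits = 4;  // 16 x 4-byte elements per 64-byte line
    for (auto &blk : in_bucket) {
        // Finish sorting this bucket only as far as needed for linear-ish reads of a.
        std::sort(blk.begin(), blk.end(),
                  [](const IdxPos &x, const IdxPos &y)
                  { return (x.idx >> kIgnoredBits) < (y.idx >> kIgnoredBits); });
        for (const IdxPos &e : blk)   // gather from a narrow slice of a, while hot
            out_bucket[e.pos >> pos_shift].push_back({ a[e.idx], e.pos });
    }

    // Finish the unsort: each output bucket covers a contiguous slice of o.
    for (auto &blk : out_bucket)
        for (const ValPos &e : blk)
            o[e.pos] = e.val;
}

Note that the final writes to o are random only within one output bucket (roughly a 256 KiB slice of o for the sizes in the question), so they should mostly stay in cache.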
There may also be more efficient ways to do the "unsort" operation than maintaining the full index. For example, the original unsorted b array conceptually has all the information needed to do the unsort, but it is not clear to me how to use it to unsort efficiently.
So will this actually be faster than the naive approach? It depends largely on implementation details, especially including the efficiency of the implemented sort. On my hardware, the naive approach is processing about ~140 million elements per second. Online descriptions of cache-aware radix sorts seem to vary from perhaps 200 to 600 million elements/s, and since you need two of those, the opportunity for a big speedup would seem limited if you believe those numbers. On the other hand, those numbers are from older hardware, and for slightly more general sorts (e.g., for all 32 bits of the key, while we may be able to use as few as 16 bits).
Only a careful implementation will determine if it is feasible, and feasibility also depends on the hardware. For example, on hardware that can't sustain as much MLP, the sorting-unsorting approach becomes relatively more favorable.
The best approach also depends on the relative values of max_n
and max_q
. For example, if max_n >> max_q
, then the reads will be "sparse" even with optimal sorting, so the naive approach would be better. On the other hand if max_n << max_q
, then the same index will usually be read many times, so the sorting approach will have good read locality, the sorting steps will themselves have better locality, and further optimizations which handle duplicate reads explicitly may be possible.
It isn't clear from the question whether you are interested in parallelizing this. The naive solution for foo() already admits a "straightforward" parallelization where you simply partition the b and o arrays into equal-sized chunks, one for each thread, which would seem to provide a perfect speedup. Unfortunately, you'll probably find that you get much worse than linear scaling, because you'll be running into resource contention in the memory controller and associated uncore/offcore resources which are shared between all cores on a socket. So it isn't clear how much more throughput you'll get for a purely parallel random read load to memory as you add more cores [6].
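For reference, that "straightforward" partitioned version might look something like this (std::thread-based; the chunking and names are my own):

#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Naive parallelization of the original loop: each thread handles an equal-sized
// chunk of the index array b and the matching chunk of the output o, while all
// threads read from the shared array a.
void foo_parallel(const uint32_t *a, const uint32_t *b, uint32_t q, uint32_t *o,
                  unsigned nthreads = std::thread::hardware_concurrency()) {
    if (nthreads == 0) nthreads = 1;
    std::vector<std::thread> pool;
    uint32_t chunk = (q + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; t++) {
        uint32_t lo = t * chunk, hi = std::min(q, lo + chunk);
        if (lo >= hi) break;
        pool.emplace_back([=] {
            for (uint32_t i = lo; i < hi; i++)
                o[i] = a[b[i]];
        });
    }
    for (auto &th : pool) th.join();
}

Each thread's chunk of b and o is private and a is shared read-only, so no synchronization is needed beyond the joins; the contention described above is entirely in the memory system.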
For the radix-sort version, most of the bottlenecks (store throughput, total instruction throughput) are in the core, so I expect it to scale reasonably with additional cores. As Peter mentioned in the comments, if you are using hyperthreading, the sort may have the additional benefit of good locality in the core-local L1 and L2 caches, effectively letting each sibling thread use the entire cache, rather than cutting the effective capacity in half. Of course, that involves carefully managing your thread affinity so that sibling threads actually use nearby data, and not just letting the scheduler do whatever it does.
[1] You might ask why the LLC-load-misses count isn't, say, 32 or 48 million, given that we also have to read all 16 million elements of b1 and then the accumulation loop reads all of result. The answer is that LLC-load-misses only counts demand misses that actually miss in the L3. The other mentioned read patterns are totally linear, so the prefetchers will always be bringing the line into the L3 before it is needed. These don't count as "LLC misses" by the definition perf uses.
[2] You might want to know how I know that the load misses all come from the reads of a1 in foo: I simply used perf record and perf mem to confirm that the misses were coming from the expected assembly instruction.
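In case it's useful, that was along these lines (the exact events and output depend on your CPU and perf version):

perf record ./offline-read
perf report
perf mem record ./offline-read
perf mem report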
[3] Almost linear because b1 is not a permutation of all indexes, so in principle there can be skipped and duplicate indexes. At the cache-line level, however, it is highly likely that every cache line will be read in order, since each element has a ~63% chance of being included and a cache line holds 16 4-byte elements, so there's only about a 1 in 10 million chance that any given cache line has zero included elements. So prefetching, which works at the cache-line level, will work fine.
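For reference, those two numbers come from the usual balls-into-bins approximation, with 2^24 random indexes into 2^24 elements and 16 elements per 64-byte line:

P(element included)  = 1 - (1 - 1/2^24)^(2^24)  ≈ 1 - 1/e  ≈ 63%
P(line has no included element) = (1 - 16/2^24)^(2^24) ≈ e^-16 ≈ 1.1e-7 ≈ 1 in 9 million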
[4] Here I mean that the calculation of the value comes for free or nearly so, but of course the write still costs. This is still much better than the "up-front materialization" approach, however, which first creates the i array [0, 1, 2, ...], needing max_q writes, and then again needs another max_q writes to sort it in the first radix sort pass. The implicit materialization only incurs the second write.
[5] In fact, the IPC of the actual timed section foo() is much lower: about 0.15, based on my calculations. The reported IPC of the entire process is an average of the IPC of the timed section and the initialization and accumulation code before and after, which has a much higher IPC.
[6] Notably, this is different from how a dependent-load, latency-bound workload scales: a load stream that is doing random reads but can only have one load in progress, because each load depends on the result of the last, scales very well to multiple cores, because the serial nature of the loads doesn't use many downstream resources (but such loads can conceptually also be sped up even on a single core by changing the core loop to handle more than one dependent load stream in parallel).