c++ · performance · x86 · histogram · cpu-architecture

Why is computing the histogram of a sorted array slower?


Consider this code on Godbolt

The goal is to benchmark a simple histogram function:

[[gnu::noinline]] static void histogram(int const *a, int n, int *h) {
  for (int i = 0; i < n; ++i)
    h[a[i]]++;
}

The benchmark measures the average time spent in the histogram call, once when the input is random data and once when the input is sorted.
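The full program is on Godbolt; a minimal sketch of that kind of driver is shown below. The hard-coded parameters, std::mt19937, and steady_clock timing are my own simplifications, not the original code (which, per the output above and the answer below, reads N, K, T, R from the command line and uses rand()):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Function under test (same as above).
[[gnu::noinline]] static void histogram(int const *a, int n, int *h) {
  for (int i = 0; i < n; ++i)
    h[a[i]]++;
}

// Hypothetical driver: average ns per element over t invocations.
static double bench(std::vector<int> const &a, int k, int t) {
  std::vector<int> h(k);  // counts, zero-initialized by std::vector
  auto start = std::chrono::steady_clock::now();
  for (int s = 0; s < t; ++s)
    histogram(a.data(), static_cast<int>(a.size()), h.data());
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::nano> ns = stop - start;
  return ns.count() / (static_cast<double>(t) * a.size());
}

int main() {
  int n = 1000000, k = 512, t = 100;  // N, K, T (hard-coded here for brevity)
  unsigned seed = 17;                 // R
  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> dist(0, k - 1);
  std::vector<int> a(n);
  for (int &x : a) x = dist(gen);     // random bucket indices in [0, K)
  std::printf("time per element (random) : %f ns.\n", bench(a, k, t));
  std::sort(a.begin(), a.end());      // same data, sorted
  std::printf("time per element (sorted) : %f ns.\n", bench(a, k, t));
}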

This is the output of this program on my machine:

g++ test.cpp -std=c++20 -O3 -march=native -o test.out && ./test.out 1000000 512 100 17
N (number of elements) = 1000000.
K (histogram size) = 512.
T (number of samples) = 100.
R (random seed) = 17.
time per element (random) : 0.751250 ns.
time per element (sorted) : 2.032259 ns.

I ran this experiment on a 4-core Sandy Bridge Intel processor (i5-2320). Why is the histogram algorithm more than 2 times slower when the input is sorted?


Solution

  • Storing/reloading the same counter repeatedly creates a serial dependency chain that's too long for out-of-order execution to fully overlap: with sorted input, each increment reloads the value just stored by the previous increment of the same bucket.

    If the value range isn't too big (here K = 512, so the whole histogram fits in L1 cache), the random order probably still hits in cache for most increments and is limited only by throughput, not by store-forwarding latency bottlenecks.

    Some modern CPUs (Zen 2, Zen 4, and Ice Lake at least) have zero-latency store-forwarding in some cases, which may include this case where the index is reloaded with the same value. This might be why Godbolt runs show the sorted test being faster.

    The benchmark warms up the CPU to turbo frequency with very slow rand() calls, and the arrays are pre-zeroed because they're std::vector<int>, so the problems described in Idiomatic way of performance evaluation? don't appear to be a factor here.


    See Methods to vectorise histogram in SIMD? for a trick that helps: use multiple arrays of counts and sum them at the end. (The final summation can use SIMD; the actual counting can't, unless you have AVX-512 for scatter/gather, and even then having multiple elements that map to the same bucket within one vector makes it slower.) A rough sketch of the multiple-count-arrays idea is included at the end of this answer.

    This wouldn't be necessary on CPUs where memory-renaming works to get zero-latency store-forwarding in this case, except on very wide CPUs that can do more than one load + memory-destination increment per clock.


    If you know your data is sorted, you could count runs of the same element and do a single += count per run, although that can easily lead to branch mispredicts unless you do something like _mm_cmpeq_epi32(load(ptr), set1(*ptr)) and _mm_movemask_ps for a couple of vectors, and check that the popcount is < 8 elements, so runs shorter than that always take the branch the same way. (The actual compare could be on mask < 0xff, but then you'd add the popcount to get the run length. Or use 31 - std::countl_zero, or std::countr_zero(mask + 1) so you can use tzcnt / bsf.) A plain scalar sketch of this run-counting idea is also included below.
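
    For reference, here is a rough scalar sketch of the multiple-count-arrays idea. The interleave factor of 4 and the plain scalar final summation are my own choices for illustration; the final summing pass is the part the answer says can use SIMD:

    #include <vector>

    // Hypothetical sketch: spread increments across several count arrays so that
    // back-to-back increments of the same bucket value hit different addresses,
    // shortening the store/reload dependency chain. Combine the partial counts
    // at the end.
    static void histogram_multi(int const *a, int n, int *h, int k) {
      constexpr int NSTREAMS = 4;            // assumed interleave factor
      std::vector<int> tmp(NSTREAMS * k);    // zero-initialized partial counts
      int i = 0;
      for (; i + NSTREAMS <= n; i += NSTREAMS) {
        tmp[0 * k + a[i + 0]]++;
        tmp[1 * k + a[i + 1]]++;
        tmp[2 * k + a[i + 2]]++;
        tmp[3 * k + a[i + 3]]++;
      }
      for (; i < n; ++i)                     // leftover tail elements
        tmp[a[i]]++;
      for (int b = 0; b < k; ++b)            // sum the partial histograms
        for (int s = 0; s < NSTREAMS; ++s)
          h[b] += tmp[s * k + b];
    }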
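
    And a minimal, purely scalar sketch of the run-counting idea for already-sorted input. This is the simple branchy version; the SIMD compare/movemask trick described above is one way to make the run-length detection less misprediction-prone:

    // Hypothetical sketch: for sorted input, find the length of each run of equal
    // values and do one += per run instead of one read-modify-write per element.
    static void histogram_sorted_runs(int const *a, int n, int *h) {
      int i = 0;
      while (i < n) {
        int v = a[i];
        int start = i;
        while (i < n && a[i] == v)   // branchy scan to the end of the run
          ++i;
        h[v] += i - start;           // single increment per run of equal values
      }
    }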