
cache management in C


I've been reading about caching in multicore systems, and I am wondering whether it is possible to have some control over which pages are in the cache when we program in C/C++.

For instance, I know that we can use the built-in function __builtin_prefetch to move data into the cache in order to reduce cache misses and therefore latency. I found it here: https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Other-Builtins.html#Other-Builtins

I also found this for x86 and x86-64 Intel:

#include <xmmintrin.h>
enum _mm_hint
{
  _MM_HINT_T0 = 3,
  _MM_HINT_T1 = 2,
  _MM_HINT_T2 = 1,
  _MM_HINT_NTA = 0
};
void _mm_prefetch(void *p, enum _mm_hint h);

Here: http://lwn.net/Articles/255364/

What other functions can we use that give us some cache "control"? For instance, can we do anything related to cache page replacement, or is that exclusively the job of the OS?

Thanks a lot!


Solution

  • Cache prefetching hints usually emit a special prefetch instruction which advises the prefetcher that this piece of memory is going to be needed in the near future. The prefetcher might (or might not) take this advice. So in that sense, software prefetching isn't really about "cache control" or "cache management".
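    As a concrete sketch of such a hint (GCC/Clang builtin; the prefetch distance of 16 elements is a made-up starting point that would need tuning per machine, and the CPU is free to ignore it):

```c
#include <stddef.h>

/* Sum an array while hinting that data a few iterations ahead
   will be needed. The distance (16 elements) is an assumption;
   the right value depends on the machine and access pattern. */
long sum_with_prefetch(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /* rw: 0 = read */ 0,
                               /* locality: 3 = keep in all levels */ 3);
        sum += a[i];
    }
    return sum;
}
```

    The result is the same with or without the hint; only the (potential) latency differs, which is exactly why this is a hint rather than control.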

    To my knowledge, no current widespread instruction set architecture provides instructions for pinning a particular piece of memory in a cache line. (x86 does have CLFLUSH, which flushes a given cache line, but it exists for correctness with non-coherent devices and persistent memory, not as a performance-tuning tool.) The point of the cache in most contemporary architectures is to be transparent to the programmer.

    You can, however, write programs which are cache-friendly. It is all about spatial and temporal locality of both the data and the instructions:

    • Spatial locality with regard to the data means that you should strive for consecutive memory accesses which don't go too far apart from each other. This is the most natural thing to optimise for.

    • Spatial locality with regard to the instructions means that the jumps and the branches should not go too far in the code. Compilers and linkers should try to do that.

    • Temporal locality with regard to the data means accessing the same memory locations (although possibly not close to each other) within a single time-slice. How long this time-slice is can be tricky to define.

    • Temporal locality with regard to the instructions means that even if the code jumps across long distances, it jumps across the same locations in a single time-slice. This is generally very unintuitive and not very rewarding to optimise for.
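    To make the data-locality point concrete, a minimal sketch: C stores 2D arrays row-major, so the row-by-row loop below walks memory consecutively, whereas swapping the two loops would stride N * sizeof(int) bytes between consecutive accesses and waste most of each fetched cache line:

```c
#define N 512

/* Row-major traversal: m[i][j] and m[i][j+1] are adjacent in memory,
   so every cache line brought in is fully used before it is evicted.
   Swapping the i and j loops computes the same sum but with far worse
   spatial locality. */
long sum_rows(int (*m)[N])
{
    long sum = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sum += m[i][j];
    return sum;
}
```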

    Generally you should optimise for data locality rather than instruction locality: in most performance-sensitive programs, the amount of data far outweighs the amount of code.

    Further, with respect to multiple cores, you should try to avoid false sharing and use thread-local storage as much as possible. Keep in mind that each CPU core has its own dedicated cache and things like cache line bouncing between the caches of the cores can have a very negative effect.

    To illustrate false sharing, consider the following code:

    #include <pthread.h>
    
    int counts[NUM_THREADS]; // a global array where each thread writes to its slot
    
    ...
    
    pthread_t threads[NUM_THREADS];
    for (long id = 0; id < NUM_THREADS; ++id) {
        pthread_create(&threads[id], NULL, thread_start, (void *)id);
    }
    
    ...
    
    void *thread_start(void *arg)
    {
        long id = (long)arg; // this thread's slot in the counts array
    
        for (long i = 0; i < a_large_number_of_iterations; ++i) {
            int some_condition = some_calculation();
    
            if (some_condition) {
                counts[id]++; // hot write; neighbouring slots share a cache line
            }
        }
        return NULL;
    }
    

    Each thread modifies an element of the counts array at high frequency. The trouble is that the elements of the array are adjacent, so large groups of them fall on a single cache line. With a typical cache line of 64 bytes and a typical int of 4 bytes, a single cache line holds 16 elements. When each core updates only its own 4-byte count, it still invalidates the corresponding cache line in every other core, which leads to the line bouncing between cores even though the threads are seemingly using independent memory locations.
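    One common remedy is to give each counter its own cache line. A minimal sketch, assuming C11 and a 64-byte line size (typical on x86, but not guaranteed; check the target machine if it matters):

```c
#include <stdalign.h>

#define CACHE_LINE 64   /* assumed line size, not guaranteed */
#define NUM_THREADS 4

/* Aligning the member to the line size pads each slot out to a full
   cache line, so no two threads' counters can share one. */
struct padded_count {
    alignas(CACHE_LINE) int value;
};

struct padded_count counts[NUM_THREADS]; /* slots now 64 bytes apart */
```

    Alternatively, each thread can accumulate into a local variable and write the total into its slot once, after the loop, so the shared line is touched only once per thread. The trade-off of the padding approach is memory: the array grows 16-fold in this example.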