In short, my question is: should Sysbench eliminate the effect of the CPU cache when measuring memory read/write performance, similar to how the effect of memory (the OS page cache) is eliminated when measuring disk performance?
If the answer is no, does that mean Sysbench cares only about the final observed performance, with or without cache effects?
If the answer is yes, does Sysbench disable the cache somewhere (and I missed it), or does it not do so at all?
P.S. By "affect of cache", I mean: when the user-defined memory_block_size is smaller than the cache size, and the whole memory block (or a big part of it) is loaded into the CPU cache, thus the memory performance is affected by the cache.
===
And here's some background information:
I am trying to run the memory benchmark in Sysbench, and this is how Sysbench implements its random memory access test:
int event_rnd_read(sb_event_t *req, int tid)
{
  (void) req; /* unused */

  for (ssize_t i = 0; i <= max_offset; i++)
  {
    size_t offset = (size_t) sb_rand_default(0, max_offset);
    size_t val = SIZE_T_LOAD(buffers[tid] + offset);
    (void) val; /* unused */
  }

  return 0;
}
When sizeof(size_t) is 4 bytes, the macro SIZE_T_LOAD expands to:

# define SIZE_T_LOAD(ptr) ck_pr_load_32((uint32_t *)(ptr))

ck_pr_load_32 is an atomic memory load function that, according to this link, can hardly be optimized away by the compiler. And max_offset is set to memory_block_size / SIZEOF_SIZE_T - 1, where memory_block_size is in most cases set to somewhere near 4KB.
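
To put numbers on that, here is a quick back-of-envelope sketch (the 32KiB L1d size below is my assumption about a typical CPU, not something from the sysbench source):

#include <stdio.h>

int main(void)
{
  const size_t memory_block_size = 4096;  /* ~4KB block, as described above */
  const size_t sizeof_size_t = 4;         /* the 32-bit case */
  const size_t max_offset = memory_block_size / sizeof_size_t - 1;
  const size_t l1d_size = 32 * 1024;      /* typical L1d size (my assumption) */

  printf("max_offset = %zu\n", max_offset);  /* prints 1023 */
  printf("block fits in L1d: %s\n",
         memory_block_size <= l1d_size ? "yes" : "no");
  return 0;
}

So every random offset stays inside one 4KB block, which fits entirely in L1d; after the first pass over the block, essentially every load is a cache hit.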
All the code above is copied from https://github.com/akopytov/sysbench/blob/master/src/tests/memory/sb_memory.c
So, as far as I can see, Sysbench does nothing special to eliminate the effect of cache in its random memory read test. Is that true? And if so, is that reasonable?
===
It's pretty standard in memory benchmarking to graph performance vs. array size, to see how throughput falls off as the working set exceeds each level of cache. So no, Sysbench shouldn't try to defeat the cache.
If users don't want cache effects, they should specify a large buffer (i.e. a large memory_block_size). Cache is part of the memory hierarchy, so it's useful to be able to measure the cases where it helps.
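
If it helps, here is a minimal, self-contained sketch of that kind of size sweep (plain C, not sysbench code; the LCG indexing, iteration count, and size range are my own choices, and the index math adds some overhead to every sample):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average time per random 8-byte load over a buffer of the given size. */
static double ns_per_load(size_t bytes, size_t iters)
{
  size_t n = bytes / sizeof(uint64_t);
  uint64_t *buf = malloc(n * sizeof(uint64_t));
  if (!buf)
    return 0.0;
  for (size_t i = 0; i < n; i++)
    buf[i] = i;

  uint64_t state = 0x123456789abcdefULL, sink = 0;
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < iters; i++)
  {
    /* Cheap LCG index generator, so rand() overhead doesn't dominate. */
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    sink += buf[(state >> 33) % n];
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  free(buf);

  if (sink == 42) puts("");  /* keep the loads from being optimized away */
  return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
}

int main(void)
{
  /* Sweep from well inside L1d to well past typical L3. */
  for (size_t kib = 4; kib <= 64 * 1024; kib *= 4)
    printf("%8zu KiB: %6.2f ns/load\n", kib, ns_per_load(kib * 1024, 10000000));
  return 0;
}

Plot ns/load against size and you should see steps roughly at the L1d, L2, and L3 capacities; that's the curve those graphs show, and it's exactly the information you'd lose if the benchmark tried to defeat the cache.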
Even if you wanted to defeat the cache without using a larger buffer, there's no portable, efficient way to do it. The only thing that works well without introducing huge overhead is using a larger buffer.
On recent x86 CPUs, clflushopt after each read could evict the lines again, but it has to make sure the cache line is evicted from any/all cores, so it costs more like a store operation. And not all x86 CPUs support it, so using it where available would create a non-level playing field when benchmarking the same buffer size on two different CPUs.
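
For reference, here is roughly the shape that approach would take (a sketch only, assuming an x86 CPU with CLFLUSHOPT and a compiler flag like -mclflushopt; this is not something sysbench does):

#include <stdint.h>
#include <immintrin.h>

/* Load one 32-bit value, then flush its cache line so the next access
   misses again. The flush must invalidate the line in every core's
   cache, which is why it costs more like a store than a load. */
static uint32_t load_then_evict(const uint32_t *ptr)
{
  uint32_t val = *(volatile const uint32_t *)ptr;
  _mm_clflushopt((void *)ptr);
  return val;
}

Even where it's supported, every load would then pay the flush overhead on top of the miss itself, so you'd be measuring "DRAM latency plus eviction cost" rather than anything a real workload sees.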
Disk storage is different for a couple reasons: