Tags: c, linux-kernel, arm, benchmarking

What is ChaseNS in this pointer-chasing benchmark?


Trying to figure out the output of the following benchmark from Google:

https://github.com/google/multichase

and output is like:

   ./multiload -s 16 -n 5 -t 16 -m 512M -c chaseload -l stream-sum
    Samples , Byte/thd  , ChaseThds , ChaseNS   , ChaseMibs , ChDeviate , LoadThds  , LdMaxMibs , LdAvgMibs , LdDeviate , ChaseArg  , MemLdArg
    5       , 536870912 , 1         , 212.726   , 36        , 0.017     , 15        , 17427     , 17331     , 0.012     , chaseload , stream-sum

What is ChaseNS here? Is it the time taken to access every 16th byte from an array of 512 MB?

and

Is ChaseMibs the bandwidth we get while accessing the addresses at this 16-byte stride?


Solution

  • I'd assume it's nanoseconds per something, perhaps per load (dereference). 212 ns is a long time to wait for a cache-miss load, but with contention from multiple cores it's plausible.

    is it the time taken to access every 16th byte from an array of 512 MB?

    This is a pointer-chasing microbenchmark, like p = p->next, so you're measuring load latency by making each load-address dependent on the previous load's result. So hopefully the access pattern is not regular, otherwise hardware prefetching would defeat it, by having the next thing to load already in local L1d cache before the load-address is known.

    e.g. make an array of structs that each hold a pointer (like struct foo { struct foo *next; };) with each one pointing to the next, then shuffle the order of the links, so iterating over that linked list touches cache lines in a random order within that 512 MiB working set.
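
    Here's a minimal sketch of that construction (my own illustration of the usual technique, not multichase's actual code; names like NNODES, NLOADS, and STRIDE are mine): build a randomly-ordered cyclic linked list spanning a 512 MiB buffer, one node per 64-byte cache line, then time the dependent-load chase. Compile with something like gcc -O2 chase.c on Linux:

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        struct foo { struct foo *next; };

        #define STRIDE 64u                 /* one node per 64-byte cache line */
        #define NNODES (1u << 23)          /* 2^23 nodes * 64 B = 512 MiB */
        #define NLOADS (10u * 1000 * 1000)

        /* tiny xorshift PRNG so we don't depend on RAND_MAX being large */
        static unsigned long long rng_state = 0x9e3779b97f4a7c15ull;
        static unsigned long long rng(void) {
            rng_state ^= rng_state << 13;
            rng_state ^= rng_state >> 7;
            rng_state ^= rng_state << 17;
            return rng_state;
        }

        int main(void) {
            char *buf = malloc((size_t)NNODES * STRIDE);
            unsigned *perm = malloc(NNODES * sizeof *perm);
            if (!buf || !perm) return 1;

            /* Fisher-Yates shuffle of node indices: a random chase order
               defeats HW prefetching, unlike a sequential list. */
            for (unsigned i = 0; i < NNODES; i++) perm[i] = i;
            for (unsigned i = NNODES - 1; i > 0; i--) {
                unsigned j = (unsigned)(rng() % (i + 1));
                unsigned t = perm[i]; perm[i] = perm[j]; perm[j] = t;
            }

            /* link node perm[i] -> node perm[i+1], closing the cycle */
            for (unsigned i = 0; i < NNODES; i++) {
                struct foo *n = (struct foo *)(buf + (size_t)perm[i] * STRIDE);
                unsigned nxt = perm[(i + 1) % NNODES];
                n->next = (struct foo *)(buf + (size_t)nxt * STRIDE);
            }
            free(perm);

            /* the chase: each load address depends on the previous load's
               result, so elapsed_ns / NLOADS ~= load latency */
            struct timespec t0, t1;
            struct foo *p = (struct foo *)buf;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (unsigned i = 0; i < NLOADS; i++)
                p = p->next;
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            /* print p so the compiler can't optimize the chase away */
            printf("%.2f ns per load (p=%p)\n", ns / NLOADS, (void *)p);
            free(buf);
            return 0;
        }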


    I guess ChaseThds = 1 and LoadThds = 15 are saying that we have 1 thread chasing pointers, and 15 other threads trying to saturate memory bandwidth? That would do it. The load threads aren't waiting for one load to complete before starting the next, so each of them can achieve some memory-level parallelism (doing memset or memcpy, or memchr or whatever), and we can see they're achieving 17331 MiB/s.

    Oh, "stream-sum" is probably A[i] = B[i] + C[i], like the Dr. Bandwidth's STREAM benchmark, perhaps exactly that code.

    (I doubt they really mean Mib, mebibits. It's weird to be precise about using the IEC binary prefix Mi, but then use b (bits) instead of B (bytes). But 17 GiB/s is a typical memory-bandwidth number for a modern-ish system.)

    I don't know exactly what this benchmark does to construct its data, or whether the load threads are reading their own block of memory or the same array of pointers. I didn't look at the GitHub page for it; the name alone and the results make the basics pretty clear: it's a memory-latency benchmark done the usual way.


    Dr. Bandwidth commented that this benchmark uses a geometric mean to calculate ChaseNS. That's usually not what you want for average latency: min/max, or arithmetic mean ± standard deviation, are typically more meaningful. And looking at the 90th / 99th percentile worst cases is useful to explore the long tail for real-time use-cases.
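
    A quick illustration of the difference (made-up sample values, compile with -lm): the geometric mean de-emphasizes a long-tail outlier that the arithmetic mean would reflect.

        #include <math.h>
        #include <stdio.h>

        int main(void) {
            /* four typical chase latencies plus one tail outlier, in ns */
            double ns[] = { 210.0, 212.0, 215.0, 213.0, 900.0 };
            int n = sizeof ns / sizeof ns[0];

            double sum = 0.0, logsum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += ns[i];
                logsum += log(ns[i]);   /* geomean = exp(mean of logs) */
            }
            printf("arithmetic mean: %.0f ns\n", sum / n);         /* 350 ns */
            printf("geometric mean:  %.0f ns\n", exp(logsum / n)); /* ~284 ns */
            return 0;
        }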