I'm running a simple program and measuring its kernel-mode memory operations with perf stat, roughly:
perf stat -e mem_inst_retired.all_loads:k,mem_inst_retired.all_stores:k -I 200 -p <pid>
Here is the minimal test code:
#include <atomic>
#include <cstring>      // memcpy
#include <pthread.h>
#include <random>
#include <sched.h>      // cpu_set_t, CPU_SET

extern std::atomic<bool> should_stop;  // set elsewhere to end the loop

void access_memory(char *memory) {
    // Pin this thread to CPU 1 so all samples land on one core
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(1, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);

    // Random offsets into a 2 GiB buffer, leaving room for the 500-byte copy
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<size_t> dist(0, (2ULL << 30) - 500);

    char buffer[500];
    while (!should_stop) {
        size_t offset = dist(gen);
        memcpy(buffer, memory + offset, 500);  // loads from the big buffer
        buffer[0]++;                           // touch the copy so it isn't optimized away
    }
}
Questions:
I observe that changing the sampling interval from 200 ms to 5 s changes the measured kernel operations from ~10^5 to ~10^7 per interval.
Running our test program pinned with taskset -c 1 and recording:
sudo perf record -e mem_inst_retired.all_loads:k -p <PID> -- sleep 20
# Total Lost Samples: 0
#
# Samples: 15 of event 'mem_inst_retired.all_loads:k'
# Event count (approx.): 30000045
#
# Overhead Command Shared Object Symbol
# ........ ......... ................. ..................................
#
13.33% perf-ldst [kernel.kallsyms] [k] decay_load
13.33% perf-ldst [kernel.kallsyms] [k] read_tsc
13.33% perf-ldst [kernel.kallsyms] [k] update_process_times
13.33% perf-ldst [kernel.kallsyms] [k] update_vsyscall
6.67% perf-ldst [kernel.kallsyms] [k] __raw_spin_lock_irqsave
6.67% perf-ldst [kernel.kallsyms] [k] __update_load_avg_se
6.67% perf-ldst [kernel.kallsyms] [k] _raw_spin_lock_irqsave
6.67% perf-ldst [kernel.kallsyms] [k] irq_work_run_list
6.67% perf-ldst [kernel.kallsyms] [k] timekeeping_adjust.constprop.0
6.67% perf-ldst [kernel.kallsyms] [k] timekeeping_advance
6.67% perf-ldst [kernel.kallsyms] [k] xas_move_index
It isn't clear why timer/scheduler interrupts and their bookkeeping are attributed to the running process. These are system maintenance tasks that would happen regardless of which process is running; they aren't being done "on behalf of" our process the way a syscall is.
Here is the simple code that does the accesses and measures load/store activity: https://gist.github.com/VinayBanakar/8cd04c5fa03a6895292498d3e3687aac
Interrupt and exception handlers don't context-switch away from the current task (until/unless the scheduler decides they should), so the HW performance counters stay programmed (in your case to count kernel load/store instructions) while timer and external hardware interrupt handlers are running.
Linux's perf subsystem virtualizes the PMU (when not in full-system mode) the same way it virtualizes other process context: by saving/restoring it on context switch. The per-core variable current points to the task that's currently executing on this core (and whose page tables CR3 points to; there always has to be a current task).
The interrupt-handler instructions aren't part of the work your program is doing, but your process is current while they run, so you get the counts. To avoid that, interrupt handlers would have to save/restore the PMU counters whenever any per-task events were being counted. The check for in-use perf events would cost performance all the time, not just while perf was running, which is bad. (Extra cache misses in something you want to be very lightweight.)
See /proc/interrupts for per-CPU interrupt counts. You can check it before and after your run to see how many interrupts of which kinds happened on the core your process was pinned to.
perf stat generates minimal interrupts since the counters can be programmed with as high a limit as they support, so overflow events are very infrequent; unlike perf record, where you want frequent-enough interrupts to collect meaningful statistics. (PEBS writes to a buffer instead of interrupting, but I've read that the PEBS buffers are normally very small, like 1 record.)
Page faults are a common type of exception that causes kernel code to run, e.g. zeroing new pages of stack, BSS, or heap (mmap(MAP_ANONYMOUS)) when you first write them. (If the first access is a read, the kernel will copy-on-write map them to a system-wide shared page of zeros.) So you probably have a few page faults where the kernel stores a whole page of zeroes as you touch new stack space, but otherwise your scattered reads should just result in stores to page-table data structures, plus kernel bookkeeping and function-call overhead.
But in your case, if you do all your init before perf attaches, you probably won't measure any page faults. Future readers who use perf stat ./workload would be generating page faults.
As far as I know, interrupt handlers (and the scheduler) are the only kernel code that run without a context switch.
Interrupt handlers can include IPI (inter-processor interrupt) handlers, e.g. for TLB shootdowns on a core where a thread of the process recently ran, even if it isn't running there now. Linux runs other work in schedulable kernel tasks, which are proper tasks with a PID, just like user-space processes.