I'm running a simple program and measuring its kernel-mode memory operations with perf stat, roughly:
perf stat -e mem_inst_retired.all_loads:k,mem_inst_retired.all_stores:k -I 200 -p <pid>
Here is the minimal test code:
#include <atomic>
#include <cstring>      // memcpy
#include <pthread.h>
#include <random>
#include <sched.h>      // cpu_set_t, CPU_SET

extern std::atomic<bool> should_stop;  // set elsewhere to end the loop

void access_memory(char *memory) {
    // Pin this thread to CPU 1 so all samples land on one core
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(1, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);

    // Random offsets into a 2 GiB buffer, leaving room for the 500-byte copy
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<size_t> dist(0, (2ULL << 30) - 500);

    char buffer[500];
    while (!should_stop) {
        size_t offset = dist(gen);
        memcpy(buffer, memory + offset, 500);  // loads from the big buffer
        buffer[0]++;                           // touch the copy so it isn't optimized away
    }
}
Questions:
I observe that changing the sampling interval from 200 ms to 5 s changes the measured kernel operations from ~10^5 to ~10^7 per interval.
Running our test program pinned with taskset -c 1 and recording:
sudo perf record -e mem_inst_retired.all_loads:k -p <PID> -- sleep 20
# Total Lost Samples: 0
#
# Samples: 15 of event 'mem_inst_retired.all_loads:k'
# Event count (approx.): 30000045
#
# Overhead Command Shared Object Symbol
# ........ ......... ................. ..................................
#
13.33% perf-ldst [kernel.kallsyms] [k] decay_load
13.33% perf-ldst [kernel.kallsyms] [k] read_tsc
13.33% perf-ldst [kernel.kallsyms] [k] update_process_times
13.33% perf-ldst [kernel.kallsyms] [k] update_vsyscall
6.67% perf-ldst [kernel.kallsyms] [k] __raw_spin_lock_irqsave
6.67% perf-ldst [kernel.kallsyms] [k] __update_load_avg_se
6.67% perf-ldst [kernel.kallsyms] [k] _raw_spin_lock_irqsave
6.67% perf-ldst [kernel.kallsyms] [k] irq_work_run_list
6.67% perf-ldst [kernel.kallsyms] [k] timekeeping_adjust.constprop.0
6.67% perf-ldst [kernel.kallsyms] [k] timekeeping_advance
6.67% perf-ldst [kernel.kallsyms] [k] xas_move_index
It isn't clear why timer/scheduler interrupts and their bookkeeping are attributed to the running process. These are system maintenance tasks that would happen regardless of which process is running; they aren't being done "on behalf of" our process the way a syscall is.
Here is the simple code that does the accesses and measures load/store activity: https://gist.github.com/VinayBanakar/8cd04c5fa03a6895292498d3e3687aac
Interrupt and exception handlers don't context-switch away from the current task (until/unless the scheduler decides they should), so the HW performance counters stay programmed (in your case to count kernel load/store instructions) while timer and external hardware interrupt handlers are running.
Linux's perf subsystem virtualizes the PMU (when not in full-system mode) the same way it virtualizes other process context: by saving/restoring it on context switch. The per-core variable current points to the task that's currently executing on this core (and whose page tables CR3 points to; there always has to be a current task).
The interrupt-handler instructions aren't part of the work your program is doing, but your process is current while they run, so you get the counts. To avoid that, interrupt handlers would have to save/restore the PMU counters whenever any per-task events were being counted. The check for in-use perf events would cost performance all the time, not just while perf was running, which is bad. (Extra cache misses in something you want to be very lightweight.)
See /proc/interrupts for per-CPU interrupt counts. You can check it before and after your run to see how many interrupts of which kinds happened on the core your process was pinned to.
perf stat generates minimal interrupts since the counters can be programmed with as high a limit as they support, so overflow events are very infrequent; unlike perf record, where you want frequent-enough interrupts to collect meaningful statistics. (PEBS writes to a buffer instead of interrupting, but I've read that the PEBS buffers are normally very small, like 1 record.)
Page faults are a common type of exception that causes kernel code to run, e.g. zeroing new pages of stack, BSS, or heap (mmap(MAP_ANONYMOUS)) when you first write them. (If the first access is a read, the kernel will copy-on-write map them to a system-wide shared page of zeros.) So you probably have a few page faults where the kernel stores a whole page of zeroes as you touch new stack space, but otherwise your scattered reads should just result in stores to page-table data structures, plus kernel bookkeeping and function-call overhead.
But in your case, if you do all your init before perf attaches, you probably won't measure any page faults. Future readers who use perf stat ./workload would be generating page faults.
As far as I know, interrupt handlers (and the scheduler) are the only kernel code that run without a context switch.
Interrupt handlers can include IPI (inter-processor interrupt) handlers, e.g. for TLB shootdowns on a core where a thread of the process recently ran, even if it isn't running there now. Linux runs other work in schedulable kernel tasks, which are proper tasks with a PID, just like user-space processes.