
Perf stat counts context-switches in what way?


perf stat displays some interesting statistics that can be gathered from hardware and software counters. In my research, I couldn't find any reliable information about what counts as a context switch in perf stat, and I wasn't able to work through the relevant kernel code in its entirety. Suppose my InfiniBand network application calls a blocking read system call 2000 times in event mode, and perf stat counts 1,241 context switches. Do the context-switches refer to schedule-in events, schedule-out events, or both?

The __schedule() function (kernel/sched/core.c) increments the switch_count counter whenever prev != next. It seems that perf stat's context-switches include both involuntary and voluntary switches.

My impression is that only deschedule events are counted: the counter is incremented when the current context runs the scheduler code and bumps the nvcsw or nivcsw counter in its task_struct.

output from perf stat -- my_application:

         1,241      context-switches                            

Meanwhile, if I count only the sched:sched_switch event, the output is close to the expected number.

output from perf stat -e sched:sched_switch -- my_application:

         2,168      sched:sched_switch                                          

Is there a difference between context-switches and the sched:sched_switch event?


Solution

  • I think you only get a count for context-switches if a different task actually runs on a core that was running one of your threads. A read() that blocks, but resumes before any user-space code from any other task runs on the core, probably won't count.

    Just entering the kernel at all for a system call clearly doesn't count; perf stat ls only counts one context-switch in a largish directory for me, or zero if I ls a smaller directory like /. I get much higher counts, like 711 for a recursive ls of a directory that I hadn't accessed recently, on a magnetic HDD. So it spent significant time waiting for I/O, and maybe running bottom-half interrupt handlers.
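
    A quick way to try this experiment yourself (a sketch; it assumes perf is installed, your kernel's perf_event_paranoid setting permits counting, and /usr/share is just an example of a largish tree on your system):

    ```shell
    # Few or no switches expected: a small, likely-cached directory listing.
    perf stat -e context-switches -- ls / > /dev/null

    # Many more switches expected: a recursive listing of a larger tree,
    # especially if it is not in the page cache and ls must wait for I/O.
    perf stat -e context-switches -- ls -R /usr/share > /dev/null
    ```

    Picking a directory you haven't touched recently makes the I/O-wait effect much more pronounced, since cached metadata lets ls run without blocking.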

    The fact that the count can be odd means it's not counting both deschedule and re-schedule separately; since I'm looking at counts for a single-threaded process that eventually exited, if it was counting both the count would have to be even.

    I expect the counting is done when schedule() decides that current should change to point to a new task that isn't this one. (current is the Linux kernel's per-core variable that points to the task_struct of the current task, e.g. a user-space thread.) So every time that happens to a thread that's part of your process, you get 1 count.

    Indeed, the OP helpfully tracked down the source code: it's in __schedule() in kernel/sched/core.c. For example, in Linux 6.1:

    static void __sched notrace __schedule(unsigned int sched_mode)
    {
        struct task_struct *prev, *next;
        unsigned long *switch_count;
        // ... some other declarations omitted
        ...
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);                    // rq stands for "run queue"
        prev = rq->curr;
        ...
        switch_count = &prev->nivcsw;        // either Num InVoluntary CSWs, I think
        ...
        if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
            ...
            switch_count = &prev->nvcsw;     // or Num Voluntary CSWs
        }

        next = pick_next_task(rq, prev, &rf);
        ...
        if (likely(prev != next)) {
            ...
            ++*switch_count;                 // INCREMENT THE SELECTED COUNTER
            ...
            trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);

            // then function calls that actually perform the context switch
            ...
        }
        ...
    }

    I would guess the context-switches perf event sums both involuntary and voluntary switches away from a thread. (Assuming that's what nv and niv stand for.)
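
    As a cross-check on that guess, the kernel exports both per-task counters in /proc/[pid]/status (as voluntary_ctxt_switches and nonvoluntary_ctxt_switches, per proc(5)); for a single-threaded task their sum should track the context-switches count perf reports:

    ```shell
    # Print the two counters that __schedule() increments, for the reading
    # process itself (/proc/self resolves to the grep process here).
    grep -E '^(voluntary|nonvoluntary)_ctxt_switches' /proc/self/status
    ```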