c++ linux raspberry-pi real-time raspberry-pi4

Linux PREEMPT_RT: SCHED_OTHER performs better than SCHED_FIFO. Why?

I'm experimenting with the realtime capabilities of the Raspberry Pi 3/4 and have written the following C++ program to test them.


// Compile with:
// g++ realtime_task.cpp -o realtime_task -lrt && sudo setcap CAP_SYS_NICE+ep realtime_task

#include <cstdio>
#include <cstdint>   // uint32_t, uint64_t
#include <limits>    // numeric_limits
#include <sched.h>
#include <unistd.h>
#include <fcntl.h>
#include <chrono>
#include <algorithm> // min, max

using namespace std;
using namespace chrono;
using namespace chrono_literals;

int main(int argc, char **argv)
{   
    // allocate this process to the 4th core (core 3)
    pid_t pid = getpid();
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(3, &cpuset);
    int result = 0;
    result = sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);
    if (result != 0)
    {
        perror("`sched_setaffinity` failed");
    }

    struct sched_param param;

    // Use SCHED_FIFO
    param.sched_priority = 99;
    result = sched_setscheduler(pid, SCHED_FIFO, &param);

    // Use SCHED_OTHER
    // param.sched_priority = 0;
    // result = sched_setscheduler(pid, SCHED_OTHER, &param);

    if (result != 0)
    {
        perror("`sched_setscheduler` failed");
    }

    uint32_t count = 0;
    uint64_t total_loop_time_us = 0; // 64-bit: a 32-bit microsecond total wraps after about 71 minutes
    uint32_t min_loop_time_us = numeric_limits<uint32_t>::max();
    uint32_t avg_loop_time_us = 0;
    uint32_t max_loop_time_us = 0;
    
    while(true)
    {
        count++;

        auto start = steady_clock::now();

        auto loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
        while(loop_time_us < 500)
        { 
            loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
        }

        min_loop_time_us = min((uint32_t)loop_time_us, (uint32_t)min_loop_time_us);
        total_loop_time_us += loop_time_us;
        avg_loop_time_us = total_loop_time_us / count;
        max_loop_time_us = max((uint32_t)loop_time_us, (uint32_t)max_loop_time_us);

        if ((count % 1000) == 0)
        {
            printf("%u %u %u\r", min_loop_time_us, avg_loop_time_us, max_loop_time_us);
            fflush(stdout);
        }
    }

    return 0;
}

I've patched the kernel with the PREEMPT_RT patch. uname reports 5.15.84-v8+ #1613 SMP PREEMPT and everything runs fine.

Kernel command line arguments have isolcpus=3 irqaffinity=0-2 to isolate the 4th core (core 3) and reserve it for the program above. I can see in htop that my program is the only process running on the 4th core (core 3).
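The isolation can also be checked from code rather than by eye in htop. The kernel reports isolated CPUs in /sys/devices/system/cpu/isolated using the same list format as the isolcpus= and irqaffinity= arguments (for example "3" or "0-2"). Below is a minimal sketch of a parser for that format; parse_cpu_list and is_cpu_in_list are hypothetical helper names, not part of any library, and the parser assumes well-formed input.

```cpp
#include <set>
#include <sstream>
#include <string>

// Parse a kernel CPU list such as "3", "0-2" or "0-2,5" into a set of CPU indices.
// Hypothetical helper for illustration; assumes well-formed input.
std::set<int> parse_cpu_list(const std::string &list)
{
    std::set<int> cpus;
    std::stringstream ss(list);
    std::string item;
    while (std::getline(ss, item, ','))
    {
        // Skip empty items, e.g. the lone newline of an empty sysfs file.
        if (item.find_first_of("0123456789") == std::string::npos)
            continue;
        const auto dash = item.find('-');
        const int first = std::stoi(item.substr(0, dash));
        const int last = (dash == std::string::npos)
                             ? first
                             : std::stoi(item.substr(dash + 1));
        for (int cpu = first; cpu <= last; ++cpu)
            cpus.insert(cpu);
    }
    return cpus;
}

// True if `cpu` is a member of the CPU list, e.g. is_cpu_in_list(3, "3").
bool is_cpu_in_list(int cpu, const std::string &list)
{
    return parse_cpu_list(list).count(cpu) != 0;
}
```

One way to use it would be to read the first line of /sys/devices/system/cpu/isolated with std::ifstream and pass it to is_cpu_in_list together with the value returned by sched_getcpu().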

When using the SCHED_FIFO policy, it reports the following minimum, average, and maximum loop times (in microseconds):

MIN AVG MAX
500 522 50042

... and htop reports:

CPU▽ CTXT     PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  3     1   37238 pi         RT   0  4672  1264  1108 R 97.5  0.0  0:07.57 ./realtime_task

When using the SCHED_OTHER policy, it reports the following minimum, average, and maximum loop times (in microseconds):

MIN AVG MAX
500 500 524

... and htop reports:

CPU▽ CTXT     PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  3     0   36065 pi         20   0  4672  1260  1108 R 100.  0.0  1:30.16 ./realtime_task

This is the opposite of what I expect. I expect SCHED_FIFO to give me the lower maximum loop time and fewer context switches. Why am I getting these results?

Solution

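The problem turned out to be realtime throttling. By default the kernel lets realtime tasks run for at most sched_rt_runtime_us (950000) out of every sched_rt_period_us (1000000) microseconds, so a SCHED_FIFO task that never blocks is paused for the remaining 50 ms of each second. That matches the 50042 microsecond maximum loop time in the question. When throttling occurs, a message ("sched: RT throttling activated") appears in the dmesg output.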
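The size of the stall that throttling can cause follows directly from the two sysctl values. Here is a small sketch that reads them and computes the gap; rt_throttle_gap_us, read_sysctl and current_rt_throttle_gap_us are hypothetical helper names, and the fallback values are the kernel defaults, used only as an assumption when the files cannot be read.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Longest pause (in microseconds) that throttling can impose in one period.
// A runtime of -1 means throttling is disabled, so there is no pause.
int64_t rt_throttle_gap_us(int64_t runtime_us, int64_t period_us)
{
    if (runtime_us < 0 || runtime_us >= period_us)
        return 0;
    return period_us - runtime_us;
}

// Read a single integer from a file such as /proc/sys/kernel/sched_rt_runtime_us.
// Returns `fallback` if the file is missing or does not contain a number.
int64_t read_sysctl(const std::string &path, int64_t fallback)
{
    std::ifstream in(path);
    int64_t value = 0;
    if (!(in >> value))
        return fallback;
    return value;
}

// Gap for the running system. The fallbacks (950000 / 1000000) are the
// kernel defaults and are an assumption if the files are unreadable.
int64_t current_rt_throttle_gap_us()
{
    const int64_t runtime = read_sysctl("/proc/sys/kernel/sched_rt_runtime_us", 950000);
    const int64_t period = read_sysctl("/proc/sys/kernel/sched_rt_period_us", 1000000);
    return rt_throttle_gap_us(runtime, period);
}
```

With the default settings the gap is 50000 microseconds, which lines up with the worst-case loop time reported under SCHED_FIFO.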

Once throttling was disabled with echo -1 > /proc/sys/kernel/sched_rt_runtime_us (run as root), the SCHED_FIFO policy worked as expected. When a stressor program is introduced on cores 0-2, SCHED_FIFO performs much better than SCHED_OTHER.

However, there is a better way to avoid realtime throttling without having to disable it for the entire system. In the program listing from the question, change this code ...

auto loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
while(loop_time_us < 500)
{ 
    loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
}

... to this code (which also requires #include <thread>) ...

this_thread::sleep_for(400us);
auto loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
while(loop_time_us < 500)
{
    loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
}

The this_thread::sleep_for call keeps the process from using up its realtime budget (the sched_rt_runtime_us share of each sched_rt_period_us) and thus prevents realtime throttling. Since sleep_for is not very precise, don't sleep for the full 500 microseconds: sleep for 400 and let the while(loop_time_us < 500) loop fill the remaining 100 microseconds or so with a more precise spin-wait.

This method also prevents the realtime core from turning into a space heater.
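The sleep-then-spin idea can be packaged as a reusable helper. The sketch below is a variation on the snippet above, not the exact code from it: it waits on an absolute deadline with this_thread::sleep_until instead of a relative sleep_for, so that timing error does not accumulate from one period to the next. The names hybrid_wait_until and worst_overshoot_us are hypothetical, and the 100 microsecond spin margin is an assumption carried over from the numbers above.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>

using Clock = std::chrono::steady_clock;

// Sleep until `spin_margin` before `deadline`, then busy-wait for the rest.
// Returns the time at which the wait actually ended (never before `deadline`).
Clock::time_point hybrid_wait_until(Clock::time_point deadline,
                                    std::chrono::microseconds spin_margin)
{
    const auto wake = deadline - spin_margin;
    if (Clock::now() < wake)
        std::this_thread::sleep_until(wake); // coarse wait, yields the CPU
    auto now = Clock::now();
    while (now < deadline)                   // precise spin for the final stretch
        now = Clock::now();
    return now;
}

// Run `iterations` periods and return the worst overshoot past a deadline,
// in microseconds. Deadlines are advanced by a fixed step to avoid drift.
int64_t worst_overshoot_us(int iterations,
                           std::chrono::microseconds period,
                           std::chrono::microseconds spin_margin)
{
    auto deadline = Clock::now();
    int64_t worst = 0;
    for (int i = 0; i < iterations; ++i)
    {
        deadline += period;
        const auto woke = hybrid_wait_until(deadline, spin_margin);
        const int64_t late =
            std::chrono::duration_cast<std::chrono::microseconds>(woke - deadline).count();
        worst = std::max(worst, late);
    }
    return worst;
}
```

Calling worst_overshoot_us(1000, std::chrono::microseconds(500), std::chrono::microseconds(100)) under each scheduling policy would give a rough comparison similar to the MAX column in the question. The spin margin is a trade-off: a larger margin tolerates later wake-ups from sleep_until but burns more of the realtime budget.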