c linux macos unix fwrite

How to output a fixed buffer as fast as possible?


Sample code:

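#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sched.h>
#include <pthread.h>

int
main (void)
{
  /* Placeholder contents; the real program fills this with data. */
  unsigned char buffer[128] = { 0 };
  char buf[0x4000];

  /* Fully buffer stdout so output is flushed in large chunks
     rather than once per fwrite call. */
  setvbuf (stdout, buf, _IOFBF, sizeof buf);

  /* Two forks in a row give 4 processes in total, each running the loop. */
  fork ();
  fork ();

  /* Ask for the highest round-robin real-time priority.
     This normally needs root privileges and fails otherwise. */
  struct sched_param params;
  params.sched_priority = sched_get_priority_max (SCHED_RR);
  int err = pthread_setschedparam (pthread_self (), SCHED_RR, &params);
  if (err != 0)
    fprintf (stderr, "pthread_setschedparam: %s\n", strerror (err));

  while (1)
    {
      fwrite (buffer, sizeof buffer, 1, stdout);
    }
}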

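This program starts 4 processes (the two fork () calls each double the process count, so these are processes, not threads). Each one writes the contents of "buffer" to stdout; the buffer is 128 bytes, i.e. 16 long ints on a 64-bit CPU.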

If I then run:

./writetest | pv -ptebaSs 800G >/dev/null

I get a speed of about 7.5 GB/s.

Incidentally, that is the same speed I get if I do:

$ mkfifo out
$ dd if=/dev/zero bs=16384 >out &
$ dd if=/dev/zero bs=16384 >out &
$ dd if=/dev/zero bs=16384 >out &
$ dd if=/dev/zero bs=16384 >out &
$ pv <out -ptebaSs 800G >/dev/null

Is there any way to make this faster? Note: the buffer in the real program is not filled with zeroes.

I am curious to understand how much data a single program (multithreaded or multiprocess) can output.

It looks like 4 people didn't understand this simple question. I even put the reason for the question in bold.


Solution

  • It seems that the Linux scheduler and I/O priorities played a big role in the slowdown.

    Also, Spectre and other CPU vulnerability mitigations came into play.

    After further optimization, to achieve a faster speed I had to tune these things:

    1) program nice level (nice -n -20)
    2) program ionice level (ionice -c 1 -n 7)
    3) pipe size increased 8 times
    4) CPU mitigations disabled by adding "pti=off spectre_v2=off l1tf=off" to the kernel command line
    5) Linux scheduler tuned as follows:
    
    echo -n -1 >/proc/sys/kernel/sched_rt_runtime_us
    echo -n -1 >/proc/sys/kernel/sched_rt_period_us
    echo -n -1 >/proc/sys/kernel/sched_rr_timeslice_ms
    echo -n 0 >/proc/sys/kernel/sched_tunable_scaling
    

    Now the program outputs (on the same PC) 8.00 GB/s!
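The list above does not say how the pipe was enlarged. On Linux, one way it could be done from inside the program is fcntl with F_SETPIPE_SZ. The sketch below assumes that approach; grow_pipe is a hypothetical helper name, not part of the original program. Requests above the limit in /proc/sys/fs/pipe-max-size are refused for unprivileged processes, in which case the helper just reports the size the pipe ended up with.

```c
#define _GNU_SOURCE             /* needed for F_SETPIPE_SZ / F_GETPIPE_SZ */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to multiply the capacity of the pipe behind fd by factor.
   Returns the resulting capacity in bytes (unchanged if the kernel
   refused the request), or -1 if fd does not refer to a pipe. */
int grow_pipe(int fd, int factor)
{
    int current = fcntl(fd, F_GETPIPE_SZ);
    if (current < 0)
        return -1;                  /* e.g. a terminal or a regular file */

    if (fcntl(fd, F_SETPIPE_SZ, current * factor) < 0)
        perror("F_SETPIPE_SZ");     /* e.g. EPERM above pipe-max-size */

    return fcntl(fd, F_GETPIPE_SZ);
}
```

Calling grow_pipe (STDOUT_FILENO, 8) before the write loop would mirror the 8x increase mentioned above, provided stdout really is a pipe. The capacity belongs to the pipe itself, so it only has to be set once, from either end.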

    If you have other ideas you're welcome to contribute.
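One further idea that might be worth measuring, offered as an untested assumption rather than a confirmed speedup: skip stdio, replicate the 128-byte pattern into one large block once, and hand that block to write(2) directly so that each system call moves more data. The helper names fill_block and write_all below are hypothetical, and the best block size would have to be found by experiment.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Fill block[0..block_len) with back-to-back copies of pattern.
   If block_len is not a multiple of pattern_len, the last copy is cut short. */
void fill_block(unsigned char *block, size_t block_len,
                const unsigned char *pattern, size_t pattern_len)
{
    if (pattern_len == 0)
        return;                     /* nothing to replicate */
    for (size_t off = 0; off < block_len; off += pattern_len) {
        size_t left = block_len - off;
        memcpy(block + off, pattern, left < pattern_len ? left : pattern_len);
    }
}

/* Write all len bytes to fd, retrying after short writes and EINTR.
   Returns 0 on success, -1 on error (errno is left as set by write). */
int write_all(int fd, const unsigned char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            return -1;
        }
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}
```

In the original program this would replace the fwrite loop: fill a static block (for example 1 MiB, an arbitrary choice) from "buffer" once with fill_block, then call write_all (STDOUT_FILENO, block, sizeof block) in the loop until it returns -1.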