Sample code:
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <pthread.h>

int
main (int argc, char **argv)
{
  unsigned char buffer[128];
  char buf[0x4000];

  /* Fully buffer stdout with a 16 KiB buffer to cut down on syscalls.  */
  setvbuf (stdout, buf, _IOFBF, 0x4000);

  /* Two forks -> 4 processes, all writing to the same stdout pipe.  */
  fork ();
  fork ();

  /* Request maximum real-time round-robin priority
     (silently fails without CAP_SYS_NICE).  */
  pthread_t this_thread = pthread_self ();
  struct sched_param params;
  params.sched_priority = sched_get_priority_max (SCHED_RR);
  pthread_setschedparam (this_thread, SCHED_RR, &params);

  while (1)
    fwrite (buffer, 128, 1, stdout);
}
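To build and run it (assuming the source is saved as writetest.c, to match the ./writetest invocation below):

$ gcc -O2 -pthread writetest.c -o writetest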
This program forks into 4 processes, each outputting on stdout the contents of "buffer", which is 128 bytes, or 16 long ints on a 64-bit CPU.
If I then run:
./writetest | pv -ptebaSs 800G >/dev/null
I get a speed of about 7.5 GB/s.
Incidentally, that is the same speed I get if I do:
$ mkfifo out
$ dd if=/dev/zero bs=16384 >out &
$ dd if=/dev/zero bs=16384 >out &
$ dd if=/dev/zero bs=16384 >out &
$ dd if=/dev/zero bs=16384 >out &
$ pv <out -ptebaSs 800G >/dev/null
Is there any way to make this faster? Note: in the real program the buffer is not filled with zeroes.
**My curiosity is to understand how much data a single program (multithreaded or multiprocess) can output.**
It looks like 4 people didn't understand this simple question, even though I put the reason for it in bold.
Well, it seems that the Linux scheduler and I/O priorities played a big role in the slowdown.
Also, Spectre and the other CPU vulnerability mitigations came into play.
After further optimization, to achieve a faster speed I had to tune these things:
1) program nice level (nice -n -20)
2) program ionice level (ionice -c 1 -n 7)
3) pipe size increased 8 times (one way to do this from within the program is sketched after this list)
4) disable CPU mitigations by adding "pti=off spectre_v2=off l1tf=off" to the kernel command line
5) tuning the Linux scheduler:
echo -n -1 >/proc/sys/kernel/sched_rt_runtime_us
echo -n -1 >/proc/sys/kernel/sched_rt_period_us
echo -n -1 >/proc/sys/kernel/sched_rr_timeslice_ms
echo -n 0 >/proc/sys/kernel/sched_tunable_scaling
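For item 3, on Linux the writer can also grow the pipe itself with fcntl and F_SETPIPE_SZ, instead of (or in addition to) raising /proc/sys/fs/pipe-max-size. A minimal sketch, assuming the default 64 KiB pipe capacity, so 8x is 512 KiB; the fd must actually be a pipe for the call to succeed:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* Ask for 8x the default 64 KiB pipe capacity.  The kernel rounds
     the request up to a power of two and returns the actual size;
     unprivileged processes are capped by /proc/sys/fs/pipe-max-size.  */
  int newsize = fcntl (STDOUT_FILENO, F_SETPIPE_SZ, 8 * 65536);
  if (newsize < 0)
    perror ("F_SETPIPE_SZ");          /* fails if stdout is not a pipe */
  else
    fprintf (stderr, "pipe capacity is now %d bytes\n", newsize);
  return 0;
}

In the writer above, a single call before the write loop is enough: the capacity is a property of the pipe itself, so it carries over across the fork() calls.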
Now the program outputs 8.00 GB/s (on the same PC)!
If you have other ideas you're welcome to contribute.