Tags: c++, iostream

What is the optimal size of iostream::rdbuf() for a write-heavy application?


I have an application that writes a huge amount of data to a file using ofstream.

Today, by chance, I found several examples like this one:

const size_t bufsize = 256*1024;
char buf[bufsize];
mystream.rdbuf()->pubsetbuf(buf, bufsize);

What is the default value? 4 KB? 16 KB? Is there any way to find out?

What is the optimal value to use here? 256 KB? 1 MB? What if we can spare 1 GB?


Solution

  • First, let's discuss what's actually going on. What we are controlling here is mostly a chain of memcpys:

    1. We fill our own data structure
    2. ofstream copies that structure into its internal buffer
    3. The kernel copies that buffer into the page cache
    4. The disk subsystem reads the page cache via DMA

    If the data structure is larger than the buffer, step 2 is skipped. Step 2 also requires locking and unlocking of a mutex unless you keep it locked via C++20's osyncstream.
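
    As a side note, here is a minimal sketch of how std::osyncstream can batch writes into a single transfer. It assumes a C++20 standard library that ships <syncstream>; the filename and the loop are placeholders for illustration only:

    #include <fstream>
    #include <syncstream> // C++20

    int main()
    {
      std::ofstream out("log.txt"); // placeholder filename
      {
        /*
         * The osyncstream collects everything written to it in its own
         * internal buffer and transfers it to `out` in one piece when
         * emit() is called or when the osyncstream is destroyed.
         */
        std::osyncstream synced(out);
        for(int i = 0; i < 1000; ++i)
          synced << "line " << i << '\n';
      } // implicit emit(): one synchronized transfer into out
    }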

    In my experience step 2 can have significant overhead. So what you can do is artificially increase the size of step 1 by buffering multiple smaller write requests. Here is a simple benchmark to test this:

    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <memory>
    #include <random>
    #include <vector>
    
    
    int main(int argc, char** argv)
    {
      if(argc != 5) {
        std::cerr << "Usage: " << (argc ? argv[0] : "binary")
                  << " filename filesize filebuffer membuffer\n";
        return 1;
      }
      const char* filename = argv[1];
      /* Total file size to hit or exceed */
      unsigned long long filesize = std::strtoull(argv[2], nullptr, 10);
      /*
       * Size of the ofstream-internal buffer
       * Setting this to 0 uses the platform-default
       */
      unsigned long long filebufsize = std::strtoull(argv[3], nullptr, 10);
      /*
       * Number of bytes to buffer before calling into ofstream
       * Setting this to 0 is equivalent to calling ofstream directly
       */
      unsigned long long membufsize = std::strtoull(argv[4], nullptr, 10);
    
      auto filebuf = std::make_unique<char[]>(filebufsize);
      std::ofstream out(filename);
      if(filebufsize > 0)
        out.rdbuf()->pubsetbuf(filebuf.get(), filebufsize);
      std::default_random_engine rng;
      // 100-200 bytes at once
      std::uniform_int_distribution<std::size_t> len_distr(100, 200);
      std::vector<char> membuf;
      for(std::size_t written = 0; written < filesize; written += membuf.size()) {
        membuf.clear();
        do {
          /* Simulates buffering multiple data blocks before calling ofstream */
          std::size_t blocksize = len_distr(rng);
          membuf.resize(membuf.size() + blocksize);
        } while(membuf.size() < membufsize);
        out.write(membuf.data(), membuf.size());
      }
    }
    

    And here is a bash script that runs the parameter combinations: the platform defaults plus buffer sizes between 4 kiB and 1 GiB.

    #!/bin/bash
    
    FILE=/dev/null
    # 10 GiB
    FILESIZE=$((10*1024**3))
    
    run() {
        local filebuf="$1"
        local membuf="$2"
        echo "$filebuf $membuf"
        time -p ./a.out "$FILE" "$FILESIZE" "$filebuf" "$membuf"
    }
    
    # warmup. ignore first run
    run 0 0
    run 0 0
    for((membuf=4096; membuf<=$((1024**3)); membuf*=2)); do
        run 0 $membuf
    done
    for((filebuf=4096; filebuf<=$((1024**3)); filebuf*=2)); do
        run $filebuf 0
        for((membuf=4096; membuf<filebuf; membuf*=2)); do
            run $filebuf $membuf
        done
    done
    

    /dev/null test

    My hypothesis is that the first three steps are fastest if the buffer sizes stay below the level 2 cache size, so that memory bandwidth is maximized. I've tested this on a Threadripper CPU. The first two test runs show this result:

    0 0
    real 3,99
    user 3,79
    sys 0,20
    0 4096
    real 1,72
    user 1,33
    sys 0,38
    

    "0 0" means we simply call ofstream::write with 100-200 bytes at a time, with no further changes. "0 4096" means we buffer about 4 kiB of data before calling ofstream::write. That alone cuts the runtime by more than half! The overhead of ofstream is significant. I will not show all the data; larger vector sizes show relatively flat performance between 64 kiB and 2 MiB. In this particular run, the best result was 1.25 seconds at 256 kiB. Sizes beyond that degrade performance again, as expected.

    0 131072
    real 1,29
    user 1,28
    sys 0,01
    0 262144
    real 1,25
    user 1,24
    sys 0,00
    0 524288
    real 1,29
    user 1,28
    sys 0,00
    0 1048576
    real 1,25
    user 1,24
    sys 0,00
    
    [...]
    
    0 536870912
    real 1,46
    user 1,36
    sys 0,09
    0 1073741824
    real 1,57
    user 1,40
    sys 0,17
    

    Increasing the ofstream buffer size to similar levels has no positive effect, e.g.

    262144 0
    real 3,81
    user 3,64
    sys 0,16
    262144 4096
    real 1,71
    user 1,39
    sys 0,32
    

    All other combinations basically confirm these trends. The worst performance happens when increasing the ofstream buffer to 1 GiB without also using larger memory buffers:

    1073741824 0
    real 4,15
    user 3,83
    sys 0,32
    

    Although not shown here, tests on a tmpfs RAM disk show similar performance numbers, just with a higher sys load.

    Real file system test

    In a second test, I changed the file to an Ext4 filesystem on an NVMe SSD. Here the higher overhead of the ofstream doesn't matter because it can still outpace the SSD: 33 seconds for 10 GiB means we get about 310 MiB/s. We still save CPU time by pre-buffering, though.

    0 0
    real 33,46
    user 3,77
    sys 6,49
    0 4096
    real 33,27
    user 1,47
    sys 7,32
    

    Beyond that, there really isn't much to see, and I aborted the test before finishing the 200 MiB block size.

    Other aspects

    Performance figures might look very different if you run on a network or cluster filesystem. Those tend to favor larger block sizes, but that might be more important for reading than for writing, in order to reduce the number of network round trips. Multithreading also helps keep every component of the data transfer busy at all times.

    On faster local filesystems like RAIDs of high-performance U.2 SSDs or large RAID6 HDD arrays, I find that normal page-cached IO cannot exhaust the disk bandwidth. In these cases I switch over to direct IO with about 1 MiB per block, maybe keeping 4 blocks in flight (asynchronous IO, threads, Windows overlapped IO). However, if your total write size is smaller than main memory, you might still want to accept the slower page-cached write performance in exchange for keeping the data in the cache for reading.
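
    To illustrate the direct IO variant, here is a minimal Linux-only sketch using O_DIRECT with 1 MiB blocks. It is deliberately simplified compared to the setup described above: it writes synchronously with no overlapping requests, the filename and the 1 GiB total are placeholders, and the required alignment (4 KiB here) depends on the device and filesystem:

    // O_DIRECT needs _GNU_SOURCE, which g++ on Linux defines by default
    #include <cstring>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main()
    {
      const size_t blocksize = 1 << 20; // 1 MiB per write request
      void* block = nullptr;
      /* O_DIRECT requires an aligned buffer; 4 KiB covers common devices */
      if(posix_memalign(&block, 4096, blocksize) != 0)
        return 1;
      std::memset(block, 'x', blocksize); // stand-in for real payload

      int fd = open("direct.bin", // placeholder filename
                    O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
      if(fd < 0)
        return 1;

      /* Write 1 GiB in aligned 1 MiB chunks, bypassing the page cache */
      for(int i = 0; i < 1024; ++i)
        if(write(fd, block, blocksize) != static_cast<ssize_t>(blocksize))
          return 1;

      close(fd);
      free(block);
    }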

    Conclusion

    For anything you might find on a regular old desktop system, don't bother with the ofstream buffer size. Instead, buffer a few hundred kiB outside the ofstream, or wait until C++20 is widely available on all platforms you want to support and then try osyncstream.
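
    As a starting point for the "buffer outside the ofstream" approach, here is a minimal sketch. The class name, the 256 kiB threshold and the write loop are arbitrary choices for illustration, not something prescribed by the measurements above:

    #include <cstddef>
    #include <fstream>
    #include <vector>

    /*
     * Collects small writes in memory and forwards them to the ofstream in
     * larger chunks, mirroring what the benchmark above does by hand.
     */
    class BufferedWriter
    {
    public:
      explicit BufferedWriter(const char* filename,
                              std::size_t threshold = 256 * 1024)
        : out_(filename, std::ios::binary), threshold_(threshold)
      { buffer_.reserve(threshold_); }

      ~BufferedWriter()
      { flush(); }

      void write(const char* data, std::size_t n)
      {
        buffer_.insert(buffer_.end(), data, data + n);
        if(buffer_.size() >= threshold_)
          flush();
      }

      void flush()
      {
        if(!buffer_.empty()) {
          out_.write(buffer_.data(), buffer_.size());
          buffer_.clear();
        }
      }

    private:
      std::ofstream out_;
      std::size_t threshold_;
      std::vector<char> buffer_;
    };

    int main()
    {
      BufferedWriter writer("out.bin"); // placeholder filename
      const char record[150] = {};      // stands in for a 100-200 byte record
      for(int i = 0; i < 100000; ++i)
        writer.write(record, sizeof(record));
    } // the destructor flushes the remaining tail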