I have an application that writes a huge amount of data to a file using ofstream.
Today by chance, I found several examples like this one:
const size_t bufsize = 256*1024;
char buf[bufsize];
mystream.rdbuf()->pubsetbuf(buf, bufsize);
What is the default value? 4KB? 16KB? Is there any way to find out?
What is the optimal value to use here? 256KB? 1MB? What if we can spare 1GB?
First, let's discuss what's actually going on. What we are controlling are mostly some memcpys.
ofstream copies that structure into its internal buffer (step 2). If the data structure is larger than the buffer, step 2 is skipped. Step 2 also requires locking and unlocking of a mutex unless you keep it locked via C++20's osyncstream.
In my experience step 2 can have significant overhead. So what you can do is artificially increase the size of step 1 by buffering multiple smaller write requests. Here is a simple benchmark to test this:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <memory>
#include <random>
#include <vector>

int main(int argc, char** argv)
{
    if(argc != 5) {
        std::cerr << "Usage: " << (argc ? argv[0] : "binary")
                  << " filename filesize filebuffer membuffer\n";
        return 1;
    }
    const char* filename = argv[1];
    /* Total file size to hit or exceed */
    unsigned long long filesize = std::strtoull(argv[2], nullptr, 10);
    /*
     * Size of the ofstream-internal buffer
     * Setting this to 0 uses the platform-default
     */
    unsigned long long filebufsize = std::strtoull(argv[3], nullptr, 10);
    /*
     * Number of bytes to buffer before calling into ofstream
     * Setting this to 0 is equivalent to calling ofstream directly
     */
    unsigned long long membufsize = std::strtoull(argv[4], nullptr, 10);

    auto filebuf = std::make_unique<char[]>(filebufsize);
    std::ofstream out(filename);
    /*
     * Note: whether pubsetbuf has any effect once the file is open
     * is implementation-defined
     */
    if(filebufsize > 0)
        out.rdbuf()->pubsetbuf(filebuf.get(), filebufsize);

    std::default_random_engine rng;
    // 100-200 bytes at once
    std::uniform_int_distribution<std::size_t> len_distr(100, 200);
    std::vector<char> membuf;
    for(std::size_t written = 0; written < filesize; written += membuf.size()) {
        membuf.clear();
        do {
            /* Simulates buffering multiple data blocks before calling ofstream */
            std::size_t blocksize = len_distr(rng);
            membuf.resize(membuf.size() + blocksize);
        } while(membuf.size() < membufsize);
        out.write(membuf.data(), membuf.size());
    }
}
And here is a bash script to run parameter combinations with the default values and buffer sizes between 4 kiB and 1 GiB.
#!/bin/bash

FILE=/dev/null
# 10 GiB
FILESIZE=$((10*1024**3))

run() {
    local filebuf="$1"
    local membuf="$2"
    echo "$filebuf $membuf"
    time -p ./a.out "$FILE" "$FILESIZE" "$filebuf" "$membuf"
}

# warmup. ignore first run
run 0 0
run 0 0
for((membuf=4096; membuf<=$((1024**3)); membuf*=2)); do
    run 0 $membuf
done
for((filebuf=4096; filebuf<=$((1024**3)); filebuf*=2)); do
    run $filebuf 0
    for((membuf=4096; membuf<filebuf; membuf*=2)); do
        run $filebuf $membuf
    done
done
My hypothesis is that the first 3 steps are fastest if the buffer sizes are below the level 2 cache size so that memory bandwidth is maximized. I've tested this on a Threadripper CPU. The first two test runs show this result:
0 0
real 3,99
user 3,79
sys 0,20
0 4096
real 1,72
user 1,33
sys 0,38
"0 0" means we simply call ofstream::write
with 100-200 bytes at a time, no further changes. "0 4096" means we buffer about 4 kiB of data before calling ofstream::write. This already cuts the runtime by more than half! The overhead of ofstream is significant. I will not show all data. Larger vector sizes show relatively flat performance between 64 kiB and 2 MiB. In this particular run, the best performance was 1.25 seconds with 256 kiB. Even larger sizes degrade performance, as expected.
0 131072
real 1,29
user 1,28
sys 0,01
0 262144
real 1,25
user 1,24
sys 0,00
0 524288
real 1,29
user 1,28
sys 0,00
0 1048576
real 1,25
user 1,24
sys 0,00
[...]
0 536870912
real 1,46
user 1,36
sys 0,09
0 1073741824
real 1,57
user 1,40
sys 0,17
Increasing the ofstream buffer size to similar levels has no positive effect, e.g.
262144 0
real 3,81
user 3,64
sys 0,16
262144 4096
real 1,71
user 1,39
sys 0,32
All other changes basically verify these trends. The worst performance happens when increasing the ofstream buffer to 1 GiB without using larger memory buffers:
1073741824 0
real 4,15
user 3,83
sys 0,32
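A caveat for anyone repeating the ofstream buffer runs: the effect of setbuf on a file stream is implementation-defined. libstdc++ (GCC), for instance, ignores the call once a file is open, so a buffer installed after opening may not be used at all on that library. Below is a minimal sketch, assuming you want the custom buffer to take effect there, that installs the buffer before opening the file. The helper name open_with_buffer is made up for illustration and is not part of the benchmark above.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: installs a caller-owned buffer, then opens the file.
// The buffer must stay alive for as long as the stream uses it.
bool open_with_buffer(std::ofstream& out, const std::string& path,
                      std::vector<char>& buf)
{
    // Precondition: no file is associated with the stream yet.
    assert(!out.is_open());
    out.rdbuf()->pubsetbuf(buf.data(),
                           static_cast<std::streamsize>(buf.size()));
    out.open(path, std::ios::binary);
    return out.is_open();
}
```

Declare the buffer before the stream so that the stream is destroyed (and flushed) first. Whether the larger buffer then helps is a separate question that would need to be measured again.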
Although not shown, tests on a tmpfs RAM disk have similar performance numbers, just with higher sys load.
In a second test, I changed the file to an Ext4 filesystem on an NVMe SSD. Here the higher overhead of the ofstream doesn't matter because it can still outperform the SSD. 33 seconds for 10 GiB means we get about 310 MiB/s. We still save CPU time by pre-buffering, though.
0 0
real 33,46
user 3,77
sys 6,49
0 4096
real 33,27
user 1,47
sys 7,32
Beyond that, there really isn't much to see and I aborted the test before finishing the 200 MiB block size.
Performance figures might look very different if you run on a network or cluster filesystem. Those tend to favor larger block sizes, but that might be more important for reading than for writing, to reduce the number of network roundtrips. Multithreading also helps keep every component of the data transfer busy at all times.
On faster local filesystems like RAIDs of high-performance U.2 SSDs or large RAID6 HDD arrays, I find that normal page-cached IO cannot exhaust disk bandwidth. In these cases I switch over to direct IO, about 1 MiB size per block, maybe overlapping 4 blocks (asynchronous IO, threads, Windows overlapped IO). However, if your total write size is smaller than main memory, you might still want to accept the slower page-cached write performance in exchange for keeping the data in cache for reading.
For anything you might find on a regular old desktop system, don't bother with ofstream buffer sizes. Buffer outside the ofstream for a few hundred kiB or wait until C++20 is widely available on all platforms that you want to support, then try osyncstream.
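Buffering outside the ofstream could look roughly like the sketch below. It is not the code used for the measurements; the class name ChunkedWriter and its interface are made up for illustration. It collects small writes in a vector and hands them to the stream only when the chosen capacity would be exceeded.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ostream>
#include <vector>

// Hypothetical helper: collects small writes and forwards them to the
// stream in large chunks, so ostream::write is called far less often.
class ChunkedWriter {
public:
    ChunkedWriter(std::ostream& out, std::size_t capacity)
        : out_(out), capacity_(capacity)
    {
        assert(capacity > 0);
        buf_.reserve(capacity);
    }

    // Forward whatever is left when the writer goes out of scope.
    ~ChunkedWriter() { flush(); }

    ChunkedWriter(const ChunkedWriter&) = delete;
    ChunkedWriter& operator=(const ChunkedWriter&) = delete;

    void write(const char* data, std::size_t len)
    {
        // A request that does not fit the remaining space forces a flush.
        if (buf_.size() + len > capacity_)
            flush();
        if (len >= capacity_) {
            // Large requests gain nothing from an extra copy; pass through.
            out_.write(data, static_cast<std::streamsize>(len));
            return;
        }
        buf_.insert(buf_.end(), data, data + len);
    }

    // Hands the collected bytes to the stream; does not flush the stream.
    void flush()
    {
        if (!buf_.empty()) {
            out_.write(buf_.data(),
                       static_cast<std::streamsize>(buf_.size()));
            buf_.clear();
        }
    }

private:
    std::ostream& out_;
    std::size_t capacity_;
    std::vector<char> buf_;
};
```

To use it, construct the writer after the ofstream, e.g. with a capacity of 256 * 1024, and call write for each 100-200 byte block. The writer must be destroyed before the stream so that the remaining bytes are forwarded. The capacity is the knob corresponding to the membuffer argument of the benchmark.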