Tags: c, io, posix, buffer, strategy

How to determine a reasonable number of bytes to read per read system call?


I am playing with file reading/writing but have difficulty deciding how large to make my read buffer for the "read" system call.

In particular, I am looking at "http://pubs.opengroup.org/onlinepubs/009695399/functions/read.html"

It doesn't seem to state any restriction on how many bytes I can read at once, other than SSIZE_MAX.

To make matters worse, if I make an array of SSIZE_MAX characters, the program yields:

sh: ./codec: Bad file number

Is there any reasonable way to decide how many bytes to read per read system call? My concern is that this may vary from system to system (I can't just keep making larger reads until one fails to determine the exact number of bytes I can read at once, and even if I could, reading that much won't necessarily be any faster than reading fewer bytes).
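
For concreteness, this is roughly the kind of loop I mean (a sketch, not my actual code); BUF_SIZE is the value I don't know how to choose:

    #include <unistd.h>
    #include <errno.h>

    #define BUF_SIZE 65536   /* <-- the number I'm unsure about */

    /* Copy everything from infd to outfd, BUF_SIZE bytes per read(). */
    static int copy_fd(int infd, int outfd)
    {
        char buf[BUF_SIZE];

        while (1) {
            ssize_t n = read(infd, buf, sizeof buf);
            if (n == 0)
                return 0;               /* end of input */
            if (n == -1) {
                if (errno == EINTR)
                    continue;           /* interrupted, retry */
                return -1;              /* real error */
            }

            char *p = buf;
            while (n > 0) {
                ssize_t w = write(outfd, p, (size_t)n);
                if (w == -1) {
                    if (errno == EINTR)
                        continue;
                    return -1;
                }
                p += w;
                n -= w;
            }
        }
    }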

One idea I had was to check my CPU cache size and try to make my buffer no larger than that, but since I don't know how CPU caches work, I am not sure if this is necessarily correct.

Thanks ahead of time.


Solution

  • I've pondered basically the same question, and I've come to a very simple conclusion:

    Use a conservative default or heuristic, but let the user override it easily if they want.

    You see, in some cases the user might not want maximum throughput from your utility, but would rather have it do whatever it does in the background. Perhaps the task is just not that important. Personally, on Linux, I often use the nice and ionice utilities to put long-but-not-priority tasks on the back burner, so to speak, so that they don't interfere with my actual work.

    Benchmarks from the last decade indicate that block sizes of 128k to 2M (2^17 to 2^21 bytes) consistently work well, not far from optimal rates in almost all situations, with the average slowly shifting towards the larger end of that range. Typically, power-of-two sizes seem to work better than non-powers-of-two, although I haven't seen enough benchmarks of various RAID configurations to trust that fully.

    Because your utility will almost certainly be recompiled for each new hardware type/generation, I'd prefer to have a default block size, defined at compile time, but have it trivially overridden at run time (via a command-line option, environment variable, and/or configuration file).
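
    As a rough sketch of what I mean (the READ_BLOCK_SIZE environment variable name and the 256 KiB default are just placeholders, not anything standard):

        #include <stdlib.h>
        #include <stddef.h>

        /* Compile-time default; override with e.g. -DDEFAULT_BLOCK_SIZE=1048576. */
        #ifndef DEFAULT_BLOCK_SIZE
        #define DEFAULT_BLOCK_SIZE 262144   /* 256 KiB */
        #endif

        /* Pick the block size: a (hypothetical) READ_BLOCK_SIZE environment
         * variable wins, otherwise the compile-time default is used.
         * A real utility would also accept a command-line option. */
        static size_t block_size(void)
        {
            const char *env = getenv("READ_BLOCK_SIZE");
            if (env && *env) {
                char *end = NULL;
                unsigned long val = strtoul(env, &end, 0);
                if (end && *end == '\0' && val > 0)
                    return (size_t)val;
            }
            return DEFAULT_BLOCK_SIZE;
        }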

    If your utility is packaged for current POSIXy OSes, the binaries could use a default that best suits the types of tasks typically done on that class of machine; for example, Raspberry Pis and other SBCs often don't have that much memory to start with, so a smaller default block size (say, 65536 bytes) might work best. Desktop users might not care about memory hogs, so you might use a much larger default block size on current desktop machines.

    (On servers, and in high-performance computing (which is where I've pondered this), the block size is basically either benchmarked on the exact hardware and workload, or it is just a barely informed guess. Typically the latter.)

    Alternatively, you could construct a heuristic based on the st_blksize values of the files involved, perhaps multiplied by a default factor and clamped to some preferred range, as in the sketch below. However, such heuristics tend to bit-rot fast as hardware changes.
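
    For example, something along these lines; the factor and the clamping bounds are arbitrary example values rather than recommendations:

        #include <sys/stat.h>
        #include <stddef.h>

        /* Heuristic block size for reading from fd: a multiple of the
         * filesystem's preferred I/O size (st_blksize), clamped to a range.
         * The factor and the bounds are arbitrary example values. */
        static size_t heuristic_block_size(int fd)
        {
            const size_t minimum = 65536;      /* 64 KiB */
            const size_t maximum = 2097152;    /* 2 MiB  */
            const size_t factor  = 32;
            struct stat st;
            size_t size = minimum;

            if (fstat(fd, &st) == 0 && st.st_blksize > 0)
                size = (size_t)st.st_blksize * factor;

            if (size < minimum)
                size = minimum;
            if (size > maximum)
                size = maximum;
            return size;
        }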

    With heuristics, it is important to remember that the idea is not to always achieve the optimum, but to avoid really poor results. If a user wants to squeeze out the last few percent of performance, they can do some benchmarking within their own workflow, and tune the defaults accordingly. (I personally have, and do.)
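
    If you want a starting point for such benchmarking, a simple timed read pass over a representative file, repeated at a few candidate block sizes, already tells you a lot. A minimal sketch (single pass per size; in practice you'd repeat runs and watch out for the page cache skewing results):

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <unistd.h>

        /* Read the whole file once with the given block size and return the
         * elapsed wall-clock time in seconds, or -1.0 on error. */
        static double timed_pass(const char *path, size_t bufsize)
        {
            int fd = open(path, O_RDONLY);
            if (fd == -1)
                return -1.0;

            char *buf = malloc(bufsize);
            if (!buf) {
                close(fd);
                return -1.0;
            }

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);

            ssize_t n;
            while ((n = read(fd, buf, bufsize)) > 0)
                ;   /* discard the data; we only care about the timing */

            clock_gettime(CLOCK_MONOTONIC, &t1);

            free(buf);
            close(fd);
            if (n == -1)
                return -1.0;

            return (double)(t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        }

        int main(int argc, char *argv[])
        {
            if (argc < 2) {
                fprintf(stderr, "Usage: %s FILE\n", argv[0]);
                return 1;
            }

            /* Candidate sizes: powers of two from 64 KiB to 2 MiB. */
            for (size_t size = 65536; size <= 2097152; size *= 2)
                printf("%8zu bytes: %.3f s\n", size, timed_pass(argv[1], size));

            return 0;
        }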