
How to choose the best buffer size when you need to read large data


Let's assume a scenario where I have a lot of log files for a given system; let's imagine that it's petabytes of data.

Used Technology

  • For my purpose, I'm going to use C/C++.

My Problem

  • I need to read these files, which are on disk, and then do some processing, whether that means sending them to a topic on a pub/sub system or simply displaying the logs on screen.

Questions

  • What buffer size gives the best read performance for this data while also conserving hardware resources such as disk and RAM?
  • I don't know whether I should choose 64 kilobytes, 128 kilobytes, 5 megabytes or 10 megabytes. How do I calculate this?
  • If the calculation depends on the resources I have available, how do I calculate it from those resources?

Solution

  • The optimal buffer size depends on many factors, most notably the hardware. You can find out which size is optimal by picking one size and measuring how long the operation takes, then picking another size, measuring again and comparing. Repeat until you find the optimal size.

    Caveats:

    • You need to measure on hardware that matches the target system for the measurements to be meaningful.
    • You also need to measure with inputs comparable to the target task. You may use a subset of the real data to make measuring faster, but if the subset is too small it may no longer be representative, which reduces the quality of the measurement.
    • It's possible to encounter a local optimum: a buffer size that is faster than slightly larger or slightly smaller buffers, but not as fast as some other size that is much larger or much smaller. General global optimisation techniques, such as simulated annealing, may be used to avoid getting stuck at such a point while searching for the optimal value.
    • Although benchmarking is a simple concept, it's actually quite difficult to do correctly. It's possible, and even likely, that your measurements are biased by incidental factors, unrelated to the buffer size, that change how the system performs. Environment randomisation may help reduce this.
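The measure-and-compare loop described above could look roughly like the following. This is only a minimal sketch, assuming a POSIX system (`open`, `read`, `clock_gettime`); the helper names `time_read` and `fastest_size` are made up for illustration and are not part of any library. It times one sequential pass over a file for each candidate buffer size and reports the fastest.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime, open, read */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: reads the whole file at `path` in chunks of
 * `buf_size` bytes. Returns the elapsed time in seconds, or -1.0 on error.
 * If `total` is not NULL, the number of bytes read is stored in it. */
static double time_read(const char *path, size_t buf_size, long long *total)
{
    if (buf_size == 0)
        return -1.0;
    char *buf = malloc(buf_size);
    if (buf == NULL)
        return -1.0;
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1.0;
    }

    struct timespec t0, t1;
    long long sum = 0;
    ssize_t n;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, buf, buf_size)) > 0)
        sum += n; /* real processing of buf[0..n) would go here */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    free(buf);
    if (n < 0)
        return -1.0;
    if (total != NULL)
        *total = sum;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Hypothetical helper: times every candidate size once and returns the
 * index of the fastest one, or -1 if no measurement succeeded. */
static int fastest_size(const char *path, const size_t *sizes, int count)
{
    int best = -1;
    double best_time = 0.0;
    for (int i = 0; i < count; i++) {
        long long total = 0;
        double t = time_read(path, sizes[i], &total);
        if (t < 0.0)
            continue;
        printf("%zu bytes: %.6f s (%lld bytes read)\n", sizes[i], t, total);
        if (best < 0 || t < best_time) {
            best = i;
            best_time = t;
        }
    }
    return best;
}
```

A single pass per size, as here, is exactly the kind of measurement the caveats warn about: after the first pass the operating system will usually serve the same file from its page cache, so later sizes look faster than they are. Repeating each measurement several times, on different files or with the cache dropped between runs, would give more trustworthy numbers.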

    Typical sizes that may be a good starting point for measuring are the sizes of the caches and similar units on the system:

    • Cache line size
    • L1 cache size
    • L2 cache size
    • L3 cache size
    • Memory page size
    • SSD DRAM cache size
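Some of these sizes can be queried at run time and used as the first candidates to measure. The sketch below assumes Linux with glibc: `_SC_PAGESIZE` is standard POSIX, but the `_SC_LEVEL*` names are glibc extensions, so they are guarded with `#ifdef` and may report 0 or -1 on systems that do not expose the information. The function names `page_size` and `print_candidate_sizes` are made up for illustration. The SSD DRAM cache size is not available through `sysconf` and would have to come from the drive's specifications.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: memory page size in bytes, or -1 if unavailable. */
static long page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Hypothetical helper: prints the sizes that sysconf can report.
 * Values of 0 or -1 mean the system did not provide the information. */
static void print_candidate_sizes(void)
{
    printf("memory page size: %ld bytes\n", page_size());
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    printf("cache line size:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
#endif
#ifdef _SC_LEVEL1_DCACHE_SIZE
    printf("L1 data cache:    %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
#endif
#ifdef _SC_LEVEL2_CACHE_SIZE
    printf("L2 cache:         %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
#endif
#ifdef _SC_LEVEL3_CACHE_SIZE
    printf("L3 cache:         %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
#endif
}
```

The printed values are only starting points for the measurements; whether any of them is actually the fastest buffer size still has to be determined by benchmarking on the target system.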