
GNU Parallel -- How to understand the "block-size" setting, and guess what to set it to?


How do I set the block-size parameter when running grep with GNU Parallel on a single machine with multiple cores? I'd like to pick it based on the size of large_file, the size of small_file, and the machine I'm using, to get the fastest performance possible (or please correct me if there is something else I'm missing here). What performance issues/speed bottlenecks will I run into if I set it too high or too low? I understand what block-size does, in that it chops large_file into chunks and sends those chunks to each job, but I'm still missing how and why that choice would impact the speed of execution.

The command in question:

parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv

where large_file.csv has in it:

123456    1
234567    2
345667    22

and where small_file.csv has in it:

    1$
    2$

and so on...

Thank you!


Solution

  • parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
    

    --block -1 will split large_file.csv into one block per jobslot (here: 10 chunks). The splitting is done on the fly, so the file is not read into RAM first to do the splitting.
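
    A quick way to see what --block -1 does is to replace grep with wc -l, so each jobslot simply counts the lines of the chunk it receives (a sketch; wc -l is only a stand-in for the real command):

    parallel --pipepart --block -1 --jobs 10 -a large_file.csv wc -l

    This prints 10 line counts, one per chunk. Since --pipepart splits on line boundaries, the counts should come out roughly equal for a file whose lines are of similar length.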

    Splitting into n evenly sized blocks (where n = the number of jobs run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop the file into more, smaller blocks. E.g. --block -10 will split into 10 times as many blocks as --block -1.
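
    For example, this variant (only the block factor changed from the command above) chops large_file.csv into 100 chunks for the 10 jobslots, so one slow chunk can only hold up a single jobslot briefly:

    parallel --pipepart --block -10 --jobs 10 -a large_file.csv grep -f small_file.csv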

    The optimal value can seldom be guessed in advance, because it also depends on factors such as how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, or command startup time.
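
    A minimal way to run that experiment is to loop over a few candidate block sizes and time each run (a sketch; the candidate values below are only guesses meant to bracket the search, and output is discarded so writing to the terminal does not become the bottleneck):

    for b in -1 -2 -10 10M 100M; do
        echo "--block $b"
        time parallel --pipepart --block $b --jobs 10 -a large_file.csv \
            grep -f small_file.csv > /dev/null
    done

    While the runs are going, a tool like top or iostat will show whether the CPUs or the disk saturate first.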