
Xargs parallelism in Bash


I have this Bash function that I'm trying to understand, and it uses parallelism:

function get_cache_files() {
    ## The maximum number of parallel processes. 16 since the cache
    ## naming scheme is hex based.
    local max_parallel=${3-16}
    ## Get the cache files running grep in parallel for each top level
    ## cache dir.
    find $2 -maxdepth 1 -type d | xargs -P $max_parallel -n 1 grep -Rl "KEY:.*$1" | sort -u
} # get_cache_files

So my questions:

  1. The comment: "16 since the cache naming scheme is hex based" - naming example is this: php2-mindaugasb.c9.io/5c/c6/348e9a5b0e11fb6cd5948155c02cc65c - why is it important to use 16 processes when the naming scheme is HEX based (hexadecimal system)?
  2. The -P option for xargs is for max-procs:

Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the -n option with -P; otherwise chances are that only one exec will be done.

Ok, so: "xargs -P $max_parallel -n 1" is correct and 16 processes will be initiated? Or should n be equal to $max_parallel also?

  3. As I understand it, the conditions for parallelising are:

    1. The operations work on independent resources (for example, separate files);
    2. The operations are performed on independent computers;

    What other conditions or circumstances allow you to parallelise?


Solution

  • Ok, so: "xargs -P $max_parallel -n 1" is correct and 16 processes will be initiated? Or should n be equal to $max_parallel also?

    Think of several bill counters in a store and a huge number of customers waiting to pay. In this analogy, -P is the number of bill counters (here, 16) and -n is the number of customers one counter can handle at a time (here, 1). In this case, it's easy to picture approximately equal-sized queues at each counter, right?

    From the perspective of the question, max_parallel=${3-16} sets the variable to 16 if the $3 argument is not passed to the function. xargs then runs up to 16 (-P parameter) grep processes in parallel. Each process gets exactly one argument (-n parameter) read from the stdin of xargs, appended as the last command-line parameter; as long as the directory names contain no whitespace, that is one line. Here the stdin of xargs is the output of the find command. Overall, find lists the directories (with -maxdepth 1 that is the cache root $2 itself plus its immediate subdirectories), and that output is consumed by up to 16 grep processes, one line each. Each grep process is invoked as:

    grep -Rl "KEY:.*$1" <one line from find-output/xargs-input>
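
    To watch -P and -n interact without touching a real cache, here is a minimal sketch. The function name demo_dispatch and the three directory names are made up for illustration, and echo stands in for grep so that each line of output shows the command line one worker would receive. With only three input lines, only three processes are started even though the limit defaults to 16.

```shell
# Hypothetical helper, not part of the original script.
demo_dispatch() {
    # Same default idiom as in the question: use $1 if given, else 16.
    local max_parallel=${1-16}
    # Three fake directory names, one per line, stand in for find's output.
    printf '%s\n' cache/0 cache/1 cache/2 |
        xargs -P "$max_parallel" -n 1 echo "grep -Rl KEY on" |
        sort   # parallel output order is not deterministic, so sort it
}

demo_dispatch
```

    Calling demo_dispatch 1 gives the same sorted output, just with the invocations run one after another; -n controls how many arguments each invocation gets, while -P only controls how many invocations run at once.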
    

    The comment: "16 since the cache naming scheme is hex based" - naming example is this: php2-mindaugasb.c9.io/5c/c6/348e9a5b0e11fb6cd5948155c02cc65c - why is it important to use 16 processes when the naming scheme is HEX based (hexadecimal system)?

    I cannot make out the logic behind this, but I think it has more to do with the distribution and volume of the data. If the total number of output lines from find is a multiple of 16, then it probably makes some sense.
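
    One way to check this guess is to count the top-level cache directories. If they are named with a single hex digit there can be at most 16 of them, so 16 workers would mean one grep per directory; with two hex digits, as in the 5c/c6 example path, there can be up to 256. The sketch below is only an illustration: count_top_level_dirs is a hypothetical helper, and the throwaway tree assumes the single-digit layout.

```shell
# Hypothetical helper: count the immediate subdirectories of a cache root.
count_top_level_dirs() {
    # -mindepth 1 skips the root itself, which plain -maxdepth 1 would include.
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Build a throwaway tree with one directory per hex digit (0..f).
demo_root=$(mktemp -d)
for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
    mkdir "$demo_root/$d"
done

count_top_level_dirs "$demo_root"   # prints 16
rm -rf "$demo_root"
```

    Running the helper against the real cache root shows how many grep jobs xargs will have to spread over its workers.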