So I have this function in BASH that I'm trying to understand - and it uses parallelism:
function get_cache_files() {
    ## The maximum number of parallel processes. 16 since the cache
    ## naming scheme is hex based.
    local max_parallel=${3-16}
    ## Get the cache files running grep in parallel for each top level
    ## cache dir.
    find $2 -maxdepth 1 -type d | xargs -P $max_parallel -n 1 grep -Rl "KEY:.*$1" | sort -u
} # get_cache_files
So my questions:
The comment says "16 since the cache naming scheme is hex based". A naming example is php2-mindaugasb.c9.io/5c/c6/348e9a5b0e11fb6cd5948155c02cc65c - why is it important to use 16 processes when the naming scheme is hex based (hexadecimal system)?
The xargs man page says this about -P: "Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the -n option with -P; otherwise chances are that only one exec will be done."
Ok, so: "xargs -P $max_parallel -n 1" is correct and 16 processes will be initiated? Or should n be equal to $max_parallel also?
As I understand the conditions to parallelise are:
What are other conditions, circumstances when you can parallelise?
Ok, so: "xargs -P $max_parallel -n 1" is correct and 16 processes will be initiated? Or should n be equal to $max_parallel also?
Think of several bill counters in a store and a huge number of customers waiting to pay. In this analogy, -P is the number of bill counters (here, 16) and -n is the number of customers one counter handles at a time (here, 1). In this case, it's easy to picture approximately equal-sized queues at each counter, right?
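The counter analogy can be tried out with a small sketch. It uses echo as a stand-in for grep and four made-up items (both are assumptions for illustration, not part of the original function). With -n 1 every invocation receives a single argument, so four inputs produce four invocations; with -n 2 they are batched into two.

```shell
#!/usr/bin/env bash
# Hypothetical input: four "customers", one per line.
items() { printf '%s\n' a b c d; }

# -n 1: each echo invocation gets exactly one argument.
# -P 4: up to four invocations may run at the same time.
# Output order is not guaranteed when -P > 1, so only count the lines.
one_per_call=$(items | xargs -P 4 -n 1 echo | wc -l)

# -n 2: each invocation gets two arguments, so there are half as many calls.
two_per_call=$(items | xargs -P 4 -n 2 echo | wc -l)

# $(( ... )) strips the padding some wc implementations add.
echo "invocations with -n 1: $((one_per_call))"
echo "invocations with -n 2: $((two_per_call))"
```

So -n does not need to equal -P: -P caps how many processes run concurrently, while -n decides how many arguments each process is handed.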
From the perspective of the question, max_parallel=${3-16} means that the variable is set to 16 if the $3 argument is not passed to the function. xargs runs up to 16 (the -P parameter) grep processes in parallel. Each process gets exactly one item (the -n parameter) from the stdin of xargs, appended as its last command-line argument. Here the stdin of xargs is the output of the find command, so each item is one directory path (as long as the paths contain no whitespace). Overall, find lists the directories, and its output is consumed line by line by up to 16 grep processes at a time. Each grep process is invoked as:
grep -Rl "KEY:.*$1" <one line from find-output/xargs-input>
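The default for max_parallel comes from Bash parameter expansion. A minimal sketch, using a hypothetical helper show_parallel and placeholder arguments that are not part of the original script:

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the line: local max_parallel=${3-16}
show_parallel() {
    # ${3-16}: use $3 if it is set (even if empty), otherwise 16.
    local max_parallel=${3-16}
    echo "$max_parallel"
}

default_value=$(show_parallel pattern /some/dir)      # $3 not passed
explicit_value=$(show_parallel pattern /some/dir 4)   # $3 passed
empty_value=$(show_parallel pattern /some/dir "")     # $3 set but empty

echo "default: $default_value, explicit: $explicit_value, empty: '$empty_value'"
```

Note that ${3-16} only falls back to 16 when $3 is unset; an empty third argument stays empty. The ${3:-16} form would also cover the empty case.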
The comment: "16 since the cache naming scheme is hex based" - naming example is this: php2-mindaugasb.c9.io/5c/c6/348e9a5b0e11fb6cd5948155c02cc65c - why is it important to use 16 processes when the naming scheme is HEX based (hexadecimal system)?
I cannot make out the logic behind this, but I think it has more to do with the distribution and volume of the data. If the total number of output lines from find is a multiple of 16, then it probably makes some sense.
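To check how many lines find actually produces, a sketch along these lines may help. It builds a hypothetical cache root with one-hex-digit top-level directories (0 to f); whether the real cache is laid out this way is an assumption, and the example path in the question uses two-character levels such as 5c, which would allow up to 256 directories per level.

```shell
#!/usr/bin/env bash
# Hypothetical cache root with one-hex-digit top-level directories (0..f).
cache_root=$(mktemp -d)
for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
    mkdir "$cache_root/$d"
done

# Same find options as in the function. -maxdepth 1 also lists the
# starting directory itself, so the count is 16 subdirectories + 1.
dir_count=$(find "$cache_root" -maxdepth 1 -type d | wc -l)
echo "directories listed by find: $((dir_count))"

rm -rf "$cache_root"
```

Because the starting directory is part of the output, one of the grep -R invocations would recurse over the whole tree on its own; adding -mindepth 1 to find is one way to leave it out, if that is not intended.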