I have a Bash script with a loop, inside of which a command calls another Bash script, which in turn calls Python scripts.
The commands inside the loop can be run independently of each other. On an actual dataset, each command takes some time to execute, so I would like to parallelize this part of the script.
I spent a few days going over options for parallel execution in Bash that also let me choose the number of cores to use, so that I won't flood the server. Of the options I looked at, GNU xargs -P seemed the most reasonable, since it does not require a specific Bash version and works without installing extra libraries. However, I am having difficulty making it work, even though it seems straightforward.
#!/bin/bash
while getopts i:t: option
do
    case "${option}"
    in
        i) in_f=${OPTARG};;
        t) n_threads=${OPTARG};;
    esac
done
START=$(date +%s)
class_file=$in_f
classes=( $(awk '{print $1}' ./$class_file))
rm -r tree_matches.txt
n="${#classes[@]}"
for i in $(seq 0 $n);
do
    for j in $(seq $((i+1)) $((n-1)));
    do
        echo ${classes[i]}" "${classes[j]} >> tree_matches.txt
    done
done
col1=( $(awk '{print $1}' ./tree_matches.txt ))
col2=( $(awk '{print $2}' ./tree_matches.txt ))
printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}
n_pairs="${#col1[@]}"
END=$(date +%s)
DIFF=$(( $END - $START ))
echo "Exec time $DIFF seconds"
You can ignore the two nested loops at the start; I pasted the entire script for completeness. The part I want to parallelize is the fifth line from the end of the script:
printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}
This should loop over all pairs, 1275 in total in my case, and ideally execute myFunction.sh in parallel with the number of threads given by the variable $n_threads.
However, I am doing something wrong, because the iterator k in that line is not indexing my two arrays, ${classes[k]} and ${classes[k]}.
The loop keeps iterating 1275 times but it only indexes the first element of both arrays when I echo them. I later changed that line to this for troubleshooting:
printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads echo "index" k
It is actually incrementing the value of k each time it loops; however, when I change that line to this:
printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads echo "index" "$((k))"
it prints 0 as the value of k, 1275 times. I don't know what I'm doing wrong.
I actually have two vectors of the same size that are the inputs to the myFunction.sh script. I just want an integer index so that I can index both at the same time and call my function with the two values at that index. I modified my code as follows based on your suggestion:
for x in {0..10};
do
    printf "%d\0" "$x"
done | xargs -0 -I @@ -P $n_threads sh markerGenes2TreeMatch.sh -1 ${col1[@@]} -2 ${col2[@@]}
however now when I execute the code I get the following error:
@@: syntax error: operand expected (error token is "@@")
I guess the index @@ is still a string. I just want integer indices to be generated as I loop, so that I can execute this command in parallel.
For the line in question:
printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}
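${classes[k]} will be expanded by the shell before xargs has a chance to see it. Bash evaluates an array subscript as an arithmetic expression, and since k is not a shell variable it evaluates to 0, so every call receives the first element of the array. The same applies to $((k)), which is why it always prints 0, and it is also why ${col1[@@]} fails: @@ is not a valid arithmetic expression.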
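The expansion order can be checked with a small experiment. This is only a sketch, assuming bash and an xargs that supports -0 and -I; the array contents are made up for illustration.

```shell
#!/bin/bash
# Made-up stand-in for the real class list.
classes=(alpha beta gamma)
unset k   # make sure no shell variable named k exists

# Bash evaluates the subscript arithmetically; an unset name counts as 0.
first="${classes[k]}"
arith="$((k))"
echo "subscript k  -> $first"
echo "arithmetic k -> $arith"

# The shell expands ${classes[k]} before xargs starts, so xargs is handed a
# fixed word and has no placeholder left to substitute.
received=$(printf '%s\0' 0 1 2 | xargs -0 -I k echo "${classes[k]}" | sort -u)
echo "xargs printed only: $received"
```

All three inputs produce the same line, because the command xargs runs was already fixed before it read any input.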
Perhaps you could reorder to:
for x in {0..1275}; do printf "%s\0" "${classes[$x]}"; done |\
xargs -0 -I @@ -P $n_threads sh myFunction.sh -1 @@ -2 @@
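If the two arguments need to differ (one value from each column of tree_matches.txt), one possible approach is to skip the index entirely and let xargs read the pairs file itself, two words per call. The sketch below assumes class names contain no whitespace or quote characters. worker.sh is a hypothetical stand-in for myFunction.sh that only echoes what it receives, and the three pairs are made-up sample data.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Hypothetical stand-in for myFunction.sh: called as "-1 A -2 B",
# so the two class names arrive as $2 and $4.
cat > "$workdir/worker.sh" <<'EOF'
echo "pair $2 $4"
EOF
export worker="$workdir/worker.sh"

# Made-up sample of the pairs file, one pair per line.
printf '%s\n' "A B" "A C" "B C" > "$workdir/tree_matches.txt"

n_threads=2
# -n 2 hands two words (one line of the file) to each call; inside sh -c
# they are available as $1 and $2, and "_" fills the $0 slot.
result=$(xargs -n 2 -P "$n_threads" \
    sh -c 'sh "$worker" -1 "$1" -2 "$2"' _ \
    < "$workdir/tree_matches.txt" | sort)
echo "$result"

rm -r "$workdir"
```

With the real script, the inner command would presumably become sh myFunction.sh -1 "$1" -2 "$2". The output order of parallel jobs is not guaranteed, which is why the sketch sorts it.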