bash parallel-processing xargs gnu-parallel

Parallel execution of Bash commands


I have a Bash script containing a loop. Inside the loop, a command calls another Bash script, which in turn calls Python scripts.

Each of these commands inside the loop can run independently of the others. On an actual dataset, each command takes some time to execute, so I would like to parallelize this part of the script.

I spent a few days going over the options for parallel execution in Bash that also let me choose how many cores to use, so that I won't flood the server. Of the options I looked at, including GNU parallel, xargs -P seemed the most reasonable to me, since it does not require a specific Bash version and works without installing extra libraries. However, I am having difficulty making it work, even though it seems straightforward.

#!/bin/bash

while getopts i:t: option
do
case "${option}"
in
    i) in_f=${OPTARG};;
    t) n_threads=${OPTARG};;
esac
done    

START=$(date +%s)
class_file=$in_f
classes=( $(awk '{print $1}' ./$class_file))
rm -r tree_matches.txt
n="${#classes[@]}"
for i in $(seq 0  $n);
   do
     for j in $(seq $((i+1)) $((n-1)));
         do
            echo ${classes[i]}"    "${classes[j]} >> tree_matches.txt
         done
   done
col1=( $(awk '{print $1}' ./tree_matches.txt ))
col2=( $(awk '{print $2}' ./tree_matches.txt ))


printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}

n_pairs="${#col1[@]}"

END=$(date +%s)
DIFF=$(( $END - $START ))
echo "Exec time $DIFF seconds"

You can ignore the two nested loops at the start; I pasted the entire script only for completeness. The part I want to parallelize is the xargs line near the end of the script:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}

This should loop over all pairs (1275 in total in my case) and ideally execute myFunction.sh in parallel, with the number of parallel processes given by the variable $n_threads.

However, I am doing something wrong because the iterator k in that line is not indexing my two arrays ${classes[k]} and ${classes[k]}.

The loop keeps iterating 1275 times but it only indexes the first element of both arrays when I echo them. I later changed that line to this for troubleshooting:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads echo "index" k

It is actually incrementing the value of k each time it loops, however when I change that line to this:

printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads echo "index" "$((k))"

it prints 0 as the value of k every one of the 1275 times. I don't know what I'm doing wrong.

I actually have two arrays of the same size that are the inputs to the myFunction.sh script. I just want an integer index so that I can index both at the same time and call my function with the two values at that index. I modified my code as follows based on your suggestion:

 for x in {0..10};
    do
        printf "%d\0" "$x"; done| xargs -0 -I @@ -P $n_threads sh markerGenes2TreeMatch.sh -1 ${col1[@@]}-2 ${col2[@@]}

however now when I execute the code I get the following error:

@@: syntax error: operand expected (error token is "@@")

I guess the index @@ is still a string. I just want integer indices to be generated as I loop, so that I can execute this command in parallel.


Solution

  • For the line in question:

    printf "%s\0" {0..1275} | xargs -0 -I k -P $n_threads sh myFunction.sh -1 ${classes[k]} -2 ${classes[k]}
    

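    The shell expands ${classes[k]} before xargs ever sees it. Because the subscript of an indexed array is evaluated arithmetically and k is an unset shell variable, it evaluates to 0, so every invocation receives ${classes[0]}; "$((k))" prints 0 for the same reason.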

    Perhaps you could reorder to:

    for x in {0..1275}; do printf "%s\0" "${classes[$x]}"; done |\
    xargs -0 -I @@ -P $n_threads sh myFunction.sh -1 @@ -2 @@
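
For the follow-up with two arrays, the same idea can be extended: ${col1[@@]} fails because the shell tries to evaluate @@ as an arithmetic subscript before xargs runs, so the values themselves have to travel through the pipe rather than an index. Below is a minimal sketch, assuming col1 and col2 have the same length. The sample arrays, the value of n_threads, and the sh -c 'echo ...' stand-in for myFunction.sh are all placeholders for illustration, not part of the original script. Each pair is written as two consecutive NUL-terminated items, and -n 2 makes xargs pass exactly one pair to each invocation.

```shell
#!/bin/bash
# Hypothetical stand-ins for the col1/col2 arrays read from tree_matches.txt.
col1=(A A B)
col2=(B C C)
n_threads=2   # assumed value; the question takes it from the -t option

results=$(
    # Emit both values of each pair as consecutive NUL-terminated items.
    for i in "${!col1[@]}"; do
        printf '%s\0%s\0' "${col1[i]}" "${col2[i]}"
    done |
    # -n 2: two items per invocation; the "_" fills $0 so the pair lands in $1 and $2.
    # The echo is a placeholder for the real worker script.
    xargs -0 -n 2 -P "$n_threads" sh -c 'echo "pair: $1 $2"' _ |
    sort   # parallel jobs finish in no fixed order, so sort for stable output
)
printf '%s\n' "$results"
```

To call the real script, the placeholder command would presumably become something like sh -c 'sh myFunction.sh -1 "$1" -2 "$2"' _ in place of the echo. The output order of parallel jobs is not guaranteed, which is why the sketch sorts the collected lines.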