Tags: bash, parallel-processing, wait, gnu-parallel

bash wait for all processes to finish (doesn't work)


I have a directory with several sub-directories named:

1
2
3
4
backup_1
backup_2

I wrote a parallelized bash script to process files in these folders. A minimal working example is as follows:

#!/bin/bash
P=`pwd`
task(){
    dirname=$(basename $dir)
    echo $dirname running >> output.out
    if [[ $dirname != "backup"* ]]; then
        sed -i "s/$dirname running/$dirname is good/" $P/output.out
    else
        sed -i "s/$dirname running/$dirname ignored/" $P/output.out
    fi
}

for dir in */; do
    ((i=i%8)); ((i++==0)) && wait
    task "$dir" &
done
wait
echo all done

The "wait" at the end of the script is supposed to wait for all processes to finish before proceeding to echo "all done". After all processes have finished, the output.out file should show

1 is good
2 is good
3 is good
4 is good
backup_1 ignored
backup_2 ignored

I am able to get this output if I set the script to run in serial with ((i=i%1)); ((i++==0)) && wait. However, if I run it in parallel with ((i=i%2)); ((i++==0)) && wait, I get something like

2 is good
1 running
3 running
4 is good
backup_1 running
backup_2 ignored

Can anyone tell me why wait is not working in this case?

I also know that GNU parallel can parallelize tasks in the same way. However, I don't know how to tell parallel to run this task on all sub-directories of the parent directory. It would be great if someone could provide a sample script that I can follow.

Many thanks, Jacek


Solution

  • A literal port to GNU Parallel looks like this:

    task(){
        dir="$1"
        P=$(pwd)
        dirname=$(basename "$dir")
        # Mark the directory as running, then rewrite that line with the result.
        echo "$dirname running" >> "$P/output.out"
        if [[ $dirname != "backup"* ]]; then
            sed -i "s/$dirname running/$dirname is good/" "$P/output.out"
        else
            sed -i "s/$dirname running/$dirname ignored/" "$P/output.out"
        fi
    }
    export -f task
    
    parallel -j8 task ::: */
    echo all done
    

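    As others point out, wait is not the problem: it does wait for every job. The problem is a race condition, because several jobs run sed -i on the same file at the same time. sed -i writes a modified copy and renames it over output.out, so a job that finishes later can overwrite another job's change, and a line appended by echo >> in the meantime can be lost along with the old copy.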
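The replace-by-rename behaviour can be checked directly. The sketch below is an illustration only and assumes GNU sed; the scratch directory from mktemp and the file in it are made up for the demo. It compares the file's inode number before and after a single sed -i:

```shell
#!/bin/bash
# Illustration only: show that `sed -i` replaces the file instead of editing it in place.
# Assumes GNU sed; the scratch directory and file are hypothetical demo data.
tmp=$(mktemp -d)
f="$tmp/output.out"
echo "1 running" > "$f"

inode_before=$(ls -i "$f" | awk '{print $1}')
sed -i "s/1 running/1 is good/" "$f"
inode_after=$(ls -i "$f" | awk '{print $1}')

echo "before: $inode_before  after: $inode_after"
if [ "$inode_before" != "$inode_after" ]; then
    echo "sed -i wrote a new file and renamed it over the old one"
fi
```

When two jobs do this at once, each starts from the same old copy, and whichever rename happens last wins.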

    To avoid race conditions you could do:

    task(){
        dir="$1"
        dirname=$(basename "$dir")
        # "running" goes to stdout, the final status goes to stderr.
        echo "$dirname running"
        if [[ $dirname != "backup"* ]]; then
            echo "$dirname is good" >&2
        else
            echo "$dirname ignored" >&2
        fi
    }
    export -f task
    
    parallel -j8 task ::: */ >running.out 2>done.out
    echo all done
    

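    You will end up with two files, running.out and done.out. By default GNU Parallel buffers each job's output and prints it when the job finishes, so lines from different jobs are not mixed together.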
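The same idea works in plain bash if GNU Parallel is not available: give every background job its own file and merge the files after wait, so no two processes ever write to the same file. This is a minimal sketch under that assumption, not a drop-in replacement. The scratch directory, the status directory, and the batch size of 4 are made up for the demo, and the intermediate "running" line is left out:

```shell
#!/bin/bash
# Sketch only: each job writes its own status file; results are merged after all jobs finish.
work=$(mktemp -d)                 # hypothetical scratch directory
cd "$work" || exit 1
mkdir 1 2 3 4 backup_1 backup_2 status

task() {
    name=$(basename "$1")
    case $name in
        backup*) echo "$name ignored" ;;
        *)       echo "$name is good" ;;
    esac > "status/$name"         # one file per job, so writes never collide
}

i=0
for dir in */; do
    [ "$dir" = "status/" ] && continue
    task "$dir" &
    i=$((i + 1))
    # Let at most 4 jobs run at once: pause after every 4th launch.
    [ $((i % 4)) -eq 0 ] && wait
done
wait                              # wait for the remaining jobs

cat status/* > output.out         # the glob expands in sorted order
cat output.out
```

Because the merge happens only after the final wait, the content of output.out no longer depends on which job finishes first.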

    If you really just want to ignore the dirs called backup*:

    task(){
        dir="$1"
        dirname=$(basename "$dir")
        # backup* dirs never reach this function; they are skipped on the command line.
        echo "$dirname running"
        echo "$dirname is good" >&2
    }
    export -f task
    
    parallel -j8 task '{=/backup/ and skip()=}' ::: */ >running.out 2>done.out
    echo all done
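If the {= ... =} syntax is unfamiliar, the directory list can also be filtered in bash before any job is started. A minimal sketch, assuming a scratch directory with the same sub-directory names as in the question (the scratch directory and the dirs array are made up for the demo):

```shell
#!/bin/bash
# Sketch only: drop backup* directories from the list before any job is started.
work=$(mktemp -d)                 # hypothetical scratch directory
cd "$work" || exit 1
mkdir 1 2 3 4 backup_1 backup_2

dirs=()
for d in */; do
    case $d in
        backup*) ;;               # skip: never handed to a job
        *)       dirs+=("$d") ;;
    esac
done

printf '%s\n' "${dirs[@]}"
# The filtered list can then be passed on, e.g.: parallel -j8 task ::: "${dirs[@]}"
```

This keeps the skip rule in ordinary shell code, at the cost of one extra loop.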
    

    Consider spending 20 minutes reading chapters 1 and 2 of https://doi.org/10.5281/zenodo.1146014. Your command line will love you for it.