I have a directory with several sub-directories with names
1
2
3
4
backup_1
backup_2
I wrote a parallelized bash script to process the files in these folders. A minimal working example follows:
#!/bin/bash
P=`pwd`
task(){
    dirname=$(basename $dir)
    echo $dirname running >> output.out
    if [[ $dirname != "backup"* ]]; then
        sed -i "s/$dirname running/$dirname is good/" $P/output.out
    else
        sed -i "s/$dirname running/$dirname ignored/" $P/output.out
    fi
}
for dir in */; do
    ((i=i%8)); ((i++==0)) && wait
    task "$dir" &
done
wait
echo all done
The "wait" at the end of the script is supposed to wait for all processes to finish before proceeding to echo "all done". After all processes have finished, the output.out file should show
1 is good
2 is good
3 is good
4 is good
backup_1 ignored
backup_2 ignored
I am able to get this output if I set the script to run in serial with ((i=i%1)); ((i++==0)) && wait. However, if I run it in parallel with ((i=i%2)); ((i++==0)) && wait, I get something like
2 is good
1 running
3 running
4 is good
backup_1 running
backup_2 ignored
Can anyone tell me why wait is not working in this case?
I also know that GNU parallel can parallelize tasks in the same way. However, I don't know how to tell parallel to run this task on all sub-directories of the parent directory. It would be great if someone could provide a sample script that I can follow.
Many thanks, Jacek
A literal port to GNU Parallel looks like this:
task(){
    dir="$1"
    P=$(pwd)
    dirname=$(basename "$dir")
    echo "$dirname running" >> output.out
    # Still racy: every job edits the shared output.out in place
    if [[ $dirname != "backup"* ]]; then
        sed -i "s/$dirname running/$dirname is good/" "$P/output.out"
    else
        sed -i "s/$dirname running/$dirname ignored/" "$P/output.out"
    fi
}
export -f task
parallel -j8 task ::: */
echo all done
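As others have pointed out, you have race conditions when you run sed on the same file in parallel. wait itself works as intended; the lost updates come from sed -i, which writes a modified copy of the file and renames it over output.out, so concurrent runs overwrite each other's changes.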
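To illustrate the principle in plain bash, without GNU Parallel, here is a minimal sketch in which every job writes only its own status file and the files are merged after the final wait. The scratch directories and the names workdir and statusdir are made up for the demo; they are not part of your script.

```shell
#!/bin/bash
# Sketch: each job writes only to its own status file, so no two jobs
# ever edit the same file. The files are merged after the final wait.
# workdir and statusdir are hypothetical scratch locations for the demo.
workdir=$(mktemp -d)
statusdir=$(mktemp -d)
mkdir -p "$workdir"/{1,2,3,4,backup_1,backup_2}

task() {
    local name
    name=$(basename "$1")
    if [[ $name != backup* ]]; then
        echo "$name is good" > "$statusdir/$name"
    else
        echo "$name ignored" > "$statusdir/$name"
    fi
}

i=0
for dir in "$workdir"/*/; do
    task "$dir" &
    # After every 2nd job, wait for the running batch to finish.
    if (( ++i % 2 == 0 )); then wait; fi
done
wait

# Merge the per-job files; the glob expands in sorted order.
cat "$statusdir"/* > "$workdir/output.out"
cat "$workdir/output.out"
```

Because no job touches a file that another job writes, the result does not depend on how many jobs run at once.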
To avoid race conditions with GNU Parallel you could do:
task(){
    dir="$1"
    dirname=$(basename "$dir")
    # "running" goes to stdout, the final status to stderr
    echo "$dirname running"
    if [[ $dirname != "backup"* ]]; then
        echo "$dirname is good" >&2
    else
        echo "$dirname ignored" >&2
    fi
}
export -f task
# stdout of all jobs ends up in running.out, stderr in done.out
parallel -j8 task ::: */ >running.out 2>done.out
echo all done
You will end up with two files running.out and done.out.
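The lines in done.out appear in the order the jobs finish, not in directory order. If you want the listing from your question, one option is to sort the file afterwards. A minimal sketch, assuming GNU sort (for -V) and using made-up sample content in a scratch directory:

```shell
#!/bin/bash
# Sketch: done.out holds one line per job, in completion order.
# The sample content below is made up to mimic a parallel run.
tmp=$(mktemp -d)
printf '%s\n' "2 is good" "backup_2 ignored" "1 is good" \
    "4 is good" "backup_1 ignored" "3 is good" > "$tmp/done.out"

# Version sort (GNU sort -V) compares embedded numbers numerically
# and puts the numbered directories before the backup_* ones.
sort -V "$tmp/done.out" > "$tmp/output.out"
cat "$tmp/output.out"
```

GNU Parallel's --keep-order (-k) option is another way to get the output in input order without a separate sort step.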
If you really just want to ignore the dirs called backup*:
task(){
    dir="$1"
    dirname=$(basename "$dir")
    echo "$dirname running"
    echo "$dirname is good" >&2
}
export -f task
# {= =} runs a Perl expression on each argument: skip the job if it matches /backup/
parallel -j8 task '{=/backup/ and skip()=}' ::: */ >running.out 2>done.out
echo all done
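If you would rather do the filtering in bash before anything is handed to parallel, a minimal sketch could look like this. The scratch directory demo and the array name dirs are made up for illustration. Note that this matches only names that start with backup, whereas /backup/ above matches the string anywhere in the argument.

```shell
#!/bin/bash
# Sketch: build the list of directories to process, leaving out backup*.
# The scratch directory is hypothetical and only set up for the demo.
demo=$(mktemp -d)
mkdir -p "$demo"/{1,2,3,4,backup_1,backup_2}
cd "$demo" || exit 1

dirs=()
for d in */; do
    [[ $d == backup* ]] && continue   # skip backup_1/, backup_2/, ...
    dirs+=("$d")
done

printf '%s\n' "${dirs[@]}"
```

The array can then be passed on with parallel -j8 task ::: "${dirs[@]}".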
Consider spending 20 minutes reading chapters 1+2 of https://doi.org/10.5281/zenodo.1146014. Your command line will love you for it.