I'm trying to limit the number of processes a parallelized script forks. The script loops over a list inside each of 10 sample folders and uses each record of the list to run a samtools command, which is the most demanding part.
This is the simplified version:
for (10 items)
do
while read (list 5000 items)
do
command 1
command 2
command 3
...
samtools view -L input1 input2 |many_pipes_including_'awk' > output_file &
### TODO (WARNING): currently all processes are forked at the same time. this needs to be resolved. limit to a certain number of processes.
done
done
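(For reference, one way I know to cap the forks without extra tools is plain bash job control: before forking another pipeline, wait until the number of running background jobs drops below a limit. A minimal sketch, where process_one and the item list are stand-ins for the real samtools pipeline and loop:)

```shell
#!/usr/bin/env bash
# Cap the number of concurrent background jobs with bash job control.
# 'process_one' is a stand-in for the samtools pipeline.
MAX_JOBS=4
TALLY=$(mktemp)

process_one() {
  # placeholder for: samtools view -L "$1" input.bam | ... > output
  sleep 0.1
  echo "$1" >> "$TALLY"
}

for item in a b c d e f g h; do
  # if MAX_JOBS jobs are already running, wait for one to finish
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
    wait -n   # needs bash >= 4.3
  done
  process_one "$item" &
done
wait   # wait for the remaining jobs
COMPLETED=$(wc -l < "$TALLY")
echo "completed $COMPLETED jobs"
rm -f "$TALLY"
```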
To make use of our local server, the script forks the samtools pipeline into the background, which works. But it keeps forking until all of the server's resources are used and no one else can work on it. Therefore I wanted to impose a limit like parallel -j 50 with GNU parallel. I tried putting it in front of the to-be-forked samtools command, like this:
parallel -j 50 -k samtools view -L input1 input2 |many_pipes_including_'awk' > output_file &
which did not work (I also tried it with backticks). I got either
[main_samview] region "item_from_list" specifies an unknown reference name. Continue anyway.
or somehow even vim was invoked. I'm also not sure whether this is the right position for the parallel command within the script. Do you have any idea how to resolve this, so that the number of forked processes is limited?
I also thought of implementing the FIFO-based semaphore mentioned here: https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop/103921, but I hoped GNU parallel could do what I'm looking for. I checked more pages, such as https://zvfak.blogspot.de/2012/02/samtools-in-parallel.html and https://davetang.org/muse/2013/11/18/using-gnu-parallel/, but none of them covers this combination of issues.
Here is a more detailed version of the script, in case any of the commands in it are relevant (I heard the awk backticks and newlines in general might be a problem?):
cd path_to_data
for SAMPLE_FOLDER in *
do
cd ${SAMPLE_FOLDER}/another_folder
echo "$SAMPLE_FOLDER was found"
cat list_with_products.txt | while read PRODUCT_NAME_NO_SPACES
do
PRODUCT_NAME=$(echo ${PRODUCT_NAME_NO_SPACES} | tr "@" " ")
echo "$PRODUCT_NAME with white spaces"
BED_FILENAME=${BED_DIR}/intersect_${PRODUCT_NAME_NO_SPACES}_${SAMPLE_FOLDER}.bed
grep "$PRODUCT_NAME" file_to_search_through > ${TMP_DIR}/tmp.gff
cat ${TMP_DIR}/tmp.gff | some 'awk' command > ${BED_FILENAME}
samtools view -L ${BED_FILENAME} another_input_file.bam | many | pipes | with | 'awk' | and | perl | etc > resultfolder/resultfile &
### TODO (WARNING): currently all processes are forked at the same time. this needs to be resolved. limit to a certain number of processes.
rm ${TMP_DIR}/tmp.gff
done
cd (back_to_start)
done
rmdir -p ${OUTPUT_DIR}/tmp
Start by making a function that processes a single sample and a single product name:
cd path_to_data
# Process a single sample and a single product name
doit() {
SAMPLE_FOLDER="$1"
PRODUCT_NAME_NO_SPACES="$2"
SEQ="$3"
# Make sure temporary files are named uniquely
# so any parallel job will not overwrite these.
GFF=${TMP_DIR}/tmp.gff-$SEQ
cd ${SAMPLE_FOLDER}/another_folder
echo "$SAMPLE_FOLDER was found"
PRODUCT_NAME=$(echo ${PRODUCT_NAME_NO_SPACES} | tr "@" " ")
echo "$PRODUCT_NAME with white spaces"
BED_FILENAME=${BED_DIR}/intersect_${PRODUCT_NAME_NO_SPACES}_${SAMPLE_FOLDER}.bed
grep "$PRODUCT_NAME" file_to_search_through > $GFF
cat $GFF | some 'awk' command > ${BED_FILENAME}
samtools view -L ${BED_FILENAME} another_input_file.bam | many | pipes | with | 'awk' | and | perl | etc
rm $GFF
# Note: every parallel job runs this; it only succeeds once the
# directory is empty. It may be cleaner to run it once, after
# parallel has finished.
rmdir -p ${OUTPUT_DIR}/tmp
}
export -f doit
# These variables are defined outside the function and must be exported to be visible
export BED_DIR
export TMP_DIR
export OUTPUT_DIR
# If there are many of these variables, use env_parallel instead of
# parallel. Then you do not need to export the variables.
If list_with_products.txt is the same for every sample:
parallel --results outputdir/ doit {1} {2} {#} ::: * :::: path/to/list_with_products.txt
If list_with_products.txt is different for each sample:
# Generate a list of:
# sample \t product
parallel --tag cd {}\;cat list_with_products.txt ::: * |
# call doit on each sample,product. Put output in outputdir
parallel --results outputdir/ --colsep '\t' doit {1} {2} {#}