bash, loops, fork, gnu-parallel, samtools

Limit forked samtools processes with GNU parallel within a for and while loop including awk


I'm trying to limit the number of processes a parallelized script forks. The aim of the script is to read a list inside each of 10 samples/folders and use the records of that list to run a samtools command, which is the most demanding part.

This is the simple version:

for (10 items)
do
  while read (list 5000 items)
  do
    command 1
    command 2
    command 3
    ...
    samtools view -L input1 input2 |many_pipes_including_'awk' > output_file &
    ### TODO (WARNING): currently all processes are forked at the same time. this needs to be resolved. limit to a certain number of processes.
  done
done

To make use of our local server, the script backgrounds each samtools call with &, which works. But with 10 samples times roughly 5000 records it keeps forking until all of the server's resources are used and no one else can work on it.
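For what it's worth, the forking could probably be bounded in plain bash as well; a rough sketch, assuming bash >= 4.3 for wait -n and reusing the placeholder names from above (list_with_5000_items is also a placeholder):

# Sketch only: allow at most 50 concurrent background jobs.
while read ITEM
do
    # If 50 jobs are already running, block until any one of them exits.
    while (( $(jobs -rp | wc -l) >= 50 )); do
        wait -n
    done
    samtools view -L input1 input2 | many_pipes_including_'awk' > output_file &
done < list_with_5000_items
wait   # let the last batch of forked jobs finish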

Therefore I wanted to implement something like parallel -j 50 with GNU parallel. I tried it in front of the samtools command to be forked, such as

parallel -j 50 -k samtools view -L input1 input2 |many_pipes_including_'awk' > output_file &

This did not work (I also tried it with backticks); I got either

[main_samview] region "item_from_list" specifies an unknown reference name. Continue anyway.

or somehow even vim was invoked. But I'm also not sure whether this is the right position for the parallel command within the script. Do you have any idea how to resolve this, so that the number of forked processes is limited?
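As far as I can tell, the shell splits that line before parallel ever sees it: everything after the first | filters parallel's own output, and the trailing & backgrounds the parallel process itself. Since the command template contains no replacement string, parallel also appends each record to the end of the samtools call, which would explain the "unknown reference name" message. Presumably the whole pipeline has to be quoted so parallel receives it as a single command, roughly like this sketch (placeholder names again):

# Sketch: quote the entire pipeline so parallel runs it as one command.
# {} is replaced by one record from the list, {#} by the job number
# (used here to keep the output files unique).
parallel -j 50 -k "samtools view -L {} input2 | many_pipes_including_awk > output_file.{#}" :::: list_with_5000_items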

I also thought of implementing something like the FIFO-based semaphore mentioned at https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop/103921, but I hoped GNU parallel could do what I'm looking for. I checked more pages, such as https://zvfak.blogspot.de/2012/02/samtools-in-parallel.html and https://davetang.org/muse/2013/11/18/using-gnu-parallel/, but they usually do not cover this combination of issues.
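GNU parallel does in fact ship a semaphore mode, sem, which looks close to that FIFO approach; a sketch of how it might slot into the existing loops (again with placeholder names, and in practice output_file would have to be unique per record):

# Sketch: sem queues each command and lets at most 50 run at once.
for SAMPLE_FOLDER in *          # the 10 samples/folders
do
    while read ITEM
    do
        # --id names the semaphore so all jobs count against the same limit
        sem -j 50 --id samview "samtools view -L input1 input2 | many_pipes_including_awk > output_file"
    done < list_with_5000_items
done
sem --wait --id samview   # block until every queued job has finished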

This is a more detailed version of the script, in case any of the commands in it are relevant (I heard the awk backticks and newlines in general might be a problem?):

cd path_to_data
for SAMPLE_FOLDER in *
do
    cd ${SAMPLE_FOLDER}/another_folder
    echo "$SAMPLE_FOLDER was found"

    cat list_with_products.txt | while read PRODUCT_NAME_NO_SPACES
    do
        PRODUCT_NAME=`echo ${PRODUCT_NAME_NO_SPACES} | tr "@" " "`
        echo "$PRODUCT_NAME with white spaces"
        BED_FILENAME=${BED_DIR}/intersect_${PRODUCT_NAME_NO_SPACES}_${SAMPLE_FOLDER}.bed
        grep "$PRODUCT_NAME" file_to_search_through > ${TMP_DIR}/tmp.gff

        cat ${TMP_DIR}/tmp.gff | some 'awk' command > ${BED_FILENAME}

        samtools view -L ${BED_FILENAME} another_input_file.bam | many | pipes | with | 'awk' | and | perl | etc > resultfolder/resultfile &
        ### TODO (WARNING): currently all processes are forked at the same time. this needs to be resolved. limit to a certain number of processes.
        rm ${TMP_DIR}/tmp.gff
    done
    cd (back_to_start)
done
rmdir -p ${OUTPUT_DIR}/tmp

Solution

  • Start by making a function that takes a single sample plus a single product as input:

    cd path_to_data
    # Process a single sample and a single product name
    doit() {
        SAMPLE_FOLDER="$1"
        PRODUCT_NAME_NO_SPACES="$2"
        SEQ="$3"
        # Make sure temporary files are named uniquely,
        # so parallel jobs will not overwrite each other's files.
        GFF=${TMP_DIR}/tmp.gff-$SEQ
        cd ${SAMPLE_FOLDER}/another_folder
        echo "$SAMPLE_FOLDER was found"

        PRODUCT_NAME=`echo ${PRODUCT_NAME_NO_SPACES} | tr "@" " "`
        echo "$PRODUCT_NAME with white spaces"
        BED_FILENAME=${BED_DIR}/intersect_${PRODUCT_NAME_NO_SPACES}_${SAMPLE_FOLDER}.bed
        grep "$PRODUCT_NAME" file_to_search_through > $GFF

        cat $GFF | some 'awk' command > ${BED_FILENAME}

        samtools view -L ${BED_FILENAME} another_input_file.bam | many | pipes | with | 'awk' | and | perl | etc
        rm $GFF
        rmdir -p ${OUTPUT_DIR}/tmp
    }
    export -f doit
    # These variables are defined outside the function and must be
    # exported to be visible inside it:
    export BED_DIR
    export TMP_DIR
    export OUTPUT_DIR
    # If there are many such variables, use env_parallel instead of
    # parallel; then you do not need to export them (see the sketch
    # after the first invocation below).
    

    If list_with_products.txt is the same for every sample:

    parallel --results outputdir/ doit {1} {2} {#} ::: * :::: path/to/list_with_products.txt
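
    Here {1} and {2} are replaced with one value from each of the two input sources (::: takes its arguments from the command line, :::: reads them from a file), and {#} is the job sequence number that keeps the temporary files unique; --results outputdir/ stores each job's stdout and stderr under outputdir/.

    If many variables have to be visible inside doit, an env_parallel variant might look like this sketch (env_parallel ships with newer releases of GNU parallel and forwards shell functions and variables without any export):

    . "$(which env_parallel.bash)"   # activate env_parallel for bash
    # doit, BED_DIR, TMP_DIR and OUTPUT_DIR now travel along automatically
    env_parallel --results outputdir/ doit {1} {2} {#} ::: * :::: path/to/list_with_products.txt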
    

    If list_with_products.txt is different for each sample:

    # Generate a list of:
    # sample \t product
    parallel --tag cd {}\;cat list_with_products.txt ::: * |
      # call doit on each sample,product. Put output in outputdir
      parallel --results outputdir/ --colsep '\t' doit {1} {2} {#}
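
    Neither invocation above caps the job count explicitly; by default GNU parallel runs one job per CPU core. To limit it to 50 concurrent jobs as originally intended, add -j 50, e.g.:

    parallel -j 50 --results outputdir/ --colsep '\t' doit {1} {2} {#}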