Tags: bash, loops, pipe, gnu-parallel

Parallelizing nested for loops with GNU Parallel


I am working in Bash. I have a series of nested for loops that search a FASTQ file for barcode sequences drawn from three lists of 96 barcodes each. My goal is to extract the reads for each unique combination of barcodes; there are 96 x 96 x 96 (884,736) possible combinations.

for barcode1 in "${ROUND1_BARCODES[@]}";
do
grep -B 1 -A 2 "$barcode1" "$FASTQ_R" > ROUND1_MATCH.fastq
echo "barcode1.is.$barcode1" >> outputLOG

    if [ -s ROUND1_MATCH.fastq ]
    then

        # Now we will look for the presence of ROUND2 barcodes in our reads containing barcodes from the previous step
        for barcode2 in "${ROUND2_BARCODES[@]}";
        do
        grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq

            if [ -s ROUND2_MATCH.fastq ]
            then

                # Now we will look for the presence of ROUND3 barcodes in our reads containing barcodes from the previous step 
                for barcode3 in "${ROUND3_BARCODES[@]}";
                do
                grep -B 1 -A 2 "$barcode3" ./ROUND2_MATCH.fastq | sed '/^--/d' > ROUND3_MATCH.fastq

                # If matches are found we will write them to an output .fastq file iteratively labelled with an ID number
                if [ -s ROUND3_MATCH.fastq ]
                then
                mv ROUND3_MATCH.fastq results/result.$count.2.fastq
                fi

                count=$((count + 1))
                done
            fi
        done
    fi
done

This code works, and I can successfully extract the sequences for each barcode combination. However, I think it could run faster on large files if the loop structure were parallelized. I know that GNU Parallel can do this, but I am struggling to nest the parallelizations. My attempt so far:

# Parallelize nested loops
now=$(date +"%T")
echo "Beginning STEP1.2: PARALLEL Demultiplex using barcodes. Current time : $now" >> outputLOG

mkdir ROUND1_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} SRR6750041_2_smalltest.fastq > ROUND1_PARALLEL_HITS/{#}_ROUND1_MATCH.fastq' ::: "${ROUND1_BARCODES[@]}"

mkdir ROUND2_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND1_PARALLEL_HITS/*.fastq > ROUND2_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND2_BARCODES[@]}"

mkdir ROUND3_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND2_PARALLEL_HITS/*.fastq > ROUND3_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND3_BARCODES[@]}"

mkdir parallel_results
parallel -j 6 'mv {} parallel_results/result_{#}.fastq' ::: ROUND3_PARALLEL_HITS/*.fastq

How can I successfully recreate the nested structure of the for loops using parallel?


Solution

  • Parallelize only the innermost loop (ROUND3); the two outer loops stay sequential:

    # Search ROUND2_MATCH.fastq for one ROUND3 barcode; defined and exported once, outside the loops
    doit() {
        grep -B 1 -A 2 "$1" ./ROUND2_MATCH.fastq | sed '/^--/d'
    }
    export -f doit

    for barcode1 in "${ROUND1_BARCODES[@]}"
    do
        grep -B 1 -A 2 "$barcode1" "$FASTQ_R" > ROUND1_MATCH.fastq
        echo "barcode1.is.$barcode1" >> outputLOG

        if [ -s ROUND1_MATCH.fastq ]
        then
            # Now we will look for the presence of ROUND2 barcodes in our reads containing barcodes from the previous step
            for barcode2 in "${ROUND2_BARCODES[@]}"
            do
                grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
                if [ -s ROUND2_MATCH.fastq ]
                then
                    # Now we will look for the presence of ROUND3 barcodes in our reads containing barcodes from the previous step
                    # The ROUND3 searches run in parallel, one output file per barcode combination
                    parallel -j0 doit {} '>' results/$barcode1-$barcode2-{} ::: "${ROUND3_BARCODES[@]}"
                    # TODO remove files with 0 length
                fi
            done
        fi
    done
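
  • The zero-length files mentioned in the TODO can be removed in one pass after the outer loop finishes. A minimal sketch, assuming a find that supports -delete (GNU and BSD find both do); the function name remove_empty_results is hypothetical and not part of the original answer:

```bash
# Hypothetical helper: delete empty output files left by barcode
# combinations that matched no reads.
# Assumption: outputs live under the directory given as $1 (default: results).
remove_empty_results() {
    local dir="${1:-results}"
    # -type f : regular files only
    # -size 0 : files that contain no data
    # -delete : remove each match (GNU/BSD find extension, not POSIX)
    find "$dir" -type f -size 0 -delete
}
```

    Calling remove_empty_results results once at the end avoids spawning a cleanup process inside every loop iteration.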