I am working in Bash. I have a series of nested for loops that iteratively search for barcode sequences drawn from three lists of 96 barcodes each. My goal is to find each unique combination of barcodes; there are 96 x 96 x 96 (884,736) possible combinations.
for barcode1 in "${ROUND1_BARCODES[@]}"
do
    grep -B 1 -A 2 "$barcode1" "$FASTQ_R" > ROUND1_MATCH.fastq
    echo "barcode1.is.$barcode1" >> outputLOG
    if [ -s ROUND1_MATCH.fastq ]
    then
        # Now we will look for the presence of ROUND2 barcodes in our reads containing barcodes from the previous step
        for barcode2 in "${ROUND2_BARCODES[@]}"
        do
            grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
            if [ -s ROUND2_MATCH.fastq ]
            then
                # Now we will look for the presence of ROUND3 barcodes in our reads containing barcodes from the previous step
                for barcode3 in "${ROUND3_BARCODES[@]}"
                do
                    grep -B 1 -A 2 "$barcode3" ./ROUND2_MATCH.fastq | sed '/^--/d' > ROUND3_MATCH.fastq
                    # If matches are found, write them to an output .fastq file iteratively labelled with an ID number
                    if [ -s ROUND3_MATCH.fastq ]
                    then
                        mv ROUND3_MATCH.fastq "results/result.$count.2.fastq"
                    fi
                    count=$((count + 1))
                done
            fi
        done
    fi
done
This code works, and I am able to extract the sequences for each barcode combination. However, I think it could be made faster on large files by parallelizing this loop structure. I know that GNU parallel can do this, but I am struggling to nest the parallelized steps.
# Parallelize nested loops
now=$(date +"%T")
echo "Beginning STEP1.2: PARALLEL Demultiplex using barcodes. Current time : $now" >> outputLOG
mkdir ROUND1_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} SRR6750041_2_smalltest.fastq > ROUND1_PARALLEL_HITS/{#}_ROUND1_MATCH.fastq' ::: "${ROUND1_BARCODES[@]}"
mkdir ROUND2_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND1_PARALLEL_HITS/*.fastq > ROUND2_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND2_BARCODES[@]}"
mkdir ROUND3_PARALLEL_HITS
parallel -j 6 'grep -B 1 -A 2 -h {} ROUND2_PARALLEL_HITS/*.fastq > ROUND3_PARALLEL_HITS/{#}_{/.}.fastq' ::: "${ROUND3_BARCODES[@]}"
mkdir parallel_results
parallel -j 6 'mv {} parallel_results/result_{#}.fastq' ::: ROUND3_PARALLEL_HITS/*.fastq
How can I successfully recreate the nested structure of the for loops using parallel?
Parallelized only the inner loop:
# Define the worker once, outside the loops; it reads ROUND2_MATCH.fastq from the current directory
doit() {
    grep -B 1 -A 2 "$1" ./ROUND2_MATCH.fastq | sed '/^--/d'
}
export -f doit

for barcode1 in "${ROUND1_BARCODES[@]}"
do
    grep -B 1 -A 2 "$barcode1" "$FASTQ_R" > ROUND1_MATCH.fastq
    echo "barcode1.is.$barcode1" >> outputLOG
    if [ -s ROUND1_MATCH.fastq ]
    then
        # Now we will look for the presence of ROUND2 barcodes in our reads containing barcodes from the previous step
        for barcode2 in "${ROUND2_BARCODES[@]}"
        do
            grep -B 1 -A 2 "$barcode2" ROUND1_MATCH.fastq > ROUND2_MATCH.fastq
            if [ -s ROUND2_MATCH.fastq ]
            then
                # Now we will look for the presence of ROUND3 barcodes in our reads containing barcodes from the previous step,
                # running one parallel job per ROUND3 barcode
                parallel -j0 doit {} '>' "results/$barcode1-$barcode2-{}" ::: "${ROUND3_BARCODES[@]}"
            fi
        done
    fi
done
# Remove result files with 0 length (combinations with no matching reads)
find results -type f -size 0 -exec rm -f {} +
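The version above parallelizes only round 3, so rounds 1 and 2 still run one after another. A possible variation, sketched here under assumptions rather than offered as the definitive approach, moves the parallelism to the outer loop: each job handles one round 1 barcode and runs the round 2 and round 3 loops serially inside a private temporary directory, so concurrent jobs never overwrite each other's intermediate files. The names demux_round1 and run_demux, the barcode list files, and the output naming are hypothetical. The round 2 and round 3 barcodes are read from files (one barcode per line) because Bash arrays cannot be exported to the child shells that parallel starts. Paths are assumed to contain no spaces.

```shell
#!/usr/bin/env bash
# Sketch only: demux_round1, run_demux and the barcode list files are hypothetical names.

# Handle ONE round-1 barcode: run the round-2 and round-3 loops serially,
# using a private temporary directory so parallel jobs never share files.
# Arguments: fastq, round-2 list file, round-3 list file, output dir, round-1 barcode
demux_round1() {
    local fastq=$1 r2_list=$2 r3_list=$3 out_dir=$4 b1=$5
    local tmp b2 b3
    tmp=$(mktemp -d) || return 1

    grep -B 1 -A 2 -- "$b1" "$fastq" > "$tmp/r1.fastq"
    if [ -s "$tmp/r1.fastq" ]; then
        while IFS= read -r b2 || [ -n "$b2" ]; do
            [ -n "$b2" ] || continue          # skip blank lines in the list
            grep -B 1 -A 2 -- "$b2" "$tmp/r1.fastq" > "$tmp/r2.fastq"
            [ -s "$tmp/r2.fastq" ] || continue
            while IFS= read -r b3 || [ -n "$b3" ]; do
                [ -n "$b3" ] || continue
                # Drop grep's "--" group separators from the final output
                grep -B 1 -A 2 -- "$b3" "$tmp/r2.fastq" | sed '/^--$/d' > "$tmp/r3.fastq"
                # Keep only non-empty results, named by barcode combination
                if [ -s "$tmp/r3.fastq" ]; then
                    mv "$tmp/r3.fastq" "$out_dir/$b1-$b2-$b3.fastq"
                fi
            done < "$r3_list"
        done < "$r2_list"
    fi
    rm -rf "$tmp"
}
export -f demux_round1

# Run one demux_round1 job per round-1 barcode with GNU parallel.
# Arguments: fastq, round-2 list file, round-3 list file, output dir, then the round-1 barcodes
run_demux() {
    local fastq=$1 r2_list=$2 r3_list=$3 out_dir=$4
    shift 4
    mkdir -p "$out_dir"
    parallel -j 6 demux_round1 "$fastq" "$r2_list" "$r3_list" "$out_dir" {} ::: "$@"
}
```

Assuming the arrays from the question, the list files could be written with printf '%s\n' "${ROUND2_BARCODES[@]}" > round2.txt (and likewise for round 3), and the run started with run_demux "$FASTQ_R" round2.txt round3.txt results "${ROUND1_BARCODES[@]}". Because empty results are never moved into the output directory, no cleanup of zero-length files should be needed in this variant.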