Search code examples
for-loopnested-loops

combine commands using loop


I want to perform alignment using Hisat2 for single-ended thousands of samples and each sample distributed among different libraries.

I have modified this script (https://www.biostars.org/p/223404/#224169):

#!/bin/bash
for f in `ls data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/*.fastq.bz2_trimmed.fq.gz | sed 's/.fastq.bz2_trimmed.fq.gz//g' | sort -u`
do
echo HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U ${f}.fastq.bz2_trimmed.fq.gz -S ${f}.bam
done

It gives me echo as:

HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460433.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460433.bam
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460434.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460434.bam
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460435.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460435.bam
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460436.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460436.bam

As this is the same sample (Alst-1_6989) is distributed among different lanes (SRR3460433,SRR3460434,SRR3460435,SRR3460436) which should be joined by comma not as a separate command as follows and I want the name of the sample (Alst-1_6989) in the output file (Alst-1_6989.bam), currently its name of the distributed library. Its just one example I have thousands of sample with a variable number of distributed library, so we need to keep this thing in mind.

HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460433.fastq.bz2_trimmed.fq.gz,data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460434.fastq.bz2_trimmed.fq.gz,data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460435.fastq.bz2_trimmed.fq.gz,data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460436.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/Alst-1_6989.bam

I think some neseted for loop can work or something like this, Any help will be highly appreciated.


Solution

  • Don't parse ls.

    Full path filenames shouldn't be dups, so I dropped the sort.
    I'm going to assume a reasonable number of files per sample.
    For that -

    base="$PWD"
    cmdtrunk="HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U "
    shopt -s nullglob                         # return empty if not matched
    for sample in data/1547_2015/Accessions/* # assumes no spaces, etc
    do [[ -d "$base/$sample" ]] || continue   # ignore files in this dir
       lst=( $( find "$base/$sample/transcriptome/fastq/trim/" -name \*.fastq.bz2_trimmed.fq.gz -print0 |
           while read -r -d '' f; do printf "%s\n" "$f"; done ) ) 
       if (( ${#lst[@]} ))
       then stack="$( printf "%s," "${lst[@]}" )"
            printf " %s\n" "$cmdtrunk ${stack%,} -S $base/$sample/${sample##*/}.bam"
       fi
    done
    

    I don't have anything like this structure, so haven't tested this as much as I'd like. Still, but all it does is print the commands, which you can save and inspect before executing.

    Let me know what's broken and we'll fix it.