I want to perform alignment using Hisat2 for single-ended thousands of samples and each sample distributed among different libraries.
I have modified this script (https://www.biostars.org/p/223404/#224169):
#!/bin/bash
for f in `ls data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/*.fastq.bz2_trimmed.fq.gz | sed 's/.fastq.bz2_trimmed.fq.gz//g' | sort -u`
do
echo HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U ${f}.fastq.bz2_trimmed.fq.gz -S ${f}.bam
done
It gives me echo as:
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460433.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460433.bam
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460434.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460434.bam
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460435.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460435.bam
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460436.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460436.bam
As this is the same sample (Alst-1_6989) is distributed among different lanes (SRR3460433,SRR3460434,SRR3460435,SRR3460436) which should be joined by comma not as a separate command as follows and I want the name of the sample (Alst-1_6989) in the output file (Alst-1_6989.bam), currently its name of the distributed library. Its just one example I have thousands of sample with a variable number of distributed library, so we need to keep this thing in mind.
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460433.fastq.bz2_trimmed.fq.gz,data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460434.fastq.bz2_trimmed.fq.gz,data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460435.fastq.bz2_trimmed.fq.gz,data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460436.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/Alst-1_6989.bam
I think some neseted for loop can work or something like this, Any help will be highly appreciated.
Full path filenames shouldn't be dups, so I dropped the sort
.
I'm going to assume a reasonable number of files per sample.
For that -
base="$PWD"
cmdtrunk="HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U "
shopt -s nullglob # return empty if not matched
for sample in data/1547_2015/Accessions/* # assumes no spaces, etc
do [[ -d "$base/$sample" ]] || continue # ignore files in this dir
lst=( $( find "$base/$sample/transcriptome/fastq/trim/" -name \*.fastq.bz2_trimmed.fq.gz -print0 |
while read -r -d '' f; do printf "%s\n" "$f"; done ) )
if (( ${#lst[@]} ))
then stack="$( printf "%s," "${lst[@]}" )"
printf " %s\n" "$cmdtrunk ${stack%,} -S $base/$sample/${sample##*/}.bam"
fi
done
I don't have anything like this structure, so haven't tested this as much as I'd like. Still, but all it does is print the commands, which you can save and inspect before executing.
Let me know what's broken and we'll fix it.