Search code examples
pythonbashfor-loopsample-data

How do I write a for-loop so a program reiterates itself for a set of 94 DNA samples?


I have written some code in a bash shell (so I can submit it to my university's supercomputer) to edit out contaminant sequences from a batch of DNA extracts I have. Essentially what this code does is take the sequences from the negative extraction blank I did (A1-BLANK) and subtract it from all of the other samples.

I have figured out how to get this to work with individual samples, but I'm attempting to write a for loop so that the little chunks of code will reiterate themselves for each sample, with the outcome of this file being a .sam file with a unique name for each sample where both the forward and reverse reads for the sample are merged and edited for contamination.I have checked stack overflow extensively for help with this specific problem, but haven't been able to apply related answered questions to my code.

Here's an example of part what I'm trying to do for an individual sample, named F10-61C-3-V4_S78_L001_R1_001.fastq:

bowtie2 -q --end-to-end --very-sensitive \ ##bowtie2 is a program that examines sequence similarity compared to a standard
-N 0 -L 31 --time --reorder \
-x A1-BlankIndex \ ##This line compares the sample to the negative extraction blank
-1  /file directory/F10-61C-3-V4_S78_L001_R1_001.fastq 
-2 /file directory/F10-61C-3-V4_S78_L001_R2_001.fastq \ ##These two lines above merge the forward and reverse reads of the DNA sequences within the individual files into one file
-S 61C-3.sam ##This line renames the merged and edited file and transforms it into a .sam file

Here's what I've got so far for this little step of the process:


for file in /file directory/*.fastq

do

bowtie2 -q --end-to-end --very-sensitive \
-N 0 -L 31 --time --reorder \
-x A1-BlankIndex \
-1  /file directory/*.fastq 
-2 /file directory/*.fastq \
-S *.sam

done

In my resulting slurm file, the error I'm getting right now has to do with the -S command. I'm not sure how to give each merged and edited sample a unique name for the .sam file. I'm new to writing for loops in python (my only experience is in R) and I'm sure it's a simple fix, but I haven't been able to find any specific answers to this question.


Solution

  • This may work as your example but automatically select the output file name referrence with a RegEx:

    #!/usr/bin/env bash
    
    input_samples='/input_samples_directory'
    output_samples='/output_merged_samples_directory'
    
    while IFS= read -r -d '' R1_fastq; do
      # Deduce R2 sample from R1 sample file name
      R2_fastq="${R1_fastq/_R1_/_R2_}"
      # RegEx match capture group in () for the output sample reference
      [[ $R1_fastq =~ [^-]+-([[:digit:]]+[[:alpha:]]-[[:digit:]]).* ]]
      # Construct the output sample file path with the captured referrenced
      # from the RegEx above
      sam="$output_samples/${BASH_REMATCH[1]}.sam"
      # Perform the merging
      bowtie2 -q --end-to-end --very-sensitive \
        -N 0 -L 31 --time --reorder \
        -x A1-BlankIndex \
        -1 "$R1_fastq" \
        -2 "$R2_fastq" \
        -S "$sam"
    done < <(find "$input_samples" -maxdepth 1 -type -f -name '*_R1_*.fastq' -print0)