How to call a large list of paired files to be executed by a program in BASH?


I have a large directory of files (100+) that I'd like to pass through a program via the terminal.

The files are paired and all follow a naming scheme like such:

 TS-8_S53_L001_R1_001.fastq 
 TS-8_S53_L001_R2_001.fastq
 RS-9_S54_L001_R1_001.fastq 
 RS-9_S54_L001_R2_001.fastq

And the program execution looks like:

Seqprogram -i1 Blah_R1_001.fastq -i2 Blah_R2_001.fastq -o Blah_paired.fastq

All of these files are in one directory.

I'd like to be able to run the program on all of the files, pairing them correctly (the R1 file is passed through -i1 and the R2 file through -i2; the R1 and R2 files of a pair share the same base name), with the output file (-o) saved under the base name plus some identifier ("_paired", etc.).

I can envision how I'd do this in Python; however, I am trying to get better with BASH.

I'm familiar with how one might pass multiple files to a single command with a wildcard; e.g., uncompressing all .gz files in a particular directory:

gunzip *.gz

But this command has two inputs, and the inputs must be ordered, so the wildcard scheme isn't sufficient.

Thanks


Solution

  • Use a wildcard to match one file of each pair (the R1 file), then use parameter substitution to derive the matching R2 filename and the output filename.

    for i1 in *_R1_001.fastq; do
        i2=${i1/R1_001/R2_001}
        paired=${i1/R1_001/paired}
        Seqprogram -i1 "$i1" -i2 "$i2" -o "$paired"
    done
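
If some R1 files might be missing their R2 partner, a slightly more defensive variation can help. The following is only a sketch: `list_pair_commands` is a hypothetical helper name, and `Seqprogram` with its `-i1`/`-i2`/`-o` flags is taken as-is from the question. It strips the `_R1_001.fastq` suffix with `${var%suffix}` (which only matches at the end of the name), skips incomplete pairs, and prints each command instead of running it, so the pairing can be checked first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one Seqprogram command per complete R1/R2 pair.
# Seqprogram and its flags come from the question; nothing is executed here.
list_pair_commands() {
    local dir=${1:-.} r1 r2 base
    for r1 in "$dir"/*_R1_001.fastq; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        base=${r1%_R1_001.fastq}          # strip the suffix from the end of the name
        r2=${base}_R2_001.fastq
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: no matching $r2" >&2
            continue
        fi
        # Dry run: replace "echo" with nothing to actually run the program.
        echo Seqprogram -i1 "$r1" -i2 "$r2" -o "${base}_paired.fastq"
    done
}
```

Calling `list_pair_commands .` in the directory holding the files prints the planned commands; once they look right, dropping the `echo` runs them for real.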