Tags: bash, unix, merge, cat, fastq

Merging files in folder with same file name except one character


I have filenames like the following:

fastqs/hgmm_100_S1_L001_R1_001.fastq.gz
fastqs/hgmm_100_S1_L002_R1_001.fastq.gz
fastqs/hgmm_100_S1_L003_R1_001.fastq.gz

fastqs/hgmm_100_S1_L001_R2_001.fastq.gz
fastqs/hgmm_100_S1_L002_R2_001.fastq.gz
fastqs/hgmm_100_S1_L003_R2_001.fastq.gz

I want to merge them into the two groups shown above, so that files differing only in the LXXX field are concatenated.

I can do it like the following:

cat fastqs/hgmm_100_S1_L00?_R1_001.fastq.gz > data/hgmm_100_S1_R1_001.fastq.gz
cat fastqs/hgmm_100_S1_L00?_R2_001.fastq.gz > data/hgmm_100_S1_R2_001.fastq.gz

But this requires me to hard-code each file group. How can I set it up so that it merges all of the L values within a group and writes an output file named like the input files, just without the L field?

Thanks, Jack

EDIT:

Sorry for not including this in the original post, but what if I had something like:

fastqs/hgmm_100_S1_L001_R1_001.fastq.gz
fastqs/hgmm_100_S1_L002_R1_001.fastq.gz
fastqs/hgmm_100_S1_L003_R1_001.fastq.gz

fastqs/hgmm_200_S1_L001_R2_001.fastq.gz
fastqs/hgmm_200_S1_L002_R2_001.fastq.gz
fastqs/hgmm_200_S1_L003_R2_001.fastq.gz

(Only change is the very beginning (100 -> 200))

How would this work? Essentially, I want to merge these files as long as all parts of the name except for L??? are identical.


Solution

  • If the pattern _L###_ exists only in that one part of the filename, you might try something like this:

    #!/usr/bin/env bash
    
    # Define an associative array. Requires bash 4+
    declare -A a
    
    # Use extended glob notation (see "Pattern Matching" in the bash man page).
    shopt -s extglob
    
    # Collect the file patterns by writing indexes in the array.
    for f in fastqs/*_L+([0-9])_*.fastq.gz; do
      a["${f/_L+([0-9])_/_*_}"]=1
    done
    
    # And finally, gather your files.
    for f in "${!a[@]}"; do
      # Strip any existing directory part of the filename to build our target
      target="data/${f##*/}"
      # Concatenate files matching the glob into our intended target.
      # $f is deliberately left unquoted so that the shell expands the glob it holds.
      cat $f > "${target/[*]_/}"
    done
    
    • We use Pattern Substitution to convert the variable part of each filespec into a glob.
    • We use the index of an associative array because it makes it easy to keep a unique list.
    • ${!a[@]} lets us step through the array's indices (keys) rather than its values.
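
A possible variation on the script above, offered as a rough sketch rather than a drop-in replacement: instead of storing a glob as the array key and expanding it later with an unquoted variable, it appends each file to its target as the loop meets it, so paths containing spaces should also work. The function name `merge_lanes` and its two arguments (source and destination directory) are hypothetical and not part of the original answer. It still assumes bash 4+ and that `_L###_` occurs only once in each filename.

```shell
#!/usr/bin/env bash
# Sketch only: merge_lanes, src_dir and dest_dir are hypothetical names.

# Must be enabled before the function is parsed, because the body uses +([0-9]).
shopt -s extglob

merge_lanes() {
  local src_dir=$1 dest_dir=$2
  local f name target
  local -A seen=()            # associative array used as a set of targets (bash 4+)

  mkdir -p "$dest_dir" || return 1

  # Glob results are sorted, so lanes are appended in order L001, L002, ...
  for f in "$src_dir"/*_L+([0-9])_*.fastq.gz; do
    [[ -e $f ]] || continue   # nothing matched: the pattern stays literal, skip it

    name=${f##*/}                             # strip the directory part
    target=$dest_dir/${name/_L+([0-9])_/_}    # drop the lane field from the name

    # First file of a group: start from an empty target file.
    if [[ -z ${seen[$target]+x} ]]; then
      : > "$target"
      seen[$target]=1
    fi

    cat -- "$f" >> "$target"
  done
}
```

With the layout from the question, this would be called as `merge_lanes fastqs data`. Files such as hgmm_100_..._R1 and hgmm_200_..._R2 end up in separate targets because everything except the lane field is kept in the output name.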