Bash script to concatenate text files with specific substrings in filenames


Within a certain directory I have many directories containing a bunch of text files. I’m trying to write a script that concatenates only those files in each directory that have the string ‘R1’ in their filename into one file within that specific directory, and those that have ‘R2’ into another. This is what I wrote, but it’s not working.

#!/bin/bash

for f in */*.fastq; do

    if grep 'R1' $f ; then
        cat "$f" >> R1.fastq
    fi

    if grep 'R2' $f ; then
        cat "$f" >> R2.fastq
    fi

done

I get no errors and the files are created as intended but they are empty files. Can anyone tell me what I’m doing wrong?

Thank you all for the fast and detailed responses! I think I wasn't very clear in my question, but I need the script to only concatenate the files within each specific directory so that each directory has a new file (R1 and R2). I tried doing

cat /*R1*.fastq >*/R1.fastq 

but it gave me an ambiguous redirect error. I also tried Charles Duffy's for loop, but looping through the directories and using a nested loop to run through each file within a directory, like so

for f in */; do
   for d in "$f"/*.fastq;do
     case "$d" in
       *R1*) cat "$d" >&3
       *R2*) cat "$d" >&4
     esac
   done 3>R1.fastq 4>R2.fastq
done

but it was giving an unexpected token error regarding ')'.

Sorry in advance if I'm missing something elementary, I'm still very new to bash.


Solution

  • A Note To The Reader

    Please review the edit history on the question when considering this answer; several parts have been made less relevant by question edits.

    One cat Per Output File

    For the purpose at hand, you can probably just let shell globbing do all the work (if R1 or R2 will be in the filenames, as opposed to the directory names):

    set -x # log what's happening!
    cat */*R1*.fastq >R1.fastq
    cat */*R2*.fastq >R2.fastq
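
One caveat with the glob approach: if nothing matches, the shell by default passes the unexpanded pattern to cat, which then fails with "No such file or directory". A minimal sketch that checks for a match first is below. The temporary directory and the sample file names are made up for the demonstration; only the glob pattern comes from the answer above.

```shell
# Hypothetical demo layout: two sample directories under a temp dir.
demo=$(mktemp -d)
mkdir "$demo/sampleA" "$demo/sampleB"
printf 'a1\n' >"$demo/sampleA/x_R1_001.fastq"
printf 'a2\n' >"$demo/sampleA/x_R2_001.fastq"
printf 'b1\n' >"$demo/sampleB/y_R1_001.fastq"

(
  cd "$demo" || exit 1
  # Load the glob matches into the positional parameters.
  set -- */*R1*.fastq
  # An unmatched pattern stays literal, so check that the first entry exists.
  if [ -e "$1" ]; then
    cat "$@" >R1.fastq
  else
    echo 'no R1 files matched' >&2
  fi
)
```

The output lands in the top-level directory, so it can never match the `*/*R1*.fastq` pattern that selects the inputs.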
    

    One find Per Output File

    If it's a really large number of files, by contrast, you might need find:

    find . -mindepth 2 -maxdepth 2 -type f -name '*R1*.fastq' -exec cat '{}' + >R1.fastq
    find . -mindepth 2 -maxdepth 2 -type f -name '*R2*.fastq' -exec cat '{}' + >R2.fastq
    

    This is because of the OS-dependent limit on command-line length: the find command given above puts as many arguments onto each cat command as possible for efficiency, but still splits them across multiple invocations where the limit would otherwise be exceeded.
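
To see the batching for yourself, the sketch below prints the system limit and then substitutes a small sh one-liner for cat so that each invocation reports how many file arguments it received. The demo directory and file names are invented; `getconf ARG_MAX` is the standard way to query the limit, though its availability on minimal systems is an assumption here.

```shell
# Hypothetical demo layout with three matching files.
demo=$(mktemp -d)
mkdir "$demo/s1" "$demo/s2"
touch "$demo/s1/a_R1.fastq" "$demo/s1/b_R1.fastq" "$demo/s2/c_R1.fastq"

# Limit (in bytes) on the arguments plus environment of a new process.
arg_max=$(getconf ARG_MAX 2>/dev/null || echo unknown)
echo "ARG_MAX: $arg_max"

# Stand-in for cat: each invocation prints its argument count ($#).
# One line of output per batch that find had to start.
batches=$(find "$demo" -mindepth 2 -maxdepth 2 -type f -name '*R1*.fastq' \
  -exec sh -c 'echo "$#"' _ {} +)
echo "arguments per invocation: $batches"
```

With only a handful of files everything fits into one invocation; multiple lines appear only once the argument list approaches the limit.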


    Iterate-And-Test

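    If you really do want to iterate over everything, and then test the names, consider a case statement for the job. It matches against the filename itself (grep 'R1' "$f" searches the contents of the file instead), and it is much more efficient than starting a grep process for each check: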

    for f in */*.fastq; do
      case $f in
        *R1*) cat "$f" >&3 ;;
        *R2*) cat "$f" >&4 ;;
      esac
    done 3>R1.fastq 4>R2.fastq
    

    Note the use of file descriptors 3 and 4 to write to R1.fastq and R2.fastq respectively -- that way we're only opening the output files once (and thus truncating them exactly once) when the for loop starts, and reusing those file descriptors rather than re-opening the output files at the beginning of each cat. (That said, running cat once per file -- which find -exec {} + avoids -- is probably more overhead on balance).
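
The difference between redirecting inside the loop and redirecting on the loop as a whole can be shown with a small sketch. The file names below are invented for the demonstration; descriptor 3 is used the same way as in the loop above.

```shell
# Hypothetical demo inputs.
demo=$(mktemp -d)
printf 'one\n' >"$demo/a_R1.fastq"
printf 'two\n' >"$demo/b_R1.fastq"

# Redirecting inside the loop reopens and truncates the output on every
# iteration, so only the last file's content survives.
for f in "$demo"/*_R1.fastq; do
  cat "$f" >"$demo/truncated.out"
done

# Redirecting on `done` opens the output once; every cat writes to the
# same open descriptor, so the contents accumulate.
for f in "$demo"/*_R1.fastq; do
  cat "$f" >&3
done 3>"$demo/shared.out"

cat "$demo/truncated.out"
cat "$demo/shared.out"
```

Using `>>` inside the loop would also accumulate, but it would not clear output left over from a previous run.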


    Operating Per-Directory

    All of the above can be updated to work on a per-directory basis quite trivially. For example:

    for d in */; do
      find "$d" -name R1.fastq -prune -o -name '*R1*.fastq' -exec cat '{}' + >"$d/R1.fastq"
      find "$d" -name R2.fastq -prune -o -name '*R2*.fastq' -exec cat '{}' + >"$d/R2.fastq"
    done
    

    There are only two significant changes:

    • We're no longer specifying -mindepth, which previously ensured that our input files only came from subdirectories.
    • We're excluding R1.fastq and R2.fastq from our input files, so we never try to use the same file as both input and output. This is a consequence of the prior change: Previously, our output files couldn't be considered as input because they didn't meet the minimum depth.
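
The iterate-and-test loop can be adapted per directory in the same spirit. The following is a sketch rather than a definitive version: the demo directory and file names are invented, and it assumes the output files should be called R1.fastq and R2.fastq inside each directory. Each case branch must end with `;;`, the match is made against the file name only (so an R1 in a directory name does not count), and the output files are skipped explicitly because the redirection creates them before the glob is expanded.

```shell
# Hypothetical demo layout.
demo=$(mktemp -d)
mkdir "$demo/s1" "$demo/s2"
printf 'a\n' >"$demo/s1/x_R1_001.fastq"
printf 'b\n' >"$demo/s1/x_R1_002.fastq"
printf 'c\n' >"$demo/s1/x_R2_001.fastq"
printf 'd\n' >"$demo/s2/y_R1_001.fastq"

(
  cd "$demo" || exit 1
  for d in */; do                      # $d keeps its trailing slash
    for f in "$d"*.fastq; do
      case ${f##*/} in                 # strip the directory part before matching
        R1.fastq|R2.fastq) ;;          # never read our own output files
        *R1*) cat "$f" >&3 ;;
        *R2*) cat "$f" >&4 ;;
      esac
    done 3>"${d}R1.fastq" 4>"${d}R2.fastq"
  done
)
```

A directory with no R2 inputs still gets an empty R2.fastq, since the redirection creates the file whether or not anything is written to it.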