Tags: loops, unix, concatenation

How to cat similarly named sequence files from different directories into a single large FASTA file


I am trying to get the following done. I have circa 40 directories, one per species, each with hundreds of sequence files that contain orthologous sequences. The sequence files have the same names in every species directory. I want to concatenate the identically named files from the 40 species directories into a single sequence file with that same name.

My data looks as follows, e.g.:

directories: Species1 Species2 Species3 
  Within directory (similar for all): sequenceA.fasta sequenceB.fasta sequenceC.fasta

I want to get single files named: sequenceA.fasta sequenceB.fasta sequenceC.fasta 
where the content of the different files from the different species is concatenated.

I tried to solve this with a loop (but this never ends well with me!):

ls . | while read FILE; do cat ./*/"$FILE" >> ./final/"$FILE"; done

This resulted in empty files and errors. I did try to find a solution elsewhere, e.g. (https://www.unix.com/unix-for-dummies-questions-and-answers/249952-cat-multiple-files-according-file-name.html, https://unix.stackexchange.com/questions/424204/how-to-combine-multiple-files-with-similar-names-in-different-folders-by-using-u), but I have been unable to adapt those answers to my case.

Could anyone give me some help here? Thanks!


Solution

  • In a root directory where your species directories reside, you should run the following:

    $ mkdir output
    $ find Species* -type f -name "*.fasta" -exec sh -c 'cat "$1" >> "output/$(basename "$1")"' sh {} \;
    

    It recurses through the species directories and appends each file to the file with the same basename in the output directory, so identically named files end up merged into one.

    EDIT: although this answer was accepted, the OP mentioned in a comment that the real directory names do not share a common pattern such as Species*, unlike the example in the question. In that case you can use this:

    $ find . -type f -not -path "./output/*" -name "*.fasta" -exec sh -c 'cat "$1" >> "output/$(basename "$1")"' sh {} \;
    

    This way, we don't specify a pattern for the directory names but instead explicitly exclude the output directory, so data that has already been merged is not appended again.
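For readers who prefer a plain loop, as attempted in the question, the same merge can be sketched without find. The loop in the question most likely failed because `ls .` lists the species directories rather than the sequence file names, so `./*/"$FILE"` matched nothing and only empty files were created. The function name `merge_fasta` and its two arguments are hypothetical, chosen for illustration, and the sketch assumes the .fasta files sit exactly one level below the root directory.

```shell
#!/bin/sh
# merge_fasta ROOT OUT  (hypothetical helper, not part of the original answer)
# Appends every ROOT/<dir>/<name>.fasta to OUT/<name>.fasta.
# Assumption: the .fasta files sit exactly one level below ROOT.
merge_fasta() {
    root=$1
    out=$2
    mkdir -p "$out"
    # Absolute path of the output directory, used to skip its own files.
    out_abs=$(cd "$out" && pwd)
    for f in "$root"/*/*.fasta; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -f "$f" ] || continue
        # Do not feed already merged files back into the output.
        dir_abs=$(cd "$(dirname "$f")" && pwd)
        [ "$dir_abs" = "$out_abs" ] && continue
        cat "$f" >> "$out/$(basename "$f")"
    done
}

# Example call from the directory that holds the species directories:
# merge_fasta . ./final
```

Because the files are appended with `>>`, running the function twice duplicates every sequence; remove the output directory before re-running. Unlike find, this loop does not descend into deeper sub-directories.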