How to concatenate fasta files with identical names into one file with different headers?


I already know how to concatenate a bunch of FASTA files into one file; my problem is how to rename the header line of each sequence. After generating my files, every file has the exact same header (the name of the gene that was analyzed). So I want to combine the sequences, but instead of keeping the same header, I want to use each file's name as its header.

For example, I have two FASTA files, the first being:

Homo_sapien_XYZ_20102.fa

And inside this file the sequence is:

>gene_X
ACTGAGGCCAATGAA...

Then a second file called:

Homo_sapein_ABC_20102.fa

>gene_X
CCCTGAGTAGAT...

When I concatenate these files I end up with one new file that has different sequences but identical headers (and due to the nature of the scripts I use to generate these individual sequences I cannot change the header name prior to this step).

>gene_X
ACTGAGGCCAATGAA...
>gene_X
CCCTGAGTAGAT...

This will be problematic so I was hoping to rewrite that header using the filename so it ends up being:

>Homo_sapien_XYZ_20102
ACTGAGGCCAATGAA...
>Homo_sapein_ABC_20102
CCCTGAGTAGAT...

Anyone know how to do this? The line of code I used to create one file of sequences is simply:

#!/bin/bash

for files in *_20102.fa
do
    cat "${files}" >> geneA_consensus.fa
done

Solution

  • This works with my test set.

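    for file in *.fasta
    do
       # New header: the filename in place of the original header line
       echo ">$file" >> out.fasta
       # Copy everything after the first (header) line; quoted so odd filenames work
       tail -n +2 "$file" >> out.fasta
       echo >> out.fasta
    done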
    

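    This simple version includes the filename extension in the header. Note that out.fasta itself matches *.fasta, so delete it (or write to a name outside the glob) before re-running the loop.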

    That last echo ensures the next header appears on its own line, even if the prior FASTA file did not end in a newline.
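
To get headers without the extension, as the question asks, the same loop can strip it with bash parameter expansion. The following is a minimal sketch, not a definitive implementation: the function name merge_fasta and the scratch-directory demo are hypothetical, the file names are taken from the question, and it assumes each input file holds exactly one sequence whose header is the first line.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): merge FASTA files,
# using each filename minus its extension as the header.
# Assumes one sequence per input file, with the header on line 1.
merge_fasta() {
    local out=$1
    shift
    : > "$out"                                  # start from an empty output file
    local file name
    for file in "$@"; do
        name=${file##*/}                        # drop any leading directory
        printf '>%s\n' "${name%.*}" >> "$out"   # drop the final extension
        tail -n +2 "$file" >> "$out"            # copy everything after the old header
        echo >> "$out"                          # guard against a missing final newline
    done
}

# Demo in a scratch directory with two files shaped like the question's.
demo=$(mktemp -d)
printf '>gene_X\nACTGAGGCCAATGAA\n' > "$demo/Homo_sapien_XYZ_20102.fa"
printf '>gene_X\nCCCTGAGTAGAT' > "$demo/Homo_sapein_ABC_20102.fa"   # no trailing newline
merge_fasta "$demo/geneA_consensus.fa" "$demo"/*_20102.fa
cat "$demo/geneA_consensus.fa"
```

Because the output name geneA_consensus.fa does not match the *_20102.fa glob, re-running the merge never feeds the output back in as an input. As in the loop above, the extra echo leaves a blank line after any file that already ends in a newline; if that matters downstream, replacing the tail and echo pair with awk 'FNR > 1' "$file" should copy the same lines without the blank line.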