How can I trim the headers in a FASTA file down to the species name, while keeping the sequence text, using sed?


The FASTA file is named STD_PRO_1.fasta.

It contains multiple records with headers like these:

>ENA|AB000176|AB000176.1 Escherichia coli DNA for mannosyl transferase, phosphoribosyl-ATP pyrophosphohydrolase:phosphoribosyl-AMP cyclohydrolase, partial cds.
GACCATATGATTGACGCCTATGTCAATCTCTACACTACATTGCTGGAAAGCAAATCCTGA
GAGATGCTACCCCCGCCGTTGCTGCGGGGGCCAACGCGTTAATGCCGATTCTTCAGATTA
TCAATCACTTCTCCGAGATCCAGCCCTTGATCCTGCAACAACACCAGCAGGTGATACATC
AAATCAGATGCCTCGTTGGTCAGTTCAAAGCGGTCATGTACTGTCGCTGCCAGTGCGGTT

>ENA|AB000178|AB000178.1 Escherichia coli DNA for mannosyl transferase, phosphoribosyl-ATP pyrophosphohydrolase:phosphoribosyl-AMP cyclohydrolase, partial cds.
GACCATATGATTGACGCCTATGTCAATCTCTACACTACATTGCTGGAAAGCAAATCCTGA
GAGATGCTACCCCCGCCGTTGCTGCGGGGGCCAATGCGTTAATGCCGATTCTTCAGATTA
TCAATCACTTCTCCGAGATCCAGCCCCTGATCCTGTAACAGCACCAGCAGGTGATACATC
AAATCAGATGCCTCGTTGGTCAGCTCAAAGCGGTCATGTACCGTTGGTGCCAGTGCGGTT

I want to keep only the species name in each header, as follows:

>Escherichia coli
GACCATATGATTGACGCCTATGTCAATCTCTACACTACATTGCTGGAAAGCAAATCCTGA
GAGATGCTACCCCCGCCGTTGCTGCGGGGGCCAACGCGTTAATGCCGATTCTTCAGATTA
TCAATCACTTCTCCGAGATCCAGCCCTTGATCCTGCAACAACACCAGCAGGTGATACATC
AAATCAGATGCCTCGTTGGTCAGTTCAAAGCGGTCATGTACTGTCGCTGCCAGTGCGGTT

>Escherichia coli
GACCATATGATTGACGCCTATGTCAATCTCTACACTACATTGCTGGAAAGCAAATCCTGA
GAGATGCTACCCCCGCCGTTGCTGCGGGGGCCAATGCGTTAATGCCGATTCTTCAGATTA
TCAATCACTTCTCCGAGATCCAGCCCCTGATCCTGTAACAGCACCAGCAGGTGATACATC
AAATCAGATGCCTCGTTGGTCAGCTCAAAGCGGTCATGTACCGTTGGTGCCAGTGCGGTT


Solution

  • Before:

    $ sed -n l test.fasta
    >ENA|AB000176|AB000176.1 Escherichia coli DNA for mannosyl transferase$
    GACCATATGATTGACGCCTATGTCAATCTCTACACTACATTGCTGGAAAGCAAATCCTGA GAGATGCTA$
    $
    >ENA|AB000178|AB000178.1 Escherichia coli DNA for mannosyl transferase$
    GACCATATGATTGACGCCTATGTCAATCTCTACACTACATTGCTGGAAAGCAAATCCTGA GAGATGCTA$
    

  • After:

    $ sed '/^>/{ s/[^ ]* />/; s/ DNA.*//; s/ gene.*//; }' test.fasta
    >Escherichia coli
    GACCATATGATTGACGCCTATGTCAATCTCTACACTACATTGCTGGAAAGCAAATCCTGA GAGATGCTA
    
    >Escherichia coli
    GACCATATGATTGACGCCTATGTCAATCTCTACACTACATTGCTGGAAAGCAAATCCTGA GAGATGCTA
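
The sed command in the After step only touches header lines (`/^>/`). Its first substitution replaces everything up to the first space (the `>ENA|...` ID field) with `>`, and the other two cut the header off at ` DNA` or ` gene`. That means it relies on one of those two words following the species name. As a rough alternative sketch, assuming instead that the species name is always exactly the two words right after the ID field, the header can be trimmed without depending on those keywords. The sample file below reuses the shortened headers from the Before step, `trimmed.fasta` is just an illustrative output name, and `-E` (extended regular expressions) is assumed to be supported by your sed, as it is in GNU and BSD sed:

```shell
# Build a small sample file with the shortened headers from the Before step.
cat > test.fasta <<'EOF'
>ENA|AB000176|AB000176.1 Escherichia coli DNA for mannosyl transferase
GACCATATGATTGACGCCTATGTCAATCTCTACACTACATTGCTGGAAAGCAAATCCTGA
>ENA|AB000178|AB000178.1 Escherichia coli DNA for mannosyl transferase
GACCATATGATTGACGCCTATGTCAATCTCTACACTACATTGCTGGAAAGCAAATCCTGA
EOF

# On header lines only: skip the ID field, capture the next two words,
# and discard the rest of the line. Sequence lines pass through unchanged.
sed -E '/^>/ s/^>[^ ]+ +([^ ]+ +[^ ]+).*/>\1/' test.fasta > trimmed.fasta

cat trimmed.fasta
```

Headers with a longer name (a subspecies or strain, for example) would be cut after the second word, so check a few headers of the real file before relying on this.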