
Appending text to specific patterns in a FASTA file (Bash)


I have a fasta with headers like this:

tr|Q7MX99|Q7MX99_PORGI_BACT

I would like them to say:

tr|Q7MX99|Q7MX99_PORGI_BACT_ORALMICROBIOME

So basically, whenever I have PORGI_BACT I want to append _ORALMICROBIOME to each instance.

I'm sure there is an easy fix through the terminal, but I can't seem to find it.

My first idea is to do something like:

sed 's/>.*/&_ORALMICROBIOME/' file.fa > outfile.fa

BUT I only want to add to specific header endings, and that is where I'm stuck.


Solution

  • You are close. Would you please try the following:

    sed 's/^>.*PORGI_BACT/&_ORALMICROBIOME/' file.fa > outfile.fa
    

    [Edit]
    To cover several header endings at once, as the OP requested, how about using alternation with extended regular expressions (-E):

    sed -E 's/^>.*(PORGI_BACT|HUMAN_MAM|TESTA_BACT)/&_ORALMICROBIOME/' file.fa > outfile.fa
    

    Sample input as file.fa:

    >SEQ0|tr|Q7MX99|Q7MX99_PORGI_BACT
    FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
    >SEQ1|tr|Q7MX88|Q7MX88_HUMAN_MAM
    KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
    LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
    >SEQ2|tr|Q7MX77|Q7MX77_TESTA_BACT
    EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
    >SEQ3|tr|Q7MX66|Q7MX66_DUMMY
    MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
    

    Output:

    >SEQ0|tr|Q7MX99|Q7MX99_PORGI_BACT_ORALMICROBIOME
    FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
    >SEQ1|tr|Q7MX88|Q7MX88_HUMAN_MAM_ORALMICROBIOME
    KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
    LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
    >SEQ2|tr|Q7MX77|Q7MX77_TESTA_BACT_ORALMICROBIOME
    EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
    >SEQ3|tr|Q7MX66|Q7MX66_DUMMY
    MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
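
One caveat with the commands above: they append the suffix after the last PORGI_BACT (or other tag) found on a header line, even if more text follows it. If the suffix should be added only when the tag is the very end of the header, anchoring the pattern with $ is one option. The following is a minimal sketch under that assumption; the file name demo.fa, the function name append_tag, and the shortened sequences are made up for illustration and are not from the original post.

```shell
#!/usr/bin/env bash
# Hypothetical demo input; demo.fa and its contents are invented for this sketch.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.fa" <<'EOF'
>SEQ0|tr|Q7MX99|Q7MX99_PORGI_BACT
FQTWEEF
>SEQ1|tr|Q7MX88|Q7MX88_PORGI_BACT extra description
KYRTWEEF
>SEQ3|tr|Q7MX66|Q7MX66_DUMMY
MYQVWEEF
EOF

# append_tag is a hypothetical helper name.
# /^>/ limits the substitution to header lines;
# the trailing $ makes the tag match only at the end of the line.
append_tag() {
    sed -E '/^>/ s/(PORGI_BACT|HUMAN_MAM|TESTA_BACT)$/&_ORALMICROBIOME/' "$1"
}

append_tag "$tmpdir/demo.fa"
```

With this input, only the SEQ0 header gains the suffix; the SEQ1 header is left alone because "extra description" follows the tag. If the headers never carry trailing text, the unanchored commands in the answer behave the same way.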