bioinformatics, fasta, sequence-alignment

Removing text from a fasta gene name between two characters


I have a large codon alignment that has a variety of gene names in the headers. The headers are in the following format:

>ENST00000357033.DMD.-1 | CODON | REFERENC

I want to modify all of the headers in the fasta to remove everything from the first "." up to (but not including) the space before the first "|". Desired outcome:

>ENST00000357033 | CODON | REFERENC

I've tried a few sed commands, no dice. Any advice? I'm averse to using awk, since I'd like to keep the formatting of the alignment and awk scares me.

Thank you!


Solution

  • sed '/^>/s/\.[^ ]* / /'
    

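    For each line starting with '>', this replaces the first '.' and any run of non-space characters after it, up to and including the next space, with a single space. Lines that do not start with '>' (the sequence lines) are left untouched.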
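
The command above reads standard input, so a quick way to check it is to run it on a small test file first. The sketch below is only an illustration: the file names sample.fasta and cleaned.fasta and the sequence line are made up for the example and are not from the question.

```shell
# Build a tiny test file (hypothetical name; the sequence line is invented).
printf '%s\n' \
  '>ENST00000357033.DMD.-1 | CODON | REFERENC' \
  'ATGCTT---TGG' > sample.fasta

# Edit header lines only and write the result to a new file,
# leaving the original alignment untouched.
sed '/^>/s/\.[^ ]* / /' sample.fasta > cleaned.fasta

# Inspect the result.
cat cleaned.fasta
# >ENST00000357033 | CODON | REFERENC
# ATGCTT---TGG
```

Writing to a new file keeps the original alignment as a backup. Because of the `/^>/` address, any "." characters inside the sequence lines are not touched, so the alignment formatting is preserved.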