Change fasta header based on regex pattern


I have FASTA files with headers in two patterns, like this:

>256_Org1 
MAVVIIKDAADDSLARRD

>Org2_10005 
DSLARRDMAVVIIKDAA

I want to keep only the words and remove the numbers. I tried the awk one-liners that were suggested, but splitting on the delimiter '_' and then using {print $1} gives 256 (wrong) or Org2 (right), depending on the header pattern. The output I expect is:

>Org1 
MAVVIIKDAADDSLARRD

>Org2 
DSLARRDMAVVIIKDAA

In TextWrangler, I can do the replacement in two steps: first \>\d+\_ to >, then \_\d+\n to \n. But I have several hundred files and would like to use a one-liner. Any suggestions?


Solution

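  • With GNU sed, using two substitutions (the first removes a number and underscore right after >, the second removes an underscore and number at the end of the line, along with any trailing spaces):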

    sed -E 's/^>[0-9]+_/>/; s/_[0-9]+ *$//' file
    

    Output:

    >Org1 
    MAVVIIKDAADDSLARRD
    
    >Org2
    DSLARRDMAVVIIKDAA