
How to remove the first three characters from the FASTA file header


I have a fasta file like this:

>rna-XM_00001.1 
actact
>rna-XM_00002.1
atcatc

How do I remove the 'rna-' prefix so that it becomes:

>XM_00001.1 
actact
>XM_00002.1
atcatc

Solution

  • If what you're showing is the file contents, then sed should be able to do this:

    sed 's/^>rna-/>/' < inputfile > outputfile

    Explanation:

    • The first character of the sed command is s, which tells sed to perform a substitution
    • The / characters are delimiters separating the pattern from the replacement
    • The ^ anchors the match to the start of a line
    • The >rna- that follows is the pattern to match at the start of a line
    • The > after the second / is the replacement substituted for the pattern
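
As a quick check, the substitution can be tried on a small sample before running it on the real file. This is only a sketch, assuming a POSIX shell; sample.fasta and cleaned.fasta are hypothetical file names chosen for illustration.

```shell
# Build a small sample file (sample.fasta is a hypothetical name).
printf '>rna-XM_00001.1\nactact\n>rna-XM_00002.1\natcatc\n' > sample.fasta

# Strip the literal "rna-" prefix; only lines starting with ">rna-" change.
sed 's/^>rna-/>/' < sample.fasta > cleaned.fasta

# Print the result for inspection.
cat cleaned.fasta
```

The sequence lines do not start with >, so they pass through unchanged.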

    If, instead, you want to always remove the first four characters after a >, as long as the fourth one is a -, you could use:

    sed 's/^>...-/>/' < inputfile > outputfile

    Explanation:

    • This is similar to above, except the pattern to match at the start of a line is >...-. The pattern is a regexp, where a . matches any single character. So this pattern matches any line starting with >, followed by any three characters, followed by -.
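
To see how the wildcard version behaves, it may help to run it on headers with different prefixes. The sketch below assumes a POSIX shell; mixed.fasta and trimmed.fasta are hypothetical file names, and the cds- and gene- headers are made-up examples that are not part of the original question.

```shell
# Sample headers: two three-character prefixes and one four-character prefix.
# (mixed.fasta and the cds-/gene- headers are illustrative assumptions.)
printf '>rna-XM_00001.1\nactact\n>cds-XP_00002.1\natcatc\n>gene-G1\nggcc\n' > mixed.fasta

# Replace ">", any three characters, and "-" at the start of a line with ">".
sed 's/^>...-/>/' < mixed.fasta > trimmed.fasta

# Print the result for inspection.
cat trimmed.fasta
```

Here the rna- and cds- prefixes are both removed, while the >gene-G1 header is left as it is, because its fifth character is not a - and so the pattern does not match.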