unix fasta

Why is my regex not working to remove a section of a FASTA header?


I want to remove everything between the ">" and "Un_" in a header such as:

>NW_017859640.1 Esox lucius isolate CL-BC-CA-002 unplaced genomic scaffold, Eluc_V3 Un_scaffold1210

I've tried several regex variations, but nothing that contains "*" seems to work. I tried:

sed 's/^NC_*Eluc_V3 //' 

and the same pattern without the "^" anchor:

sed 's/NC_*Eluc_V3 //'

What I would like in the end is:

>Un_scaffold1210

Solution

  • Try with:

    sed 's/^>.*Un_/>Un_/'
    

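    This matches > at the beginning of the line, followed by any characters (.*), up to and including Un_, and replaces the whole match with just >Un_. Your attempts fail because * on its own is not a wildcard: it repeats the preceding character, so NC_* means "NC followed by zero or more underscores". In addition, the header starts with >NW_, not NC_, so ^NC can never match.

    It is easier to anchor on the two markers you gave us (> and Un_) than to guess what lies between them, as your patterns try to do.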
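A quick way to check this is to run both patterns against the sample header from the question. The following is a minimal sketch, assuming a POSIX shell and a standard sed; the variable names (header, unchanged, fixed, seq_line) and the test sequence line are only illustrative.

```shell
# Sample header copied from the question.
header='>NW_017859640.1 Esox lucius isolate CL-BC-CA-002 unplaced genomic scaffold, Eluc_V3 Un_scaffold1210'

# The pattern from the question: '_*' means "zero or more underscores",
# and the header contains no "NC", so nothing matches and the line is unchanged.
unchanged=$(printf '%s\n' "$header" | sed 's/NC_*Eluc_V3 //')
printf '%s\n' "$unchanged"

# The pattern from the answer: '.*' matches any run of characters,
# so everything from '>' through 'Un_' is replaced by '>Un_'.
fixed=$(printf '%s\n' "$header" | sed 's/^>.*Un_/>Un_/')
printf '%s\n' "$fixed"

# Lines that do not start with '>' (sequence lines) are left alone,
# even if they happen to contain "Un_" (made-up line for illustration).
seq_line=$(printf '%s\n' 'ACGTUn_ACGT' | sed 's/^>.*Un_/>Un_/')
printf '%s\n' "$seq_line"
```

To rewrite a whole file, the same expression can be applied directly, for example sed 's/^>.*Un_/>Un_/' input.fasta > output.fasta, where input.fasta and output.fasta are placeholder names. Note that .* is greedy, so if a header contained Un_ more than once, everything up to the last occurrence would be removed.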