bash awk fasta

Retaining text after a delimiter in FASTA headers using awk


I have what should be a simple problem, but my lack of awk knowledge is holding me back.

I would like to clean up the headers of a fasta file that is in this format:

>HWGG454_Clocus2_Locus3443_allele1
ATTCTACTACTACTCT
>GHW757_clocus37_Locus555662_allele2
CTTCCCTACGATG
>TY45_clocus23_Locus800_allele0
TTCTACTTCATCT

I would like to clean up each header (line starting with ">") to retain only the informative part, which is the second "Locus*" field (for example Locus3443), with or without the allele part.

I thought awk would be the easy way to do this, but I can't quite get it to work.

If I wanted to retain just the first column of text up to the "_" delimiter for the header, and the sequences below, I run this (assuming this toy example is in the file test.fasta):

 cat test.fasta | awk -F '_' '{print $1}'

>HWGG454
ATTCTACTACTACTCT
>GHW757
CTTCCCTACGATG
>TY45
TTCTACTTCATCT

But what I want is to retain just the "Locus*" text, which is the third field (after the second delimiter). Using this code, I get this:

cat test.fasta | awk -F '_' '{print $3}'
Locus3443

Locus555662

Locus800

What am I doing wrong here?

thanks.


Solution

  • I understand this to mean that you want to print only the Locus field from the header lines and leave all other lines unchanged. Then:

    awk -F _ '/^>/ { print $3; next } 1' filename
    

    is perhaps the easiest way. This works as follows:

    /^>/ {      # in lines that begin with >
      print $3  # print the third field
      next      # and go to the next line.
    }
    1           # print other lines unchanged. Here 1 means true, and the
                # default action (unchanged printing) is performed.
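To see the effect on the toy input from the question, here is a minimal sketch. It recreates test.fasta (the file name used in the question) with printf; the variable names are only for illustration. Note that print $3 drops the leading ">", because that character belongs to the first field. If the headers should remain FASTA headers, prepending ">" is one option, shown in the second command. That variant is an assumption about what is wanted, not part of the original answer.

```shell
# Recreate the toy input from the question.
printf '%s\n' \
  '>HWGG454_Clocus2_Locus3443_allele1' 'ATTCTACTACTACTCT' \
  '>GHW757_clocus37_Locus555662_allele2' 'CTTCCCTACGATG' \
  '>TY45_clocus23_Locus800_allele0' 'TTCTACTTCATCT' > test.fasta

# The command from the answer: headers become field 3, sequences pass through.
plain=$(awk -F _ '/^>/ { print $3; next } 1' test.fasta)

# Illustrative variant: put the ">" back so headers stay FASTA headers.
with_marker=$(awk -F _ '/^>/ { print ">" $3; next } 1' test.fasta)

printf '%s\n' "$plain"
echo '---'
printf '%s\n' "$with_marker"
```

The first listing starts with Locus3443 followed by ATTCTACTACTACTCT; the second starts with >Locus3443. To keep the allele part as well, print ">" $3 "_" $4 in place of print ">" $3.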
    

    The thing to understand here is awk's control flow: awk code consists of conditions with associated actions, and the actions are performed if the condition evaluates to true.
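This also explains the blank lines in the question: a bare { print $3 } has no condition, so it runs on every line, including the sequence lines, which contain no "_" and therefore have only one field. A small sketch that makes this visible by printing the field count (NF) next to the third field; the sample lines come from the question and the variable names are hypothetical:

```shell
# One header line and one sequence line from the toy input.
input='>TY45_clocus23_Locus800_allele0
TTCTACTTCATCT'

# For every line, print the number of fields and the third field in brackets.
fields=$(printf '%s\n' "$input" | awk -F _ '{ print NF ": [" $3 "]" }')

printf '%s\n' "$fields"
# 4: [Locus800]
# 1: []
```

The sequence line has a single field, so $3 is empty there and print $3 emits an empty line.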

    /^>/ is a regex match over the whole record (line by default); it is true if the line begins with > (because ^ matches the beginning), so

    /^>/ { print $3; next }
    

    will make awk execute print $3; next in lines that begin with >. The less straightforward part is

    1
    

    which prints lines unchanged. We only get here if the first action was not executed (because of the next in it), and this 1 is to be read as a condition that is always true -- nonzero values being true in awk.
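The role of next can be checked by leaving it out. In this sketch (sample lines taken from the question, variable names hypothetical), the header line matches both rules when next is missing, so it is printed twice: once as the third field and once unchanged.

```shell
# One header line and one sequence line from the toy input.
input='>TY45_clocus23_Locus800_allele0
TTCTACTTCATCT'

# Without next, the always-true rule 1 also fires for the header line.
without_next=$(printf '%s\n' "$input" | awk -F _ '/^>/ { print $3 } 1')

# With next, awk skips the remaining rules for the header line.
with_next=$(printf '%s\n' "$input" | awk -F _ '/^>/ { print $3; next } 1')

printf 'without next:\n%s\n\nwith next:\n%s\n' "$without_next" "$with_next"
```

Without next the output has three lines (Locus800, the full original header, then the sequence); with next it has two.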

    Now, if either the condition or the action in an awk statement is omitted, a default is used. The default action is to print the line unchanged, and the bare 1 takes advantage of that. It would be equally possible to write

    1 { print }
    

    or

    { print }
    

    In the latter case, the condition is omitted and the default condition "true" is used. 1 is the shortest variant of this and idiomatic because of it.
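As a quick check that the three spellings behave the same, the following sketch runs the full program with each of them on two lines from the question and compares the results. The variable names are only for illustration.

```shell
# One header line and one sequence line from the toy input.
input='>HWGG454_Clocus2_Locus3443_allele1
ATTCTACTACTACTCT'

# The same program with three spellings of the "print everything else" rule.
a=$(printf '%s\n' "$input" | awk -F _ '/^>/ { print $3; next } 1')
b=$(printf '%s\n' "$input" | awk -F _ '/^>/ { print $3; next } 1 { print }')
c=$(printf '%s\n' "$input" | awk -F _ '/^>/ { print $3; next } { print }')

# Report whether all three outputs are identical.
if [ "$a" = "$b" ] && [ "$b" = "$c" ]; then
  echo "identical"
else
  echo "different"
fi
```

This prints identical; which spelling to use is a matter of taste and readability.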