
AWK script to check first line of a file and then print the rest


I am trying to write an AWK script to parse a file of the form

> field1 - field2 field3 ...
lineoftext
anotherlineoftext
anotherlineoftext

I use a regex to check that the first line is valid (it begins with a > followed by at least one character), and then I want to print all the other lines. This is the script I wrote, but it only verifies that the file is in the correct format and then prints none of the lines.

#!/bin/bash
# FASTA parser

awk ' BEGIN { x = 0; }
{ if ($1 !~ />.*/ && x == 0)
    { print "Not a FASTA file"; exit; }
  else { x = 1; next; }
  print $0 }
END { print " - DONE - "; }'

Solution

  • Basically you can use the following awk command:

    awk 'NR==1 && /^>./ {p=1} p' file
    

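    On the first line (NR==1), it checks whether the line starts with a > followed by at least one character (/^>./). If that condition is true, the variable p is set to 1. The bare p at the end is a pattern without an action: whenever p evaluates to true, awk prints the current line. Note that this prints the header line as well, and that a file with an invalid first line produces no output at all.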

    If you want to print the error message, you need to invert the logic a bit:

    awk 'NR==1 && !/^>./ {print "Not a FASTA file"; exit 1} 1' file
    

    In this case the program prints the error message and exits with status 1 if the first line does not start with a > followed by at least one character. Otherwise all lines get printed, because the pattern 1 always evaluates to true.
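For reference, the script in the question prints nothing because its else branch runs next for every line that passes the check, so print $0 is never reached. Its pattern />.*/ is also unanchored, so it matches a > anywhere in the first field. Both commands above print the header line too. If only the lines after the header are wanted, a minimal sketch along these lines might do it. The function name fasta_body is hypothetical and used only for illustration, and the error message goes to standard output as in the answer:

```shell
#!/bin/bash
# FASTA check: validate the header line, then print the remaining lines.
# fasta_body is a hypothetical helper name used only for this sketch.
fasta_body() {
    awk '
        # First line: must start with ">" followed by at least one character.
        NR == 1 {
            if ($0 !~ /^>./) {
                print "Not a FASTA file"
                exit 1      # stop reading and return a non-zero status
            }
            next            # header is valid; skip it without printing
        }
        # Every later line is printed unchanged.
        { print }
    ' "$1"
}
```

Calling fasta_body file should print everything after the first line for a valid file. For an invalid file it should print only the error message and return exit status 1.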