
Filtering a multi-line file according to multiple patterns


The following Awk script should print every line of the input file except lines that begin with "HETATM" and also contain "lig" or "lih" somewhere in the same line, and except lines that start with "END". Finally, it should add "END" at the end of the filtered file:

awk '!/^HETATM/ && /lig|lih |^END/; END {print "END"}' test.pdb >> ./processed.pdb

but in fact it removes almost all lines, producing an otherwise empty file with END at the end, so nearly everything is filtered out. A possible solution:

awk '!/^HETATM.*(lig|lih)/ && !/^END/; END {print "END"}'

Will it work correctly?


Solution

  • should add "END" at the end of the filtered file

    Keep in mind that in GNU AWK, END fires once, after all files have been processed, as opposed to ENDFILE (a gawk extension), which fires after each file. As a simple example, let the contents of file1.txt, file2.txt and file3.txt be 1, 2 and 3 respectively; then

    awk '{print}ENDFILE{print "endfile"}END{print "end"}' file1.txt file2.txt file3.txt
    

    gives output

    1
    endfile
    2
    endfile
    3
    endfile
    end
    

    You can safely ignore this detail if you are guaranteed that exactly one file will ever be passed to your awk command.

    (tested in GNU Awk 5.1.0)
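
As for whether the proposed filter works: the original command prints only lines that do not start with HETATM and that do match lig, "lih " or ^END, which is why almost nothing survives. The proposed version negates both conditions instead. Below is a minimal sketch for checking it on a small input; the file names sample.pdb and processed.pdb and the sample records are made up for illustration (simplified columns, not real PDB data), and > is used instead of >> so that re-running does not append to an earlier result.

```shell
# Hypothetical sample input; the column layout is simplified, not real PDB output.
cat > sample.pdb <<'EOF'
ATOM      1  N   MET A   1
HETATM    2  C1  lig A 201
HETATM    3  O   HOH A 301
HETATM    4  C2  lih A 202
END
EOF

# Drop HETATM lines that mention lig or lih, drop any existing END line,
# then print a single END once all input has been read.
awk '!/^HETATM.*(lig|lih)/ && !/^END/; END {print "END"}' sample.pdb > processed.pdb

cat processed.pdb
```

On this sample the ATOM line and the HETATM line for HOH should be kept, followed by one END line. Note that /^END/ matches any line starting with END, so ENDMDL records in multi-model PDB files would be dropped as well; anchoring the pattern as /^END *$/ is one way to avoid that if it matters for your data.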