Find multiple strings in a file scattered over multiple lines using awk


I am fairly new to awk and was looking for an awk "one-liner" that finds files containing all three of a given set of strings, where the strings appear on different lines of the file(s). From various sites I pieced together the awk command below, which looks for the strings "dna-advant", "vty 5 15" and "vty 16 31". I wanted it to print only the filename, and only if all three strings are found, and that works. But although it works, I do not understand how it works.

What is the purpose of FNR == 1? I do not understand the last part either:

... { r=1; print FILENAME; nextfile } END { exit 1-r }'

Can someone explain it to me? Is there a possible shorter way of doing the same? :-)

gawk 'FNR == 1 { s1 = s2 = s3 = 0 }
    /dna-advant/ { s1 = 1 }
    /vty 16 31/ { s2 = 1 }
    /vty 5 15/ { s3 = 1 }
    s1 && s2 && s3 { r = 1; print FILENAME; nextfile }
    END { exit 1 - r }' k*

I tried using grep, but grep -P is not available on my system. Since I am trying to understand awk more deeply, I would really like to solve this with awk. Hopefully someone here can explain how this command works and possibly come up with a shorter version of it!


Solution

  • The FNR==1 means that the following piece of code gets executed on the first line of each file that is opened. That code just resets all the s1/s2/s3 flags as each file is opened.

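    FNR is awk's built-in count of the records read so far from the current input file. It restarts at 1 for each new file, whereas NR keeps counting across all files.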
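
    To see the per-file reset in action, here is a minimal sketch that prints NR and FNR side by side. The file names and contents are made up for the demonstration, and plain awk is used; gawk behaves the same way here.

    ```shell
    # Two throwaway sample files (hypothetical names and contents).
    dir=$(mktemp -d)
    printf 'a\nb\n' > "$dir/one.txt"
    printf 'c\nd\n' > "$dir/two.txt"

    # NR keeps counting across files; FNR restarts at 1 for each file.
    out=$(cd "$dir" && awk '{ print FILENAME, NR, FNR }' one.txt two.txt)
    printf '%s\n' "$out"
    # one.txt 1 1
    # one.txt 2 2
    # two.txt 3 1
    # two.txt 4 2

    rm -r "$dir"
    ```

    Because FNR is 1 again on the first line of two.txt, a rule guarded by FNR == 1 is a convenient place to reset per-file state.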


    The r variable exists purely to set the exit status. The END block exits with 0 (success) if at least one matching file was found, or with 1 (failure) if none was found, because 1-r evaluates to 1 when r was never set. This allows the command to be used like this:

    if gawk ... k* ; then ...
    

    Or like this:

    gawk ...
    status=$?
    ...
    if [ "$status" -eq 1 ] ; then
       ...
    fi
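
    As a rough check of both outcomes, the sketch below runs the program from the question once where a matching file exists and once where none does. The sample files are invented for the demonstration, and it assumes an awk that supports nextfile (gawk does, as do recent mawk and BWK awk).

    ```shell
    # The program from the question, stored in a variable for reuse.
    prog='FNR == 1 { s1 = s2 = s3 = 0 }
    /dna-advant/ { s1 = 1 }
    /vty 16 31/  { s2 = 1 }
    /vty 5 15/   { s3 = 1 }
    s1 && s2 && s3 { r = 1; print FILENAME; nextfile }
    END { exit 1 - r }'

    # Hypothetical sample files: one has all three strings, one lacks "vty 16 31".
    dir=$(mktemp -d)
    printf 'dna-advant\nline vty 5 15\nline vty 16 31\n' > "$dir/k-full.cfg"
    printf 'dna-advant\nline vty 5 15\n' > "$dir/k-partial.cfg"

    # At least one file matches, so awk exits with 0.
    status_hit=0
    awk "$prog" "$dir/k-full.cfg" "$dir/k-partial.cfg" > /dev/null || status_hit=$?

    # No file matches, so r stays unset and awk exits with 1.
    status_miss=0
    awk "$prog" "$dir/k-partial.cfg" > /dev/null || status_miss=$?

    echo "hit=$status_hit miss=$status_miss"
    # hit=0 miss=1

    rm -r "$dir"
    ```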
    

    If you don't need the exit status, you can omit everything pertaining to r, including the entire END block. Keeping it costs almost nothing in performance, though, and it could be useful at some point.
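
    For reference, a sketch of that shortened form, with r and the END block removed, run against made-up sample files. It again assumes an awk with nextfile support; with this version the exit status no longer tells you whether anything matched.

    ```shell
    # Hypothetical sample files: one has all three strings, one lacks "vty 16 31".
    dir=$(mktemp -d)
    printf 'dna-advant\nline vty 5 15\nline vty 16 31\n' > "$dir/k-full.cfg"
    printf 'dna-advant\nline vty 5 15\n' > "$dir/k-partial.cfg"

    # Same logic as the question's program, minus r and the END block.
    matches=$(cd "$dir" && awk '
        FNR == 1       { s1 = s2 = s3 = 0 }         # new file: reset the flags
        /dna-advant/   { s1 = 1 }
        /vty 16 31/    { s2 = 1 }
        /vty 5 15/     { s3 = 1 }
        s1 && s2 && s3 { print FILENAME; nextfile } # all seen: report and skip the rest
    ' k*)
    printf '%s\n' "$matches"
    # k-full.cfg

    rm -r "$dir"
    ```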


    The code is already pretty efficient: thanks to nextfile it stops reading a file as soon as all three strings have been seen, so it reads as little as possible, and it is easy to read.

    You could probably build an alternative version with grep along these lines, but I wouldn't bother (and didn't):

    grep -l 'dna-advant' k* | xargs grep -l 'vty 16 31' | xargs grep -l 'vty 5 15'
    

    You can optimize this by searching for the least likely string first, so that the later greps receive fewer files.
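
    One caveat with a plain xargs pipeline is that file names containing spaces get split into several arguments. A possible variant, sketched here on made-up sample files, passes the names NUL-delimited instead. It assumes a grep that supports --null and an xargs that supports -0 (the GNU and BSD versions do).

    ```shell
    # Hypothetical sample files; one name deliberately contains spaces.
    dir=$(mktemp -d)
    printf 'dna-advant\nline vty 5 15\nline vty 16 31\n' > "$dir/k all three.cfg"
    printf 'dna-advant\nline vty 5 15\n' > "$dir/k-partial.cfg"

    # Each stage narrows the list; --null / -0 keep odd file names intact.
    found=$(cd "$dir" &&
        grep -l --null 'dna-advant' k* |
        xargs -0 grep -l --null 'vty 16 31' |
        xargs -0 grep -l 'vty 5 15')
    printf '%s\n' "$found"
    # k all three.cfg

    rm -r "$dir"
    ```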