Trying to find a match of two strings in a file using awk


Hello, I am trying to find a pattern match in some HTML files using awk, but I am not having any luck with it.

The files contain table rows like the following:

<tr>
                    <td>Failures</td>
                    <td>0</td>
                </tr>
                <tr>
                    <td>Warnings</td>
                    <td>4</td>
                </tr>
                <tr>
                    <td>Errors</td>
                    <td>0</td>
                </tr>
                <tr>
                    <td>Not Applicable</td>
                    <td>53</td>
                </tr>
                <tr>
                    <td>Manual Checks</td>
                    <td>9</td>
                </tr>

Failures and Manual Checks should both be zero. In the file above, Failures is 0 but Manual Checks is 9, so it should not match. I need to match only when Failures is 0 and Manual Checks is 0.

I tried with and without escaping the newline, but awk does not return any results.

find . -name "*.html" -exec awk '/td\>Failures\<\/td\>\\n.*\<td\>0/ {print FILENAME}' '{}' \;

I have also tried other combinations, like the one below, but I can't figure out why awk is not going to the next line.

find . -name "*.html" -exec awk '/td\>Failures\<\/td\>\\n\[\^\\\<\]\+\<td\>0/ {print FILENAME}' '{}' \;

Can anyone please have a look and tell me what I am missing?


Solution

  • A more reliable solution would be based on a tool designed to parse HTML; having said that ...

    One awk idea using a couple of custom regex patterns (a multi-character RS is treated as a regex by GNU awk; POSIX leaves its behavior unspecified):

    $ cat regex.awk
    BEGIN { RS="^$"                                                 # whole file treated as a single record
            regex1="<td>Manual Checks</td>[[:space:]]+<td>0</td>"
            regex2="<td>Failures</td>[[:space:]]+<td>0</td>"
          }
    $0 ~ regex1 && $0 ~ regex2 {print FILENAME}
    

    NOTE: placing the code in a file (regex.awk) will make the follow-on find/awk quite a bit cleaner
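
For context on why the attempts in the question return nothing: by default awk splits its input into records at each newline, so a pattern is only ever tested against one line at a time and a `\n` inside the regex has nothing to match. A minimal sketch of that behavior, assuming a made-up two-line sample and a hypothetical helper name `count_cross_line_matches` (neither is from the original post):

```shell
# Hypothetical helper: counts records and how many of them match a
# pattern that tries to span a newline.
count_cross_line_matches() {
    awk '/Failures<\/td>\n/ { n++ } END { print "records=" NR, "matches=" n+0 }'
}

# Two lines in, two records seen, and the cross-line pattern never matches.
printf '<td>Failures</td>\n<td>0</td>\n' | count_cross_line_matches
```

This is why the script above changes RS: once the whole file is a single record, the newline between the two `<td>` lines is part of `$0` and `[[:space:]]+` can match across it.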

    Sample input:

    $ cat f1.html
    ... snip ...
                        <td>Failures</td>
                        <td>0</td>                         # match
    ... snip ...
                        <td>Manual Checks</td>
                        <td>9</td>                         # not a match
    ... snip ...
    
    $ cat f2.html
    ... snip ...
                        <td>Failures</td>
                        <td>0</td>                         # match
    ... snip ...
                        <td>Manual Checks</td>
                        <td>0</td>                         # match
    ... snip ...
    

    NOTE: comments added for clarification; comments do not exist in the actual files

    Adding this to a find call:

    $ find . -name "f?.html" -exec awk -f regex.awk '{}' \;
    ./f2.html
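
If GNU awk is not available, a line-by-line variant may also work. The following is only a sketch, not the approach from the answer above: it assumes each label cell and its value cell sit on separate lines, as in the sample, and the function name `zero_failures_and_manual` is invented for illustration.

```shell
# Hypothetical helper: prints the file name when both the Failures and
# the Manual Checks rows are followed by a <td>0</td> value cell.
zero_failures_and_manual() {
    awk '
        # Remember which label was seen most recently.
        /<td>Failures<\/td>/      { label = "failures"; next }
        /<td>Manual Checks<\/td>/ { label = "manual";   next }

        # The next <td> line after a label holds the value for that label.
        label != "" && /<td>/ {
            if ($0 ~ /<td>0<\/td>/) zero[label] = 1
            label = ""
        }

        # Report the file only when both values were zero.
        END { if (zero["failures"] && zero["manual"]) print FILENAME }
    ' "$1"
}

# Example call for a single file:
# zero_failures_and_manual ./f2.html
```

Because it keeps state between lines instead of changing RS, this version uses only standard awk features, at the cost of being tied more closely to the exact row layout.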