Awk only processes first line of input file? Extract attribute values from HTML elements


I have a huge text file filled with HTML elements. I only want the value of each element's value attribute. Ex:

<option value="API" datatype="string" datatype_value="0">API</option>
<option value="Account" datatype="string" datatype_value="0">Account</option>
<option value="Address - asn" datatype="string" datatype_value="0">Address - asn</option>

I only want "API" after 'option value'.

Right now I have this:

awk -F "option value=" '{print $2}' /inputFilePath | awk '{print $1}'

It works, but ONLY on the first line of the file. So when I run the command above on the file, my output is only:

"API"

And not "Account", "Address" or anything after.

Any thoughts on anything I could be doing wrong? Thanks in advance!


Solution

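  • Your symptom suggests the whole file is really one long line, so awk sees a single record and prints only the first match. Modify RS instead: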

    awk 'BEGIN { RS = "<option value=\"" ; FS = "\""; } NF { print $1 }' file
    

    Output:

    API
    Account
    Address - asn
    

    I just hope it works with your awk: POSIX leaves a multi-character RS unspecified, and nawk doesn't support it.
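If your awk lacks multi-character RS support, a more portable sketch is to keep the default RS and loop over each record with POSIX match(), RSTART and RLENGTH. The function name extract_option_values and the single-line sample input are made up for illustration:

```shell
# Hypothetical helper: reads the files given as arguments, or stdin if none.
extract_option_values() {
    awk '{
        t = $0
        # match() sets RSTART and RLENGTH; loop over every occurrence in the record.
        while (match(t, /<option value="[^"]*"/)) {
            # The prefix <option value=" is 15 characters long; also drop the closing quote.
            print substr(t, RSTART + 15, RLENGTH - 16)
            # Continue searching in the text after this match.
            t = substr(t, RSTART + RLENGTH)
        }
    }' "$@"
}

# Sample input with all elements on one line (an assumption about the real file).
printf '%s\n' '<option value="API" datatype="string" datatype_value="0">API</option><option value="Account" datatype="string" datatype_value="0">Account</option><option value="Address - asn" datatype="string" datatype_value="0">Address - asn</option>' |
    extract_option_values
```

On the sample this prints API, Account and Address - asn, one per line. It uses no extensions, so it should behave the same in gawk, mawk and nawk.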

    Yet another approach, using GNU awk (the three-argument match() is a gawk extension):

    gawk '{ t = $0; while (match(t, /<option value="([^"]*)"(.*)/, a)) { print a[1]; t = a[2] } }' file
    

    I used [^"]* explicitly because I think empty values are still valid for your query, but you can change it to [^"]+ if you prefer.
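To see the difference between the two quantifiers, here is a small sketch. It uses plain POSIX match() instead of the gawk-only array form so that it runs in any awk; the helper name values_with_quantifier, the sample string and the brackets printed around each value are assumptions for illustration:

```shell
# Hypothetical helper: $1 is the quantifier to test, "*" or "+". Reads stdin.
values_with_quantifier() {
    awk -v q="$1" '
    # Build the pattern as a dynamic regex so the quantifier can vary.
    BEGIN { re = "<option value=\"[^\"]" q "\"" }
    {
        t = $0
        while (match(t, re)) {
            # Brackets make an empty value visible in the output.
            print "[" substr(t, RSTART + 15, RLENGTH - 16) "]"
            t = substr(t, RSTART + RLENGTH)
        }
    }'
}

sample='<option value="">none</option><option value="API">API</option>'
printf '%s\n' "$sample" | values_with_quantifier '*'   # prints [] then [API]
printf '%s\n' "$sample" | values_with_quantifier '+'   # prints [API] only
```

With * the element whose value is empty still produces an (empty) result; with + it is skipped entirely.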