string awk string-parsing command-line-interface html-content-extraction

Awk only processes first line of input file? Extract attribute values from HTML elements

I have a huge text file filled with HTML option elements. I only want the value of each value attribute. For example:

<option value="API" datatype="string" datatype_value="0">API</option>
<option value="Account" datatype="string" datatype_value="0">Account</option>
<option value="Address - asn" datatype="string" datatype_value="0">Address - asn</option>

I only want "API" after 'option value'.

Right now I have this:

awk -F "option value=" '{print $2}' /inputFilePath | awk '{print $1}'

It works, but ONLY on the first line of the file. So when I run the command above on the file, the output is just:

"API"

And not "Account", "Address" or anything after.

Any thoughts on anything I could be doing wrong? Thanks in advance!

Solution

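Your command prints one $2 per record, so a single line of output means awk is seeing the whole file as one record (the elements are not separated by newlines). Modify RS instead, so that each <option value=" starts a new record: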

awk 'BEGIN { RS = "<option value=\"" ; FS = "\""; } NF { print $1 }' file

Output:

API
Account
Address - asn

Note that this needs an awk that supports a multi-character RS, which POSIX leaves unspecified. gawk supports it, but nawk doesn't.
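If your awk does not support a multi-character RS, a loop over the two-argument match() with RSTART and RLENGTH should behave the same way in any POSIX awk. This is only a sketch: the function name extract_values is made up for illustration, and it assumes the attribute is always written exactly as <option value="...">.

```shell
# Hypothetical helper name; reads the named files, or stdin when none are given.
# Usage: extract_values file
extract_values() {
    awk '{
        t = $0
        # Find each <option value="..." on the line, left to right.
        while (match(t, /<option value="[^"]*"/)) {
            # The prefix <option value=" is 15 characters long; the match
            # also includes the closing quote, hence RLENGTH - 16.
            print substr(t, RSTART + 15, RLENGTH - 16)
            # Continue scanning after the current match.
            t = substr(t, RSTART + RLENGTH)
        }
    }' "$@"
}
```

Because it scans within each record instead of changing RS, it does not matter whether the elements sit on one long line or on separate lines.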

Another option, using GNU awk (the three-argument match() is a gawk extension):

gawk '{ t = $0; while (match(t, /<option value="([^"]*)"(.*)/, a)) { print a[1]; t = a[2] } }' file

I used [^"]* deliberately, since I consider empty values still valid for your query, but you can change it to [^"]+ if you prefer to skip them.
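To check why the original command printed only one value, it can help to count how many option elements awk finds in each record. A minimal sketch, reusing the field separator from the question (count_options is a hypothetical name):

```shell
# Hypothetical helper name; prints "record number: option count" per input line.
# Usage: count_options /inputFilePath
count_options() {
    awk -F 'option value=' '{
        # Splitting on the separator yields one more field than matches;
        # an empty line has NF == 0, so report 0 for it.
        n = (NF > 0) ? NF - 1 : 0
        print NR ": " n
    }' "$@"
}
```

If this prints a single line with a large count, the whole file is one record, and printing only $2 can only ever return the first value.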