
AWK FPAT not working as expected for string parsing


I have to parse a very long string (from stdin); it is basically a .sql file. I need to extract the data from it so that I can convert it to CSV, and I am using awk for this. A sample snippet (two records) is as follows:

b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"
echo $b|awk 'BEGIN {FPAT = "([^\\)]+)|('\''[^'\'']+'\'')"}{print $1}'

In my regex, I am saying: split on the ")" bracket, or, if single quotes are found, ignore all text until the closing quote is found. But my output is as follows:

(abc@xyz.com,www.example.com,'field2,(2

I am expecting this output:

(abc@xyz.com,www.example.com,'field2,(2)'

Where is the problem in my code? I have searched a lot and checked the awk manual, but without success.


Solution

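  • My first answer below was wrong. Your FPAT fails because its first alternative, [^)]+, matches from the opening ( up to the first ) whether or not that ) is inside quotes, so the quoted alternative never gets a chance at that position. There is an ERE for what you're trying to do: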

    $ echo "$b" | awk -v FPAT="[(]([^)']|'[^']*')*[)]" '{for (i=1; i<=NF; i++) print $i}'
    (abc@xyz.com,www.example.com,'field2,(2)')
    (dfr@xyz.com,www.example.com,'field0')
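
    For awks without FPAT, a quote-aware ERE in the spirit of the one above can be applied with match() in a loop. The following is a minimal sketch, not a definitive implementation: split_records is a hypothetical helper name, the regex is passed in with -v only so that the single quotes do not clash with the shell quoting, and the first alternative is written [^)'] so that a quote can only be consumed as part of a quoted string. It assumes quoted strings contain no escaped quotes.

```shell
# Sketch: split_records is a hypothetical helper, not part of the original answer.
# Assumes quoted strings never contain an escaped single quote.
split_records() {
    awk -v re="[(]([^)']|'[^']*')*[)]" '
        {
            rest = $0
            # Peel off one (...) record at a time; a ) inside quotes is
            # consumed by the quoted alternative, so it cannot end a record.
            while ( match(rest, re) ) {
                print substr(rest, RSTART, RLENGTH)
                rest = substr(rest, RSTART + RLENGTH)
            }
        }
    '
}

b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"
printf '%s\n' "$b" | split_records
# (abc@xyz.com,www.example.com,'field2,(2)')
# (dfr@xyz.com,www.example.com,'field0')
```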
    

    Original answer, left as a different approach:

    You need a 2-pass approach: first replace every ) within quoted fields with something that can't already exist in the input (e.g. RS), then identify the (...) fields and turn the RSs back into )s before printing them:

    $ echo "$b" |
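    awk -F"'" -v OFS= '
        {
            # Pass 1: even-numbered fields are the quoted strings; hide their )s.
            for (i=2; i<=NF; i+=2) {
                gsub(/[)]/,RS,$i)
                $i = FS $i FS        # restore the surrounding quotes
            }
            # Pass 2: re-split the rebuilt record into (...) fields.
            FPAT = "[(][^)]*[)]"
            $0 = $0
            for (i=1; i<=NF; i++) {
                gsub(RS,")",$i)      # put the hidden )s back
                print $i
            }
            FS = FS                  # switch back to FS splitting for the next record
        }
    '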
    (abc@xyz.com,www.example.com,'field2,(2)')
    (dfr@xyz.com,www.example.com,'field0')
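
    To see what the first pass does on its own, the sketch below prints the intermediate record. It is illustrative only: mask_quoted_parens is a hypothetical helper name, and the visible token <RP> stands in for RS purely so the result can be read on one line (the assumption being that <RP> never occurs in the input).

```shell
# Sketch: mask_quoted_parens is a hypothetical helper; <RP> is an assumed
# placeholder used instead of RS so the masked text stays readable.
mask_quoted_parens() {
    awk -F"'" -v OFS= '
        {
            # With FS set to a single quote, even-numbered fields are the
            # text that was between quotes.
            for (i=2; i<=NF; i+=2) {
                gsub(/[)]/, "<RP>", $i)
                $i = FS $i FS    # put the quotes back around the field
            }
            print
        }
    '
}

b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"
printf '%s\n' "$b" | mask_quoted_parens
# (abc@xyz.com,www.example.com,'field2,(2<RP>'),(dfr@xyz.com,www.example.com,'field0'),
```

    After this step every remaining ) closes a record, which is why the simple pattern [(][^)]*) is enough in the second pass.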
    

    The above is gawk-only due to FPAT (or we could have used gawk's patsplit()); with other awks you'd use a while-match()-substr() loop:

    $ echo "$b" |
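    awk -F"'" -v OFS= '
        {
            # Pass 1: even-numbered fields are the quoted strings; hide their )s.
            for (i=2; i<=NF; i+=2) {
                gsub(/[)]/,RS,$i)
                $i = FS $i FS        # restore the surrounding quotes
            }
            # Pass 2: peel off one (...) field at a time.
            while ( match($0,/[(][^)]*[)]/) ) {
                field = substr($0,RSTART,RLENGTH)
                gsub(RS,")",field)   # put the hidden )s back
                print field
                $0 = substr($0,RSTART+RLENGTH)
            }
        }
    '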
    (abc@xyz.com,www.example.com,'field2,(2)')
    (dfr@xyz.com,www.example.com,'field0')
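
    Since the stated goal is CSV, here is one possible way to go from the (...) records to CSV lines in any POSIX awk. This is a rough sketch under several assumptions, not a general SQL parser: records_to_csv and to_csv are hypothetical names, quoted strings are assumed to contain no escaped single quotes, and unquoted fields are assumed not to need CSV quoting.

```shell
# Sketch: records_to_csv / to_csv are hypothetical names. Assumes no escaped
# quotes inside '...' strings and that unquoted fields need no CSV quoting.
records_to_csv() {
    awk -v re="[(]([^)']|'[^']*')*[)]" -v qre="^'[^']*'" '
        # Convert the text between the outer parentheses into one CSV line.
        function to_csv(rec,    out, f, n) {
            out = ""
            for (;;) {
                if (match(rec, qre)) {               # quoted field
                    n = RLENGTH
                    f = substr(rec, 2, n - 2)        # strip the single quotes
                    gsub(/"/, "\"\"", f)             # CSV-escape double quotes
                    f = "\"" f "\""
                } else {                             # unquoted field
                    n = index(rec, ",")
                    n = (n ? n - 1 : length(rec))
                    f = substr(rec, 1, n)
                }
                out = out f
                if (n >= length(rec)) return out
                out = out ","
                rec = substr(rec, n + 2)             # skip field and comma
            }
        }
        {
            rest = $0
            while ( match(rest, re) ) {
                rec  = substr(rest, RSTART + 1, RLENGTH - 2)
                rest = substr(rest, RSTART + RLENGTH)
                print to_csv(rec)
            }
        }
    '
}

b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"
printf '%s\n' "$b" | records_to_csv
# abc@xyz.com,www.example.com,"field2,(2)"
# dfr@xyz.com,www.example.com,"field0"
```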