I have to parse a very large length string (from stdin). It is basically a .sql file. I have to get data from it. I am working to parse the data so that I can convert it into csv. For this, I am using awk. For my case, A sample snippet (of two records) is as follows:
b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"
echo $b|awk 'BEGIN {FPAT = "([^\\)]+)|('\''[^'\'']+'\'')"}{print $1}'
In my regex, I am saying that split on ")" bracket or if single quotes are found then ignore all text until last quote is found. But my output is as follows:
(abc@xyz.com,www.example.com,'field2,(2
I am expecting this output
(abc@xyz.com,www.example.com,'field2,(2)'
Where is the problem in my code. I am search a lot and check awk manual for this but not successful.
My first answer below was wrong, there is an ERE for what you're trying to do:
$ echo "$b" | awk -v FPAT="[(]([^)]|'[^']*')*)" '{for (i=1; i<=NF; i++) print $i}'
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')
Original answer, left as a different approach:
You need a 2-pass approach first to replace all )
s within quoted fields with something that can't already exist in the input (e.g. RS) and then to identify the (...)
fields and put the RSs back to )
s before printing them:
$ echo "$b" |
awk -F"'" -v OFS= '
{
for (i=2; i<=NF; i+=2) {
gsub(/)/,RS,$i)
$i = FS $i FS
}
FPAT = "[(][^)]*)"
$0 = $0
for (i=1; i<=NF; i++) {
gsub(RS,")",$i)
print $i
}
FS = FS
}
'
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')
The above is gawk-only due to FPAT (or we could have used gawk patsplit()
), with other awks you'd used a while-match()-substr() loop:
$ echo "$b" |
awk -F"'" -v OFS= '
{
for (i=2; i<=NF; i+=2) {
gsub(/)/,RS,$i)
$i = FS $i FS
}
while ( match($0,/[(][^)]*)/) ) {
field = substr($0,RSTART,RLENGTH)
gsub(RS,")",field)
print field
$0 = substr($0,RSTART+RLENGTH)
}
}
'
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')