Search code examples
csvawkgawk

Csv optional quote with gawk


I need to use "FPAT" or the equivalent function "patsplit" of gawk. But it seems the installed version of gawk is 3.1.5 on our CentOs server.

I tried updating gawk with these commands:

yum update gawk;

And the server showed: "No Packages marked for Update"

I also tried reinstalling gawk with:

 yum install gawk;

server output: "Package gawk-3.1.5-15.el5.x86_64 already installed and latest version "

Where i need gawk 4.0 or above to use those FPAT OR patsplit. And why i need to use them? well i am trying to process a CSV file, and it seems the CSV file has optional quotation and embedded comma.

Example:

From a csv row like this:

this,is,a,"csv,with,embedded coma"

I need to split the fields like this:

this

is

a

"csv,with,embedded comma"

And here is the gawk code:

awk '{patsplit("this,is,a,\"csv,with,embedded comma\"",a,"([^,]*)|(\"([^\"]|\"\")+\"[^,]*)",seps); for(i=0;i<length(a);i++) print a[i];}';

Can anyone help me on this please ?


Solution

  • Try using csvquote in your pipeline to make the data easy for awk to interpret. This is a script I wrote that replaces the commas inside quoted fields with nonprinting characters, and then restores them.

    So if your awk command looked like this originally:

    awk -F, '{print $3 "," $5}' inputfile.csv
    

    ... it can be made to work with csv quoted separators like this:

    csvquote inputfile.csv | awk -F, '{print $3 "," $5}' | csvquote -u
    

    For code and more documentation, see https://github.com/dbro/csvquote