awk, text-processing, text-parsing

How to use gawk to print the 3rd column when it is longer than 10 characters, even if the field contains a comma


I have a CSV file where some of the addresses contain a comma, so I can't simply use

$ awk -F',' 'length($3) >= 10 {print $3}' schools.csv

An example of my data looks like this:

id,name,address
"1","paul","103 avenue"
"2","shawn","108 BLVD, SE"
"3","ryan","MLK drive 1004"

As you can see, the address for id 2 contains a comma, so I have to use gawk version 4, which provides FPAT. So far I've been able to print every field of every row, whether or not it contains a comma, but I only want to print the 3rd column (address) when it is longer than 10 characters. Here is what I have so far:

    # awk.awk file
    BEGIN {
        # A field is either a run of non-comma characters or a double-quoted string
        FPAT = "([^,]+)|(\"[^\"]+\")"
    }

    {
        print "NF = ", NF
        for (i = 1; i <= NF; i++) {
            printf("$%d = <%s>\n", i, $i)
        }
    }

$ gawk -f awk.awk schools.csv

The desired output would just be

108 BLVD, SE or "108 BLVD, SE"


Solution

  • Since you are already using GNU awk, you can use gensub to strip the leading and trailing double quotes before measuring the length:

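    $ gawk 'BEGIN {
        # A field is either a (possibly empty) run of non-commas or a quoted string
        FPAT = "([^,]*)|(\"[^\"]+\")"
    }
    # Test the length of $3 with its surrounding quotes removed
    length(gensub(/^"|"$/, "", "g", $3)) >= 10 {
        print $3
    }' file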
    

    Output ("103 avenue" is exactly 10 characters, so it passes the >= 10 test; use > 10 to exclude it):

    "103 avenue"
    "108 BLVD, SE"
    "MLK drive 1004"
    

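    If you want the output without the double quotes as well, strip them from $3 before testing its length:

    $ gawk 'BEGIN {
        FPAT = "([^,]*)|(\"[^\"]+\")"
    }
    {
        gsub(/^"|"$/, "", $3)    # remove the surrounding quotes from $3
        if (length($3) >= 10)
            print $3
    }' file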