Search code examples
regexcsvawksedgawk

Extract columns from a CSV file using Linux shell commands


I need to "extract" certain columns from a CSV file. The list of columns to extract is long and their indices do not follow a regular pattern. So far I've come up with a regular expression for a comma-separated value but I find it frustrating that in the RHS side of sed's substitute command I cannot reference more than 9 saved strings. Any ideas around this?

Note that comma-separated values that contain a comma must be quoted so that the comma is not mistaken for a field delimiter. I'd appreciate a solution that can handle such values properly. Also, you can assume that no value contains a new line character.


Solution

  • With GNU awk:

    $ cat file
    a,"b,c",d,e
    
    $ awk -vFPAT='([^,]*)|("[^"]+")' '{print $2}' file
    "b,c"
    $ awk -vFPAT='([^,]*)|("[^"]+")' '{print $3}' file
    d
    
    $ cat file
    a,"b,c",d,e,"f,g,h",i,j
    
    $ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, -vcols="1,5,7,2" 'BEGIN{n=split(cols,a,/,/)} {for (i=1;i<=n;i++) printf "%s%s", $(a[i]), (i<n?OFS:ORS)}' file
    a,"f,g,h",j,"b,c"
    

    See http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content for details. I doubt if it'd handle escaped double quotes embedded in a field, e.g. a,"b""c",d or a,"b\"c",d.

    See also What's the most robust way to efficiently parse CSV using awk? for how to parse CSVs with awk in general.