I need to "extract" certain columns from a CSV file. The list of columns to extract is long and their indices do not follow a regular pattern. So far I've come up with a regular expression for a comma-separated value but I find it frustrating that in the RHS side of sed's substitute command I cannot reference more than 9 saved strings. Any ideas around this?
Note that comma-separated values that contain a comma must be quoted so that the comma is not mistaken for a field delimiter. I'd appreciate a solution that can handle such values properly. Also, you can assume that no value contains a new line character.
With GNU awk:
$ cat file
a,"b,c",d,e
$ awk -vFPAT='([^,]*)|("[^"]+")' '{print $2}' file
"b,c"
$ awk -vFPAT='([^,]*)|("[^"]+")' '{print $3}' file
d
$ cat file
a,"b,c",d,e,"f,g,h",i,j
$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, -vcols="1,5,7,2" 'BEGIN{n=split(cols,a,/,/)} {for (i=1;i<=n;i++) printf "%s%s", $(a[i]), (i<n?OFS:ORS)}' file
a,"f,g,h",j,"b,c"
See http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content for details. I doubt if it'd handle escaped double quotes embedded in a field, e.g. a,"b""c",d
or a,"b\"c",d
.
See also What's the most robust way to efficiently parse CSV using awk? for how to parse CSVs with awk in general.