bash, shell, csv, multiple-columns, counting

How can I determine the number of fields in a CSV, from the shell?


I have a well-formed CSV file, which may or may not have a header line and may or may not have quoted data. I want to determine the number of columns in it, using the shell.

Now, if I can be sure there are no quoted commas in the file, the following seems to work:

x=$(tail -1 00-45-19-tester-trace.csv | grep -o , | wc -l); echo $((x + 1))

but what if I can't make that assumption? That is, what if I can't assume a comma is always a field separator? How do I do it then?

If it helps, you're allowed to assume there are no quoted quotes (i.e. \"s within quoted strings), but it's better not to rely on that either.


Solution

  • If you cannot make any optimistic assumptions about the data, then there won't be a simple solution in Bash. It's not trivial to parse a general CSV format with possible embedded newlines and embedded separators. You're better off not writing that in Bash, but using an existing, proper CSV parser. For example, Python has one built into its standard library (the csv module).

    If you can assume that there are no embedded newlines and no embedded separators, then it's simple to split on commas using awk:

    awk -F, '{ print NF; exit }' input.csv
    

    -F, tells awk to use comma as the field separator, and the automatic NF variable is the number of fields on the current line.

    If you want to allow embedded separators, but you can assume no embedded double quotes, then you can eliminate the embedded separators with a simple filter, before piping to the same awk as earlier:

    head -n 1 input.csv | sed -e 's/"[^"]*"//g' | awk ...
    

    Note that both of these examples use the first line to decide the number of fields. If the input has a header line, this should work quite well, as the header should not contain embedded newlines.
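
The approaches above can be sketched as two shell functions. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names count_fields_simple and count_fields_csv are made up for illustration, and the second one assumes python3 is available on the PATH. The first reads only the first line and assumes no embedded newlines and no quotes inside quoted fields. The second delegates to Python's csv module, which with its default settings handles quoted commas, doubled quotes ("") and newlines inside quoted fields.

```shell
#!/bin/sh

# Hypothetical helper: count the fields on the first line of a CSV file.
# Assumes no embedded newlines and no quote characters inside quoted fields.
count_fields_simple() {
    # 1. take the first line only
    # 2. delete every "..." string, together with any commas inside it
    # 3. count the comma-separated fields that remain
    head -n 1 "$1" |
        sed -e 's/"[^"]*"//g' |
        awk -F, '{ print NF; exit }'
}

# Hypothetical helper: let a real CSV parser read the first record.
# Assumes python3 is installed. Uses the csv module's default dialect, so
# doubled quotes ("") are understood; backslash-escaped quotes (\") would
# need extra reader options.
count_fields_csv() {
    python3 -c '
import csv, sys
with open(sys.argv[1], newline="") as f:
    print(len(next(csv.reader(f))))
' "$1"
}
```

For a file whose first line is id,"last, first",city both functions print 3, whereas a plain awk -F, on the same line would count 4 fields.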