Tags: bash, csv, awk, zsh

Awk: syntax for looping over associative array


I'm trying to parse a comma-separated (CSV) file. Imagine a list of books and their authors, where the second column is the author. Since the file has millions of rows (around 133M), I cannot do this with Python or Java (I mean, I can, but it takes way too long), so I decided to use bash or zsh (which is the shell installed on the Mac).

I need to count how many books each author has, which means, counting the occurrences of each unique value on the second column.

#!/bin/zsh

awk -v FS="," 'NR > 1 {        
  author_id = $2;              
  count[$author_id]++;         
} 
END {                            
  PROCINFO["sorted_in_place"] = 1;  
  for (author_id, count) in count; {
    printf "%d, %d\n", author_id, count;  
  }
}' "~/list_of_books_per_author.csv" | sort -nrk2,2 | head -n 10

I keep getting this error:

awk: syntax error at source line 7
 context is
      for >>>  (author_id, <<<
awk: illegal statement at source line 7
awk: illegal statement at source line 7

I'm not really aware of what I'm doing wrong now. How do you iterate an associative array, when you want both values, key and value?


Solution

  • Awk is its own language, distinct from both Bash and Zsh (which are themselves two distinct, incompatible languages, though related). However, if you are learning to use the shell, you will probably also want to learn at least the basics of Awk (and sed), too.

    The syntax of your loop over the associative array is wrong. Awk's for-in loop iterates over the keys only; to get the value, you index the array with the key. You want

      for (author_id in count) {
        printf "%d, %d\n", author_id, count[author_id];
      }
    

    Note also how there should be no semicolon before the opening brace.
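
    To make the key/value pattern concrete, here is a minimal sketch on made-up inline data (the book and author IDs are invented for illustration). The loop variable receives each key, and count[id] fetches the matching value. The order in which for (id in count) visits the keys is unspecified, so the sketch leaves ordering to sort; as far as I know, PROCINFO["sorted_in_place"] has no special meaning in any Awk (GNU Awk has PROCINFO["sorted_in"], but the awk shipped with macOS does not).

    ```shell
    # Hypothetical sample input: a header row, then book_id,author_id pairs.
    result=$(printf 'book_id,author_id\n101,7\n102,7\n103,9\n' |
      awk -F, 'NR > 1 { count[$2]++ }      # tally column 2, skipping the header
               END {
                 for (id in count)         # id is the key; count[id] is the value
                   printf "%d, %d\n", id, count[id]
               }' |
      sort -t, -k2,2nr)                    # order by the count, largest first
    echo "$result"
    ```

    Here the author ID is printed first and the count second, so the second comma-separated field is the one to sort on.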

    I'm guessing you also meant

      count[author_id]++;
    

    earlier in the script. In Awk, $ is the field-access operator, so $author_id would use the integer in author_id as the index into the fields; if author_id is 3, you were doing count[$3]++.
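
    If the difference between a variable and a field reference is not obvious, this small sketch (with a made-up three-word input line) shows it: n is the value, while $n is field number n.

    ```shell
    # n holds the number 3; $n means "field number 3" of the current line.
    out=$(echo 'a b c' | awk '{ n = 3; print n, $n }')
    echo "$out"   # prints: 3 c
    ```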

    Depending on how complex your CSV file is, Awk may or may not be adequate. It copes well with completely trivial CSV files, but is less ideal for complex ones with quoted literal commas and/or quoted literal newlines in the CSV data.
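
    As an illustration of where plain comma splitting goes wrong, consider this sketch. The sample line is invented; it has three logical columns, one of which contains a quoted comma.

    ```shell
    # Three logical columns, but the quoted author name contains a comma.
    line='1,"Doe, Jane",Some Title'
    nf=$(printf '%s\n' "$line" | awk -F, '{ print NF }')
    echo "$nf"   # prints: 4, because the split ignores the quotes
    ```

    If your file has fields like this, $2 would no longer be the whole author column, and the counts would be wrong.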

    You are probably doing something wrong if you think Awk is going to be much faster than compiled Java. Python is probably a bit slower, but none of them should be catastrophically slow when it comes to reading one line at a time and spitting out a result. (But if you mean it would take longer for you to write a quick program for this in Java, that's probably true.)

    As a further aside, you want ~/"list_of_books_per_author.csv"; at least in Bash, the tilde will not be expanded if it is in double quotes.
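
    You can check the quoting behavior without touching any file. This is just a sketch that assigns both spellings to variables (the variable names are mine) and prints them:

    ```shell
    # Inside double quotes the tilde stays a literal "~" character.
    quoted="~/list_of_books_per_author.csv"
    # With the tilde outside the quotes, the shell replaces it with your home directory.
    unquoted=~/"list_of_books_per_author.csv"
    printf '%s\n' "$quoted" "$unquoted"
    ```

    The first line of output still starts with a literal ~, which is not a path awk can open; the second starts with your home directory.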