awk, parallel-processing, gnu-parallel

Boost awk performance using GNU parallel


I have several subdirectories containing .csv.gz files. Using awk, I filter the rows on the values in columns 1 and 2 and write the result to a single .csv.gz file:

 pigz -rdc /path/to/dir/ | awk -F, '{ if(($1>100) && ($2>100)) {print} }' | pigz > output.csv.gz

Thanks to pigz, both ends of the bash pipeline benefit from parallel processing. How can I use GNU parallel to run the awk stage in parallel as well?


Solution

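 # Decompress one file and keep rows where columns 1 and 2 both exceed 100.
 doit() {
   pigz -dc "$1" | awk -F, '{ if(($1>100) && ($2>100)) {print} }'
 }
 # Make the function visible to the shells that GNU parallel starts (bash only).
 export -f doit

 # Run one doit job per .gz file, then compress the combined output.
 find /path/to/dir -name '*.gz' | parallel doit | pigz > output.csv.gz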
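
To check that the per-file approach above returns the same rows as the original single pipeline, something like the following sketch may help. It builds two tiny throwaway .csv.gz files and compares the sorted output of both variants. Everything here is an assumption made for illustration: the temporary directory, the sample rows, the function names filter_file and run_jobs, and the use of gzip instead of pigz so the sketch has fewer dependencies. If GNU parallel is not on the PATH, the sketch falls back to xargs -P, which is a different tool and does not group output per job the way GNU parallel does.

```shell
#!/usr/bin/env bash
# Sketch only: compares a serial pipeline with a one-job-per-file pipeline.
# Assumes bash, gzip, awk, find, xargs and sort; sample data is made up.
set -euo pipefail

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir -p "$workdir/a" "$workdir/b"

# Two small gzipped CSV files in separate subdirectories (hypothetical data).
printf '150,200,x\n50,300,y\n' | gzip > "$workdir/a/one.csv.gz"
printf '101,101,z\n500,7,w\n'  | gzip > "$workdir/b/two.csv.gz"

# Decompress one file and keep rows where columns 1 and 2 both exceed 100.
filter_file() {
  gzip -dc "$1" | awk -F, '$1 > 100 && $2 > 100'
}
export -f filter_file

# Use GNU parallel when available; otherwise fall back to xargs -P
# (a swapped-in substitute, not GNU parallel).
version_text=$(parallel --version 2>/dev/null || true)
case $version_text in
  *"GNU parallel"*)
    run_jobs() { parallel -0 filter_file; } ;;
  *)
    run_jobs() { xargs -0 -n 1 -P 4 bash -c 'filter_file "$1"' _; } ;;
esac

# Serial reference: decompress everything, then filter once.
serial=$(find "$workdir" -name '*.csv.gz' -print0 \
  | xargs -0 gzip -dc | awk -F, '$1 > 100 && $2 > 100' | sort)

# One job per file; sort because job completion order is not fixed.
parallel_out=$(find "$workdir" -name '*.csv.gz' -print0 | run_jobs | sort)

if [ "$serial" = "$parallel_out" ]; then
  echo "outputs match"
else
  echo "outputs differ" >&2
  exit 1
fi
```

The comparison sorts both results because jobs can finish in any order. If the order of rows in output.csv.gz matters, GNU parallel's -k (--keep-order) option emits each job's output in the order of its input.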