Tags: bash, awk, gnu-parallel

Best expression using parallel to sum the second column of a large text file, grouped by the first column


I have a large two-column text file, and I am trying to print each unique value of the first column together with the sum of its second-column values.

cat src   
a 1
b 1
c 1
d 1
a 1
b 2
c 3
d 4

With basic awk I am able to achieve the desired output.

awk -F" " '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' src
a 2
b 3
c 4
d 5

The issue at hand is that the process runs for a long time when given a large input file. So I attempted the same with gnu-parallel and got stuck there.

cat src | parallel --pipe awk -F" " '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}'

Any guidance on this would be much appreciated.


Solution

  • I found GNU datamash to be the fastest standalone (single-process) tool in this case.

    The test file (https://transfer.sh/hL5xL/file) has ~12M lines and is 116 MB in size.

    Here are extended time performance statistics:

    $ du -sh inputfile 
    116M    inputfile
    
    $ wc -l inputfile 
    12520872 inputfile
    
    $ time datamash -W -g1 sum 2 <inputfile > /dev/null
    real    0m10.990s
    user    0m10.388s
    sys 0m0.216s
    
    $ time awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' inputfile > /dev/null
    real    0m12.361s
    user    0m11.664s
    sys 0m0.196s
    
    $ time parallel -a inputfile --pipepart --block=11M -q awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' \
    | awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' >/dev/null
    
    real    0m8.660s
    user    0m12.424s
    sys 0m2.760s
    

    For a parallel approach, use a combination of parallel + awk, as shown below.
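
    This is the same two-stage pipeline as in the timing run above, spelled out without the time wrapper; the -q flag makes parallel quote the awk program so the shell it spawns does not re-interpret it:

    parallel -a inputfile --pipepart --block=11M -q \
        awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' \
        | awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }'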

    With the most recent datamash version you may try:

    parallel -a inputfile --pipepart --block=11M datamash -sW -g1 sum 2 | datamash -sW -g1 sum 2
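
    Note that -s matters in both stages: datamash expects its input grouped by the key column, so -s sorts each block before grouping, and the final datamash then merges the per-block partial sums. As a sanity check on the small src file from the question (datamash prints tab-separated output), this should reproduce the expected sums:

    $ parallel -a src --pipepart --block=11M datamash -sW -g1 sum 2 | datamash -sW -g1 sum 2
    a   2
    b   3
    c   4
    d   5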
    

    As you can see, GNU parallel was used in the last (and fastest) approach, which combines two awk commands: one aggregates intermediate results per block and another aggregates the final result, as illustrated below.
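
    To illustrate with the sample src from the question: if --pipepart happened to split it into two blocks of four lines each (a hypothetical split), each intermediate awk would emit per-block partial sums and the final awk would add them up:

    block 1 (a 1, b 1, c 1, d 1)  ->  a 1  b 1  c 1  d 1
    block 2 (a 1, b 2, c 3, d 4)  ->  a 1  b 2  c 3  d 4
    final awk merges the partials ->  a 2  b 3  c 4  d 5

    The crucial GNU parallel options here are: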

    --pipepart
    Pipe parts of a physical file. --pipepart works similar to --pipe, but is much faster.

    --block-size size
    Size of block in bytes to read at a time.

    In my test case I specified --block=11M, roughly 10% of the input file size. In your case you may adjust it, for example to --block=100M.
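
    Rather than hard-coding the block size, you could derive it from the file size. A minimal sketch, assuming GNU stat (the ~10% figure is just the heuristic used above, not a tuned value):

    # use ~10% of the input size (in bytes) as the block size
    size=$(stat -c %s inputfile)
    parallel -a inputfile --pipepart --block=$(( size / 10 )) -q \
        awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' \
        | awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }'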