Tags: bash, awk, gnu-parallel

Best expression using parallel to sum the second column of a large text file, grouped by the first column


I have a large two-column text file, and I am trying to print each unique value of the first column together with the sum of its second-column values.

cat src   
a 1
b 1
c 1
d 1
a 1
b 2
c 3
d 4

With basic awk I am able to achieve the desired output.

awk -F" " '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' src
a 2
b 3
c 4
d 5

The issue at hand is that the process runs for a long time when given a large input file. So I attempted the same with gnu-parallel and got stuck there.

cat src | parallel --pipe awk -F" " '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}'

Any guidance on this would be much appreciated.


Solution

  • I found GNU datamash to be the fastest standalone (single-process) tool in this case.

    The test file (https://transfer.sh/hL5xL/file) has ~12M lines and is 116 MB in size.

    Here are extended time performance statistics:

    $ du -sh inputfile 
    116M    inputfile
    
    $ wc -l inputfile 
    12520872 inputfile
    
    $ time datamash -W -g1 sum 2 <inputfile > /dev/null
    real    0m10.990s
    user    0m10.388s
    sys 0m0.216s
    
    $ time awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' inputfile > /dev/null
    real    0m12.361s
    user    0m11.664s
    sys 0m0.196s
    
    $ time parallel -a inputfile --pipepart --block=11M -q awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' \
    | awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' >/dev/null
    
    real    0m8.660s
    user    0m12.424s
    sys 0m2.760s
    

    For a parallel approach, use a combination of parallel + awk, as shown below.
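
    This is the same two-stage pipeline as in the timing run above, spelled out without the time wrapper; the -q flag makes parallel quote the awk program so the shell it spawns does not re-interpret it:

    parallel -a inputfile --pipepart --block=11M -q \
        awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' \
        | awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }'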

    With the most recent datamash version you may try:

    parallel -a inputfile --pipepart --block=11M datamash -sW -g1 sum 2 | datamash -sW -g1 sum 2
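
    Note that -s matters in both stages: datamash expects its input grouped by the key column, so -s sorts each block before grouping, and the final datamash then merges the per-block partial sums. As a sanity check on the small src file from the question (datamash prints tab-separated output), this should reproduce the expected sums:

    $ parallel -a src --pipepart --block=11M datamash -sW -g1 sum 2 | datamash -sW -g1 sum 2
    a   2
    b   3
    c   4
    d   5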
    

    As you can see, GNU parallel was used in the last (and fastest) approach, which combines two awk commands: one aggregates intermediate results per block and another aggregates the final result, as illustrated below.
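
    To illustrate with the sample src from the question: if --pipepart happened to split it into two blocks of four lines each (a hypothetical split), each intermediate awk would emit per-block partial sums and the final awk would add them up:

    block 1 (a 1, b 1, c 1, d 1)  ->  a 1  b 1  c 1  d 1
    block 2 (a 1, b 2, c 3, d 4)  ->  a 1  b 2  c 3  d 4
    final awk merges the partials ->  a 2  b 3  c 4  d 5

    The crucial GNU parallel options here are: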

    --pipepart
    Pipe parts of a physical file. --pipepart works similar to --pipe, but is much faster.

    --block-size size
    Size of block in bytes to read at a time.

    In my test case I specified --block=11M, roughly 10% of the input file size. In your case you may adjust it, for example to --block=100M.
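
    Rather than hard-coding the block size, you could derive it from the file size. A minimal sketch, assuming GNU stat (the ~10% figure is just the heuristic used above, not a tuned value):

    # use ~10% of the input size (in bytes) as the block size
    size=$(stat -c %s inputfile)
    parallel -a inputfile --pipepart --block=$(( size / 10 )) -q \
        awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' \
        | awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }'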