Currently I have a large two-column text file. I am trying to print each unique value of the first column together with the sum of its corresponding second-column values.
cat src
a 1
b 1
c 1
d 1
a 1
b 2
c 3
d 4
With basic awk I am able to achieve the desired output.
awk -F" " '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' src
a 2
b 3
c 4
d 5
The issue at hand is that the process runs for a very long time when we run the same against a large input file. So I attempted to run the same with GNU parallel and got stuck there.
cat src | parallel --pipe awk -F" " '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}'
Any guidance on this would be much appreciated.
I found GNU datamash to be the fastest tool for a standalone run in such a case.
The test file (https://transfer.sh/hL5xL/file) has ~12M lines and is 116 MB in size.
Here are the extended time performance statistics:
$ du -sh inputfile
116M inputfile
$ wc -l inputfile
12520872 inputfile
$ time datamash -W -g1 sum 2 <inputfile > /dev/null
real 0m10.990s
user 0m10.388s
sys 0m0.216s
$ time awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' inputfile > /dev/null
real 0m12.361s
user 0m11.664s
sys 0m0.196s
$ time parallel -a inputfile --pipepart --block=11M -q awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' \
| awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' >/dev/null
real 0m8.660s
user 0m12.424s
sys 0m2.760s
For the parallel approach, use a combination of parallel + awk.
For the most recent datamash version you may try:
parallel -a inputfile --pipepart --block=11M datamash -sW -g1 sum 2 | datamash -sW -g1 sum 2
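A few notes on the flags used here: -s sorts the input before grouping, -W treats runs of whitespace as the field separator, -g1 groups by the first column, and sum 2 sums the second column. As a quick sanity check on the small src sample from the question, the standalone equivalent would be:
datamash -sW -g1 sum 2 < src
which should print the same per-key sums as the awk command (a 2, b 3, c 4, d 5), tab-separated by default.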
As the timings above show, GNU parallel was used for the last approach, combined with 2 awk commands (one for aggregating intermediate results and another one for aggregating the final results).
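To see why the second awk is needed: each parallel job only sees one block of the file, so it emits partial, per-block sums. Assuming, purely for illustration, that the small src sample from the question were split into two blocks of four lines each, the intermediate stream fed into the outer awk would look roughly like this:
a 1
b 1
c 1
d 1
a 1
b 2
c 3
d 4
The outer awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' then merges the repeated keys into the final totals (a 2, b 3, c 4, d 5).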
The crucial GNU parallel options here are:
--pipepart
Pipe parts of a physical file. --pipepart works similarly to --pipe, but is much faster.
--block-size size
Size of block in bytes to read at a time.
In my test case I specified --block=11M, which is ~10% of the main file size. In your case you may adjust it to --block=100M.
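For example, a minimal sketch of the same awk-based pipeline with the larger block size, assuming your input is named inputfile and you want the result in a file named output (both names are placeholders):
parallel -a inputfile --pipepart --block=100M -q awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' \
| awk '{ a[$1] += $2 }END{ for(i in a) print i, a[i] }' > output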