I need to sort a really huge file, several hundred GB. Luckily I have access to a Linux MPI cluster. Does anybody know a good, but most importantly working, sort program that can run in a distributed environment using MPI? Actually, I want to count the unique lines in that file, so if somebody knows a program that does exactly that, even better. Otherwise I can figure out how to do it myself later.
Because no answer was provided, I thought I would just share my results.
I downloaded the nsort program from ordinal.com (2004 winner of the annual sortbenchmark.org sorting competition). It sorts amazingly fast, though not in a distributed (cluster) manner. I don't remember the exact figures anymore, but I got a huge time improvement using nsort. I'm talking about tens of times faster (maybe around 50x) than the default Linux sort.
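Once the file is sorted, the counting part of the original question needs only a single pass, since identical lines end up next to each other. Here is a minimal sketch in Python, assuming the input is already sorted (by nsort or any other sorter). The function name `count_unique_sorted` is made up for illustration and is not part of nsort.

```python
from typing import Iterable


def count_unique_sorted(lines: Iterable[str]) -> int:
    """Count distinct lines in an already-sorted stream.

    Only the previous line is kept in memory, so this works for
    inputs far larger than RAM, as long as they are sorted.
    """
    count = 0
    previous = None
    for line in lines:
        # Strip the newline so a final line without one still compares equal.
        line = line.rstrip("\n")
        if line != previous:
            count += 1
            previous = line
    return count


# Small in-memory example standing in for a sorted file.
sample = ["a\n", "a\n", "b\n", "c\n", "c\n"]
print(count_unique_sorted(sample))  # → 3
```

For a real file you would pass an open file object instead of the list, because iterating over it yields one line at a time without loading the whole file.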
Two more things to note.