Tags: exception, sorting, large-files, stack-dump

Sorting a file with 1.8 million records using a script


I am trying to remove duplicate lines from a file with 1.8 million records and write the result to a new file, using the following command:

sort tmp1.csv | uniq -c | sort -nr > tmp2.csv

Running the script creates a new file sort.exe.stackdump with the following information:

"Exception: STATUS_ACCESS_VIOLATION at rip=00180144805
..
..
program=C:\cygwin64\bin\sort.exe, pid 6136, thread main
cs=0033 ds=002B es=002B fs=0053 gs=002B ss=002B"

The script works for a small file with 10 lines, so it seems that sort.exe cannot handle this many records. How do I work with a file of more than 1.8 million records? We do not have any database other than Microsoft Access, and I was trying to do this manually in Access.


Solution

  • The following awk command turned out to be a much faster way to get rid of the duplicate lines:

    awk '!v[$0]++' "$FILE2" > tmp.csv

    where $FILE2 is the name of the file containing the duplicate lines.
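
    For readers unfamiliar with the idiom, here is a minimal sketch of how it behaves (sample.csv and its contents are made up for illustration). awk keeps a counter v[$0] for each whole line it has seen; !v[$0]++ is true only on a line's first occurrence, so each distinct line is printed exactly once:

        # build a small sample file with one repeated line
        printf 'a,1\nb,2\na,1\nc,3\n' > sample.csv

        # print each distinct line once, in original order
        awk '!v[$0]++' sample.csv
        # prints:
        # a,1
        # b,2
        # c,3

    Unlike the sort | uniq pipeline, this makes a single pass with no sorting step and preserves the original line order, at the cost of holding one hash entry per distinct line in memory.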