Tags: bash, sorting, duplicates, large-files

Deduplicating lines in a large file fails with sort and uniq


I have a large file consisting of one JSON object per line, 1,563,888 lines in total. To deduplicate the lines in this file I have been using the shell one-liner sort myfile.json | uniq -u.

For smaller files this approach worked: sort myfile.json | uniq -u | wc -l reported a count greater than 0. At the current file size, sort myfile.json | uniq -u | wc -l produces 0, while head -n500000 myfile.json | sort | uniq -u | wc -l still gives a non-zero count.

Is there an easy way for bash to handle such large files? Or is there a clean way to chunk the file up? I initially used bash instead of Python because it seemed like an easier way to verify things quickly, though now I am thinking about moving this task back to Python.
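
For reference, here is the same pipeline run on a tiny made-up sample (sample.json is only an illustration, not my real data):

    # tiny made-up sample: one line duplicated, one line unique
    printf '{"a":1}\n{"a":1}\n{"b":2}\n' > sample.json

    # the pipeline I run on the real file, applied to the sample
    sort sample.json | uniq -u | wc -l    # prints 1 on this sample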


Solution

  • As per Kamil Cuk, let's try this solution:

    sort -u myfile.json 
    

    Is the file really JSON? Sorting a JSON file can lead to dubious results. You may also try splitting the file, as suggested by Mark Setchell: sort each split file, then sort the combined results, doing every sort with sort -u (a sketch follows below).

    Please provide a sample from myfile.json if it is indeed a JSON file. Let us hear about your results when you just use sort -u.
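
    A minimal sketch of that split-and-sort approach, assuming GNU coreutils; the chunk_ prefix, the 500000-line chunk size, and the myfile.dedup.json output name are arbitrary choices for illustration:

    # split the file into pieces of 500000 lines each (chunk_aa, chunk_ab, ...)
    split -l 500000 myfile.json chunk_

    # sort each piece, dropping duplicates within the piece
    for f in chunk_*; do
        sort -u "$f" -o "$f.sorted"
    done

    # merge the already-sorted pieces, again dropping duplicates across pieces
    sort -mu chunk_*.sorted > myfile.dedup.json

    # remove the intermediate chunk files
    rm chunk_*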