unix sorting count lines

Compress EACH LINE of a file individually and independently of one another? (or preserve newlines)


I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'd like to use sort FILE | uniq -c | sort to see how often each line is repeated. However, the 10 GB file is too large to sort and my computer runs out of memory.

Is there a way to compress the file while preserving newlines (or an entirely different method altogether) that would reduce the file to a small enough size to sort, yet still leave it in a sortable condition?

Or is there any other method of counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?

Thanks for any help!


Solution

  • There are some possible solutions:

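    1 - Use any text-processing language (perl, awk) to read the file line by line, save a hash of each line together with its line number, and then compare the hashes instead of the full lines. Identical lines always produce identical hashes, so equal hashes mark the candidate duplicates.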

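    2 - If you can (or want to) remove the duplicate lines, leaving just one occurrence of each line in the file, you could use a command like: awk '!x[$0]++' oldfile > newfile. Note that this removes duplicates rather than counting them, and it keeps every distinct line in memory.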

    3 - Why not split the file according to some criterion? Supposing all your lines begin with letters:
        - break original_file into roughly 20 smaller files, one per initial letter: grep "^a" original_file > a_file
        - sort each small file: a_file, b_file, and so on
        - check for duplicates, count them, or do whatever else you need.
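
The splitting idea in option 3 can be sketched as a small shell function. This is only an illustration under some assumptions: the function name count_lines_bucketed and the bucket_* file names are made up for this example, mktemp -d is assumed to be available, and lines that do not start with a letter or digit are all routed to a single "other" bucket. It makes one awk pass over the file instead of one grep per letter. Because identical lines always start with the same character, they always end up in the same bucket, so the per-bucket counts are also the counts for the whole file.

```shell
#!/bin/sh
# Sketch of option 3: bucket lines by their first character, then sort and
# count each bucket separately instead of sorting the whole file at once.
# count_lines_bucketed and the bucket_* names are hypothetical.

count_lines_bucketed() {
    input=$1
    workdir=$(mktemp -d) || return 1

    # Single pass: append each line to a bucket file chosen by its first
    # character. Lower-casing the key avoids name clashes on
    # case-insensitive filesystems; anything that is not a letter or digit
    # (including empty lines) goes to the "other" bucket.
    LC_ALL=C awk -v dir="$workdir" '
        {
            key = tolower(substr($0, 1, 1))
            if (key !~ /^[a-z0-9]$/) key = "other"
            print > (dir "/bucket_" key)
        }
    ' "$input" || { rm -rf "$workdir"; return 1; }

    # Sort and count one bucket at a time. uniq -c prefixes each distinct
    # line with the number of times it occurs.
    for bucket in "$workdir"/bucket_*; do
        [ -f "$bucket" ] || continue   # no buckets exist for an empty input
        LC_ALL=C sort "$bucket" | uniq -c
    done

    rm -rf "$workdir"
}

# Example usage (most frequent lines first):
#   count_lines_bucketed original_file | sort -rn | head
```

How much this helps depends on the data: if most lines begin with the same character, one bucket will be nearly as large as the original file, and a different splitting key (for example the first two characters) would be needed.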