Tags: bash, shell, uniq, gnu-toolchain, linux-toolchain

Finding a uniq -c substitute for big files


I have a large file (50 GB) and I would like to count the number of occurrences of different lines in it. Normally I'd use

sort bigfile | uniq -c

but the file is large enough that sorting takes a prohibitive amount of time and memory. I could do

grep -cFx 'one possible line' bigfile

for each unique line in the file, but this means one full pass over the file per possible line, which (although much more memory-friendly) takes even longer than the original.
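
For reference, that per-line approach would look something like the loop below (a sketch; lines.txt is a hypothetical file holding the known lines, one per line):

# One full scan of bigfile per known line: correct, but one pass per line.
while IFS= read -r line; do
  printf '%s\t' "$line"
  grep -cFx "$line" bigfile
done < lines.txt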

Any ideas?


A related question asks about a way to find unique lines in a big file, but I'm looking for a way to count the number of instances of each -- I already know what the possible lines are.


Solution

  • Use awk

    awk '{c[$0]++} END {for (line in c) print c[line], line}' bigfile
    

    This makes a single pass over the file: O(n) in time and O(number of unique lines) in space, since only the count table is held in memory.
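
    Since the possible lines are already known, a variant of the same idea can pre-seed the table from that list and count only matching lines, so memory stays bounded by the known set rather than by whatever the file contains. A minimal sketch, assuming the known lines are stored one per line in known_lines.txt (a hypothetical name):

    # While reading the first file (NR==FNR), seed the count table with zeros.
    # While reading bigfile, increment only lines already in the table.
    awk 'NR==FNR {c[$0]=0; next}
         $0 in c {c[$0]++}
         END {for (line in c) print c[line], line}' known_lines.txt bigfile

    Lines outside the known set are simply ignored, so unexpected input cannot grow the table.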