c++, time-complexity, space-complexity, memory-limit

Building a set of unique lines in a file efficiently, without storing actual lines in the set


Recently I was trying to solve the following issue:

I have a very large file, containing long lines, and I need to find and print out all the unique lines in it.

I don't want to use a map or set storing the actual lines, as the file is very big and the lines are long, so this would result in O(N) space complexity with poor constants (where N is the number of lines). I would rather build a set storing pointers to the lines in the file that are unique. Clearly, the size of such a pointer (8 bytes on a 64-bit machine, I believe) is generally much smaller than the size of a line (1 byte per character, I believe) in memory. Although the space complexity is still O(N), the constants are much better. With this implementation, the file never needs to be fully loaded into memory.

Now, let's say I go through the file line by line, checking each line for uniqueness. To see whether a line is already in the set, I could compare it to all lines pointed to by the set so far, character by character. This gives O(N^2*L) complexity, with L the average length of a line. If I didn't mind storing the full lines in the set, O(N*L) complexity could be achieved thanks to hashing. Now, when using a set of pointers instead (to reduce space requirements), how can I still achieve this? Is there a neat way to do it? The only thing I can come up with is this approach:

  1. Hash the sentence. Store the hash value in a map (or rather: an unordered_multimap; unordered to get the hashmap style, multi because duplicate keys may be inserted in case of 'false matches').
  2. For every new sentence: check whether its hash value is already in the map. If not, add it. If it is, compare the full sentences (the new one and the one in the unordered map with the same hash) character by character, to make sure it is not a 'false match'. If it turns out to be a 'false match', still add it. (A rough sketch of this approach follows below.)
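A rough, untested sketch of these two steps, assuming the input is a regular, seekable file (the name input.txt and the helper read_line_at are placeholders, not from the original post). Only a hash and a byte offset are stored per kept line; a line is re-read from disk only when its hash collides with a stored one:

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Re-read the line that starts at byte offset 'off' (placeholder helper).
    static std::string read_line_at(std::ifstream& file, std::streamoff off) {
        file.clear();                 // clear eof/fail flags before seeking
        file.seekg(off);
        std::string line;
        std::getline(file, line);
        return line;
    }

    int main() {
        std::ifstream scan("input.txt");    // stream for the sequential pass
        std::ifstream lookup("input.txt");  // second stream for re-reading candidates

        // line hash -> byte offset of a stored line with that hash
        // (multimap, because different lines can collide on the same hash)
        std::unordered_multimap<std::size_t, std::streamoff> seen;
        std::hash<std::string> hasher;

        std::string line;
        std::streamoff off = scan.tellg();  // offset of the line about to be read
        while (std::getline(scan, line)) {
            const std::size_t h = hasher(line);

            bool duplicate = false;
            auto range = seen.equal_range(h);
            for (auto it = range.first; it != range.second; ++it) {
                // Same hash: verify character by character to rule out a 'false match'.
                if (read_line_at(lookup, it->second) == line) {
                    duplicate = true;
                    break;
                }
            }

            if (!duplicate) {
                std::cout << line << '\n';  // first occurrence: print it
                seen.emplace(h, off);       // store only the hash and the offset
            }
            off = scan.tellg();             // offset of the next line
        }
    }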

Is this the right way? Or is there a nicer way you could do it? All suggestions are welcome!

And can I use some clever 'comparison object' (or something like that, I don't know much about that yet) to make this checking for already existing sentences fully automated on every unordered_map::find() call?
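For reference, here is a rough, untested sketch of what such a 'comparison object' could look like: an unordered_set keyed on byte offsets, with custom hash and equality functors that re-read the corresponding lines from the file, so that insert()/find() perform the full comparison automatically. Again, input.txt and all identifiers are placeholders, not code from the original post:

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_set>

    // Re-read the line that starts at byte offset 'off' (placeholder helper).
    static std::string read_line_at(std::ifstream& file, std::streamoff off) {
        file.clear();
        file.seekg(off);
        std::string line;
        std::getline(file, line);
        return line;
    }

    // Hash a line through its offset: seek, read, hash the text.
    struct LineHash {
        std::ifstream* file;
        std::size_t operator()(std::streamoff off) const {
            return std::hash<std::string>{}(read_line_at(*file, off));
        }
    };

    // Compare two lines through their offsets: seek, read, compare the text.
    struct LineEqual {
        std::ifstream* file;
        bool operator()(std::streamoff a, std::streamoff b) const {
            return read_line_at(*file, a) == read_line_at(*file, b);
        }
    };

    int main() {
        std::ifstream scan("input.txt");
        std::ifstream lookup("input.txt");  // separate stream, so lookups don't disturb the scan

        std::unordered_set<std::streamoff, LineHash, LineEqual>
            seen(0, LineHash{&lookup}, LineEqual{&lookup});

        std::string line;
        std::streamoff off = scan.tellg();
        while (std::getline(scan, line)) {
            // insert() runs the hash and equality functors, so the check for an
            // already existing line happens automatically inside the set.
            if (seen.insert(off).second)
                std::cout << line << '\n';  // first occurrence
            off = scan.tellg();
        }
    }

The trade-off of this variant is extra I/O: every insert (and every rehash of the set) goes back to the file through the functors instead of hashing the line that was just read.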


Solution

  • Your solution looks fine to me, since you are storing O(unique lines) hashes rather than N of them, and that is a lower bound anyway.

    Since you scan the file line by line you might as well sort the file. Now duplicate lines will be contiguous and you need only check against the hash of the previous line. This approach uses O(1) space but you've got to sort the file first.
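A rough sketch of this sorted-file variant, assuming the lines have already been sorted externally (e.g. with a command-line sort) into a placeholder file sorted.txt. It compares each line directly with the previous one, which for adjacent lines is equivalent to comparing their hashes, and keeps only one previous line in memory, i.e. O(1) space in the number of lines:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream in("sorted.txt");   // placeholder: the file, already sorted
        std::string prev, line;
        bool first = true;
        while (std::getline(in, line)) {
            // In a sorted file duplicates are adjacent, so comparing with the
            // previous line is enough to detect them.
            if (first || line != prev) {
                std::cout << line << '\n';
                prev = line;
                first = false;
            }
        }
    }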