I have a large but finite set of strings, and it is very unlikely that any two of these strings are the same, but that is exactly what I would like to check. All strings are about the same length, +/- 1 character.
Let's assume as an example (but the numbers may change) that I have a set of 30 billion strings, each about 30 characters long. In a naive approach, I would put all of them into a hash set and check for collisions. That would be essentially O(n) runtime.
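For reference, the naive version would look something like this (a minimal sketch; `strings` is a hypothetical iterable over the whole dataset, and it assumes everything fits in memory):

```python
def has_duplicate(strings):
    seen = set()              # hash set of all strings seen so far
    for s in strings:
        if s in seen:         # O(1) expected membership test
            return True       # found two identical strings
        seen.add(s)
    return False              # all strings are distinct
```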
As memory is the limiting factor and there is no way I could stuff all strings into memory, I have to partition the dataset. Let's assume I can stuff 100 million strings in memory and checking another string against these 100 million is basically O(1) runtime.
What would an efficient algorithm (in terms of runtime) look like?
If you have N strings and can keep k of them in memory, you end up with N/k partitions. Each string is hashed (inserted into its partition's hash set) only once, but it has to be checked against the other N/k - 1 partitions, each membership test being O(1). Hence the overall complexity is O(N^2 / k).
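With the question's numbers (N = 30 billion, k = 100 million) that is N/k = 300 partitions and on the order of N^2 / k ≈ 9 × 10^12 membership checks. Below is a sketch of this scheme, assuming the strings sit one per line in a text file; the path, the `K` constant, and the helper names are placeholders for illustration:

```python
from itertools import islice

K = 100_000_000  # k: how many strings fit in memory at once

def read_strings(path):
    """Stream the strings from disk, one per line."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def has_duplicate(path, k=K):
    start = 0
    while True:
        # Load partition [start, start + k) into an in-memory hash set,
        # catching any duplicate that falls inside the partition itself.
        partition = set()
        for s in islice(read_strings(path), start, start + k):
            if s in partition:
                return True
            partition.add(s)
        if not partition:
            return False  # no partitions left, so no duplicate exists
        # Stream every string *after* this partition and test membership
        # in O(1); each pair of strings is compared at most once overall.
        for s in islice(read_strings(path), start + k, None):
            if s in partition:
                return True
        start += k  # move on to the next partition: N/k passes in total
```

Each pass re-reads the file once, so the total work stays O(N^2 / k) string comparisons plus O(N^2 / k) sequential reads, traded against the O(k) memory bound.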