We have a requirement to read/write more than 10 million strings into a file, and we do not want duplicates in the file. Since each string is flushed to the file as soon as it is read, we are not keeping the strings in memory.
We cannot use hashCode() because collisions might cause us to mistakenly treat a distinct string as a duplicate. Two other approaches I found while googling:
1. Use a message digest algorithm like MD5, but it might be too costly to calculate and store.
2. Use a checksum algorithm. [I am not sure whether this produces a unique key for a string; can someone please confirm?]
Is there any other approach available? Thanks.
If you're okay with a microscopic risk of collisions, you could use a hash function such as MD5, as you suggest, and rely on the hashes rather than the full strings.
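A minimal sketch of that idea, using only the JDK; the file names `input.txt` and `output.txt` are placeholders for your actual input and output:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

public class Md5Dedup {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // Only the 16-byte digests are kept in memory, not the strings themselves.
        Set<String> seen = new HashSet<>();

        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             PrintWriter out = new PrintWriter("output.txt")) {
            String line;
            while ((line = in.readLine()) != null) {
                byte[] digest = md5.digest(line.getBytes(StandardCharsets.UTF_8));
                String key = Base64.getEncoder().encodeToString(digest);
                // Write the string only the first time its digest is seen.
                if (seen.add(key)) {
                    out.println(line);
                }
            }
        }
    }
}
```

For 10 million strings this holds roughly 10 million 16-byte digests (plus set overhead) in memory, which is usually far smaller than holding the strings themselves.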
Another alternative, possibly with a larger memory footprint, is to store the already-encountered strings in a trie (a special type of tree), as sketched below.
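A rough sketch of such a trie; the class and method names are just illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie that records each string once and reports whether it was already present.
class StringTrie {
    private static final class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored string ends at this node
    }

    private final Node root = new Node();

    /** Inserts the string; returns true if it had NOT been seen before. */
    boolean add(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
        }
        if (node.terminal) {
            return false; // duplicate
        }
        node.terminal = true;
        return true;
    }
}
```

Usage would mirror the hashing approach: write a string to the output file only when `trie.add(line)` returns true. Shared prefixes are stored once, which can save memory if your strings overlap heavily, but worst-case usage can exceed the digest-set approach.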
Update: Yet another alternative would be to use a Bloom filter. This still relies on hashing, but it can be tuned to have an arbitrarily small probability of false positives.
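A sketch using Guava's `BloomFilter` (an assumed third-party dependency; the sizing numbers below are illustrative, not prescriptive):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomDedup {
    public static void main(String[] args) {
        // Sized for ~10 million strings with a 0.01% false-positive probability.
        BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.0001);

        String candidate = "some string read from the input"; // placeholder for a real read
        if (!seen.mightContain(candidate)) {
            seen.put(candidate);
            // safe to write the string to the output file here
        }
        // Caveat: mightContain can return a false positive, so a genuinely new
        // string may occasionally be dropped (with probability ~0.01% as configured).
    }
}
```

The memory footprint is fixed by the expected-insertions and false-positive parameters rather than by the string lengths, which makes it the most compact of the three options when a tiny error rate is acceptable.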