I have a set of 1,000,000 unique numbers. The numbers are in the interval between 0 and 50,000,000. Assume the numbers are random. I need a data structure that holds them all. The data structure should require as little memory as possible, and it should be possible to test quickly, with no errors, whether a number is in the set.
I found a solution with a Bloom filter. Yes, a Bloom filter has a probability of false positives, but since there are "just" 50,000,000 possible numbers, I can enumerate all the false positives and keep them in a std::set. With this method, I'm able to store all the numbers in 2.3 MB of memory.
Can you find a better method?
Rather than one range of 0 to 50,000,000, how about 1,024 separate ranges of 65,536 each? That would cover a range of 67,108,864. You can make it 763 buckets rather than 1,024, which covers 50,003,968, still more than 50,000,000.
Something like ushort[763][];
Now you're storing 1,000,000 16-bit values rather than 32-bit values.
The values in each row are sorted. So to determine whether a number is in the set, you divide the number by 65,536 to figure out which array to look in, and then do a binary search for number % 65536 in that array.
Storage for the numbers themselves is 2,000,000 bytes, plus a small amount of overhead for the arrays.
This will be faster in execution, smaller than your Bloom filter approach, no false positives, and a whole lot easier to implement.
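A minimal C++ sketch of this bucket layout (the class name `BucketedSet` and the choice of `std::vector` are mine, not part of the answer; `v >> 16` is equivalent to dividing by 65,536):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Bucketed set of 32-bit values in [0, 50,003,968):
// bucket index = value / 65536, stored remainder = value % 65536.
// 763 buckets * 65,536 = 50,003,968 possible values.
class BucketedSet {
public:
    explicit BucketedSet(const std::vector<uint32_t>& values)
        : buckets_(763) {
        for (uint32_t v : values)
            buckets_[v >> 16].push_back(static_cast<uint16_t>(v & 0xFFFF));
        for (auto& b : buckets_)
            std::sort(b.begin(), b.end());  // sorted rows enable binary search
    }

    bool contains(uint32_t v) const {
        const auto& b = buckets_[v >> 16];
        return std::binary_search(b.begin(), b.end(),
                                  static_cast<uint16_t>(v & 0xFFFF));
    }

private:
    std::vector<std::vector<uint16_t>> buckets_;  // 763 rows of 16-bit values
};
```

Each stored element costs exactly 2 bytes, so 1,000,000 values take 2,000,000 bytes plus per-vector overhead.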