I was trying to implement a Bloom filter in Java, but the problem is that the hash functions are returning negative values (on the order of -10^20):
public long APHash(String str)
{
    long hash = 0xAAAAAAAA;
    for (int i = 0; i < str.length(); i++)
    {
        if ((i & 1) == 0)
        {
            hash ^= ((hash << 7) ^ str.charAt(i) * (hash >> 3));
        }
        else
        {
            hash ^= (~((hash << 11) + str.charAt(i) ^ (hash >> 5)));
        }
    }
    return hash;
}
Using a Java Map or Set or similar data structure misses the whole point of a Bloom filter. A Java HashSet/HashMap entry costs at least 20 bytes in a 32-bit JVM and 40 in a 64-bit one, while the whole point of a Bloom filter is memory efficiency: one bit can represent more than one set element. A table that's 50% populated spends roughly 2 bits per element, so you'd be trading away a factor of about 160 in memory efficiency for the 64-bit case (320 bits per entry vs. 2 bits; only 80 for 32 bits). Bad idea.
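The factors quoted above are simple arithmetic. A quick sanity check, using the per-entry sizes from the answer and assuming roughly 2 bits per element for a half-full Bloom filter:

```java
public class OverheadMath {
    public static void main(String[] args) {
        int entryBits32 = 20 * 8;  // >= 20 bytes per HashSet entry, 32-bit JVM
        int entryBits64 = 40 * 8;  // >= 40 bytes per entry, 64-bit JVM
        int bloomBits = 2;         // ~2 bits per element at 50% occupancy

        System.out.println(entryBits64 / bloomBits); // 160
        System.out.println(entryBits32 / bloomBits); // 80
    }
}
```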
You should fix your hash functions. This is just a matter of choosing a prime number N for the Bloom bitmap size and computing the hash values mod N.
If this is the signature of your current hash function:
long APHash(String s) { ... }
Then just use
long getHashInRange(String s) {
    // Java's % keeps the sign of the dividend, so a negative hash would
    // yield a negative index; Math.floorMod always returns a value in [0, N).
    return Math.floorMod(APHash(s), PRIME_BITSET_SIZE);
}
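One caveat worth spelling out (standard Java semantics, not stated in the original answer): the remainder operator `%` takes the sign of the dividend, so if APHash returns a negative long, a plain `% N` gives a negative index, which is exactly the symptom in the question. `Math.floorMod` (Java 8+) avoids this. A standalone demo with made-up numbers:

```java
public class ModDemo {
    public static void main(String[] args) {
        final long PRIME_BITSET_SIZE = 1_000_003L; // hypothetical prime table size
        long h = -123_456_789L;                    // stand-in for a negative hash

        System.out.println(h % PRIME_BITSET_SIZE);               // -456420 (negative!)
        System.out.println(Math.floorMod(h, PRIME_BITSET_SIZE)); // 543583, in [0, N)
    }
}
```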
If PRIME_BITSET_SIZE will fit in an int, then return int instead. An int index will allow for a table size (PRIME_BITSET_SIZE) of up to about 2 billion.
The prime-ness of the modulus makes pathological collision conditions less likely. This table of primes is a good resource for picking sizes.
Don't use an array of byte (or anything larger) to store 1 or 0 in each element. Again, you'd be giving up a factor of 8 in memory use. For table sizes of 2^31-1 bits or less, use instead a BitSet or another container that stores each element in a single bit. If the table is at all large, be sure to create the BitSet with initial size PRIME_BITSET_SIZE rather than letting it grow dynamically. For larger tables - which require long indices - use the Apache OpenBitSet class or similar.
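Putting the pieces together, here is a minimal sketch of the approach described above, backed by java.util.BitSet, pre-sized, with non-negative indices via Math.floorMod. The class name, the choice of prime, the use of String.hashCode, and the crude derived second hash are all my own illustration, not from the original answer; a real filter would use k independent, higher-quality hash functions:

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    // Hypothetical prime table size; pick from a table of primes for real use.
    private static final int PRIME_BITSET_SIZE = 1_000_003;

    // Pre-sized so the BitSet never has to grow dynamically.
    private final BitSet bits = new BitSet(PRIME_BITSET_SIZE);
    private final int k; // number of probe positions per element

    public SimpleBloomFilter(int k) { this.k = k; }

    public void add(String s) {
        for (int i = 0; i < k; i++) bits.set(index(s, i));
    }

    public boolean mightContain(String s) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(s, i))) return false; // definitely absent
        }
        return true; // possibly present (false positives are allowed)
    }

    private int index(String s, int i) {
        int h1 = s.hashCode();
        // Crude second hash derived from h1 (illustration only).
        int h2 = h1 * 0x9E3779B9 + 1;
        // floorMod keeps the index in [0, PRIME_BITSET_SIZE) even on overflow.
        return Math.floorMod(h1 + i * h2, PRIME_BITSET_SIZE);
    }
}
```

Usage: `new SimpleBloomFilter(3)`, then `add("...")` and `mightContain("...")`; a false return is definitive, a true return may be a false positive.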