Search code examples
javaalgorithmdata-structuresbloom-filter

I need to modify these hash functions so that it gives value in range 0<x<10^9


I was trying to implement bloom filter in java but the problem is that the hash functions are returning values in range -10^20

public long APHash(String str)
{
    long hash = 0xAAAAAAAA;
    for(int i = 0; i < str.length(); i++)
    {
        if ((i & 1) == 0)
        {
            hash ^= ((hash << 7) ^ str.charAt(i) * (hash >> 3));
        }
        else
             {
            hash ^= (~((hash << 11) + str.charAt(i) ^ (hash >> 5)));
        }
    }
    return hash;
}

Solution

  • Using Java Map or Set or similar data structures is missing the whole point of Bloom filters! E.g. a Java HashSet/HashMap entry is at least 20 bytes in a 32-bit JVM and 40 in a 64-bit one. The whole point of a Bloom filter is memory efficiency: one bit can represent more than one set element. So for a table that's 50% populated, you'd be trading away a factor of 160 (for the 64-bit table, only 80 for 32-bits) in memory efficiency! Bad idea.

    You should fix your hash functions. This is just a matter of choosing a prime number N for the Bloom bitmap size and computing the hash values mod N.

    If this is the signature of your current hash function:

    long APHash(String s) { ... }
    

    Then just use

    long getHashInRange(String s) { 
      return APHash(s) % PRIME_BITSET_SIZE;
    }
    

    If PRIME_BITSET_SIZE will fit in an int, then return int instead. An int index will allow for a table size (PRIME_BITSET_SIZE) of 2 billion.

    The prime-ness of the modulus makes pathological collision conditions less likely. This table of primes is a good resource for picking sizes.

    Don't use an array of byte (or anything larger) to store 1 or 0 in each element. Again, you'd be giving up a factor of 8 in memory use. For table sizes of 2^31-1 bits or less, use instead a BitSet or other appropriate container that stores each bit in a bit. If the table is at all large, be sure to create the BitSet with initial size PRIME_BITSET_SIZE rather than letting it grow dynamically. For larger tables - which require long indices - use the the Apache OpenBitSet class or similar.