Search code examples
pythonbloom-filter

BloomFilter Python


I am new to python and trying to create a bloomFilter based on Bit torrent BEP 33. I have created Bloom Filter but it is not exactly what I am looking for. Here's what I need and I have not completely understood this situation. If someone here can explain ...

//fixed parameters

k = 2

m = 256*8

//the filter
byte[m/8] bloom   ## What is this part?

function insertIP(byte[] ip) {

    byte[20] hash = sha1(ip)

    int index1 = hash[0] | hash[1] << 8
    int index2 = hash[2] | hash[3] << 8

    // truncate index to m (11 bits required)
    index1 %= m  ## ?
    index2 %= m  ## ?

    // set bits at index1 and index2
    bloom[index1 / 8] |= 0x01 << index1 % 8   ## ??
    bloom[index2 / 8] |= 0x01 << index2 % 8   ## ??
 }

 // insert IP 192.168.1.1 into the filter:
 insertIP(byte[4] {192,168,1,1})

And this is what I have created

import hashlib
m = 2048
def hashes(s):
    index = [0, 0]
    #for c in s:
        #o = ord(c)
    index[0] = hashlib.sha224(index[0]).hexdigest ## This needs integer hash
    index[1] = hashlib.sha224(index[1]).hexdigest ## same as above 

    return [x % m for x in index]

class BloomFilter(object):
    def __init__(self):
        self.bitarray = [0] * m

    def add(self, s):
        for x in hashes(s):
            self.bitarray[x] = 1
        #print self.bitarray
    def query(self, s):
        return all(self.bitarray[x] == 1 for x in hashes(s))

shazib=BloomFilter()
shazib.add('192.168.0.1')
print shazib.query('192.168.0.1')

Solution

  • First, an explanation of the code...

    //fixed parameters
    
    k = 2
    

    This is the most baffling line to me; k isn't used at all...

    m = 256*8
    

    This is the number of bits in 256 bytes.

    //the filter
    byte[m/8] bloom   ## What is this part?
    

    bloom is an array of 256 bytes, i.e. 256 * 8 bits, i.e. m bits. Each bit in bloom will contain information about what values are in the filter.

    function insertIP(byte[] ip) {
    
        byte[20] hash = sha1(ip)
    

    This creates a 20-byte hash of ip.

        int index1 = hash[0] | hash[1] << 8
        int index2 = hash[2] | hash[3] << 8
    

    These two lines calculate two indices into bloom based on the hash. Basically, index1 is the concatenation of the first two bytes of hash, and index2 is the concatenation of the second two bytes of hash.

        // truncate index to m (11 bits required)
        index1 %= m  ## ?
        index2 %= m  ## ?
    

    These two lines truncate the values so that they don't exceed the range of possible indices into bloom. The % is the mod operator; it returns the remainder after division. (17 % 4 = 1, 22 % 5 = 2 and so on.) Remember that bloom is 256 * 8 bits long? Eleven bits allows us to encode 2 ** 11 possible indices, i.e. 2048 values, i.e. 256 * 8 values.

        // set bits at index1 and index2
        bloom[index1 / 8] |= 0x01 << index1 % 8   ## ??
        bloom[index2 / 8] |= 0x01 << index2 % 8   ## ??
    

    We're treating bloom as a bit-array, so we have to do some bit-twiddling to access the correct bit. First, divide indexA by 8, to get the correct byte, then truncate indexA using the % operator to get the correct bit within that byte.

    }
    
    // insert IP 192.168.1.1 into the filter:
    insertIP(byte[4] {192,168,1,1})
    

    And voila, we have a bloom filter. If you printed it out bitwise, it would look like this:

    data->    001011000101110011000001001000100...
    
    indices-> 000000000011111111112222222222333...
              012345678901234567890123456789012...
    

    And if a particular i.p., when hashed, generates an index1 of 5 and an index2 of 9, then it would be considered "in" the filter, because the bits at indices 5 and 9 are set to 1. Of course, there can be false positives, because multiple different values could result in the same indices; but there can be no false negatives.

    import hashlib
    m = 2048
    def hashes(s):
        index = [0, 0]
        #for c in s:
            #o = ord(c)
        index[0] = hashlib.sha224(index[0]).hexdigest ## This needs integer hash
        index[1] = hashlib.sha224(index[1]).hexdigest ## same as above 
    

    Here's your first problem. index[0] and index[1] need to be integers. Also, hashlib.sha224(index[0]).hexdigest returns a method. You have to call the method to get anything out of it, like this: hashlib.sha224(index[0]).hexdigest(). Also, if you want this to work in a way that's identical to the above code, you could convert the hash to an int (you can use int(x, 16) to convert a hexadecimal string into an integer) and then extract the first two bytes using & 65535, then shift it by two bytes using >> 16, then extract those two bytes using & 65535 again. Once you've got that correct, the rest works.

        return [x % m for x in index]
    
    class BloomFilter(object):
        def __init__(self):
            self.bitarray = [0] * m
    
        def add(self, s):
            for x in hashes(s):
                self.bitarray[x] = 1
            #print self.bitarray
        def query(self, s):
            return all(self.bitarray[x] == 1 for x in hashes(s))
    
    shazib=BloomFilter()
    shazib.add('192.168.0.1')
    print shazib.query('192.168.0.1')