I'm studying hash tables and I'm a little bit confused about this hash table implementation written in Python which is from textbook. I understand all of it beside one point where inside addEntry() method there is a for loop which does hashBucket[i] = (dictKey, dictVal)
if hashBucket[i][0] == dictKey
I'm a little bit confused about what it does.
class intDict(object):
def __init__(self, numBuckets):
self.buckets = []
self.numBuckets = numBuckets
for i in range(numBuckets):
self.buckets.append([])
def addEntry(self, dictKey, dictVal):
hashBucket = self.buckets[dictKey%self.numBuckets]
for i in range(len(hashBucket)):
if hashBucket[i][0] == dictKey:
hashBucket[i] = (dictKey, dictVal)
return
hashBucket.append((dictKey, dictVal))
def getValue(self, dictKey):
hashBucket = self.buckets[dictKey%self.numBuckets]
for e in hashBucket:
if e[0] == dictKey:
return e[1]
return None
def __str__(self):
result = '{'
for b in self.buckets:
for e in b:
result = result + str(e[0]) + ':' + str(e[1]) + ','
return result[:-1] + '}'
Take a step back, and think about a hash table. You just have an array of buckets, and the way you can get an index to determine what bucket you'll look at given a key is based of some hash function, f()
. Imagine you have an infinite amount of keys, but f()
can only produce, say ~2 billion unique hashes. Pigeon hole principal states that if you have n buckets and n+1 objects to put in the buckets, then at least one of the buckets will contain more than one object. You can imagine a worst case scenario, that f()
is a bad hash function and we put everything in the same bucket.
Really, there's no way to guarantee that a hash table won't have key collision for an infinitely large set of keys and a restricted set of hashes. So, why this?
for i in range(len(hashBucket)):
if hashBucket[i][0] == dictKey:
hashBucket[i] = (dictKey, dictVal)
return
That's accounting for the possibility that two or more keys may exist in the same bucket. Not only do we go to the index of the hashed value, but we must go through all values in that bucket to determine if it is the value we are looking for. If it is, we need to store that new value with a key as well, in some fashion. You still need to determine the key to determine key equality (because you cannot trust the hash to do so), and you'll need the value to make the data structure worthwhile, hence the (key, value)
tuple.