Tags: python, algorithm, dictionary, hashmap, hashtable

How to implement consistent hashing of integer keys in a custom Python dictionary?


I am trying to build a dictionary from scratch in Python. I have done most of the work but am stuck on one problem. I use the built-in hash() to get the hash of the key (the key can be an int or a str), and then compute the index as hash(key) % capacity.

If the key is a string, everything works fine. As soon as the key is an integer, my custom dictionary breaks: sometimes everything works, but other times a key lands at index 0 (for instance) when it is added, yet the same key maps to index 4 (for instance) when it is looked up. That raises a KeyError, because the key is stored at index 0, not 4.

My suspicion is that the index is first calculated as hash(key) % capacity with the capacity at 4 (for instance), but once the capacity is doubled to 8, hash(key) % capacity yields a different index, which causes the mismatch. I took the formula hash(key) % capacity from Wikipedia. Is this the problem I am facing, and if not, what is actually causing this unwanted behavior and how can I tackle it? Here is my code below:

    class MyDictionary:
    
        __LOAD_FACTOR_LIMIT = 0.75
        __DEFAULT_CAPACITY = 4

        def __init__(self):
            self.__capacity = self.__DEFAULT_CAPACITY
            self.__keys = [[] for i in range(self.__capacity)]
            self.__values = [[] for i in range(self.__capacity)]

        @property
        def keys(self):
            return [item for current_list in self.__keys for item in current_list]

        @property
        def values(self):
            return [value for value_list in self.__values for value in value_list]

        def __setitem__(self, key, value):
            while self.__compute_load_factor() >= self.__LOAD_FACTOR_LIMIT:
                self.__extend_dict()

            index_hash = self.__hash_function(key)

            if self.__is_key_in_dict(index_hash, key):
                self.__set_value_to_an_existing_key(index_hash, key, value)
                return

            self.__set_value_to_a_new_key(index_hash, key, value)

        def __getitem__(self, key):
            index_hash = self.__hash_function(key)
            if self.__is_key_in_dict(index_hash, key):
                index_bucket = self.__get_index_bucket(index_hash, key)
                return self.__values[index_hash][index_bucket]
            raise KeyError('Key is not in dictionary!')

        def __str__(self):
            key_values = zip(self.keys, self.values)
            result = '{' + ", ".join([f"{key}: {value}"
                                      if isinstance(key, int) else f"'{key}': {value}"
                                      for key, value in key_values]) + '}'
            return result

        def __hash_function(self, key):
            index_hash = hash(key) % self.__capacity
            return index_hash

        def __is_key_in_dict(self, index_hash, key):
            return key in self.__keys[index_hash]

        def __get_index_bucket(self, index_hash, key):
            index_bucket = self.__keys[index_hash].index(key)
            return index_bucket

        def __extend_dict(self):
            self.__keys += [[] for i in range(self.__capacity)]
            self.__values += [[] for i in range(self.__capacity)]
            self.__capacity *= 2

        def __set_value_to_an_existing_key(self, index_hash, key, value):
            index_bucket = self.__get_index_bucket(index_hash, key)
            self.__values[index_hash][index_bucket] = value

        def __set_value_to_a_new_key(self, index_hash, key, value):
            self.__keys[index_hash].append(key)
            self.__values[index_hash].append(value)

        def __compute_load_factor(self):
            k = len(self.__keys)
            n = len([bucket for bucket in self.__keys if bucket])
            return n / k

        def get(self, key, return_value=None):
            index_hash = self.__hash_function(key)
            # Check membership before computing the bucket index,
            # and return the value (not the key) for an existing key.
            if self.__is_key_in_dict(index_hash, key):
                index_bucket = self.__get_index_bucket(index_hash, key)
                return self.__values[index_hash][index_bucket]
            return return_value

        def add(self):
            pass

        def pop(self):
            pass

        def clear(self):
            self.__capacity = self.__DEFAULT_CAPACITY
            self.__keys = [[] for i in range(self.__capacity)]
            self.__values = [[] for i in range(self.__capacity)]

        def items(self):
            zipped_key_value = zip(self.keys, self.values)
            return list(zipped_key_value)



dictionary = MyDictionary()
dictionary.add()
dictionary[4] = 'hey'
dictionary['2'] = 'cya'
dictionary['4'] = 'welcome'
dictionary['5'] = 'welcome'
dictionary['32'] = 'heya'
dictionary['31'] = 'heya'
dictionary['36'] = 'heya'
dictionary['34'] = 'heya'
print(dictionary[4])
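The suspected index shift can be checked in isolation. In CPython, hash(n) equals n for small non-negative integers, so the bucket index computed as hash(key) % capacity is predictable, and it visibly changes once the capacity doubles (this is a minimal standalone check, separate from the class above):

```python
# In CPython, hash(n) == n for small non-negative integers,
# so bucket indices computed as hash(key) % capacity are predictable.
key = 4

index_before = hash(key) % 4  # capacity 4: bucket index 0
index_after = hash(key) % 8   # capacity doubled to 8: bucket index 4

print(index_before, index_after)  # prints: 0 4
```

String keys hit the same problem, but less predictably, since str hashes are randomized per process via PYTHONHASHSEED.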

Solution

  • This happens because increasing the capacity (stored in the __capacity attribute) via the __extend_dict method invalidates the positions of the existing entries: every bucket index is derived as hash(key) modulo the capacity, so doubling the capacity changes the index that the same key maps to.

    You should therefore re-insert the existing keys and values at their new indices every time you increase the dict's capacity:

    def __extend_dict(self):
        self.__capacity *= 2
        new_keys = [[] for _ in range(self.__capacity)]
        new_values = [[] for _ in range(self.__capacity)]
        for keys, values in zip(self.__keys, self.__values):
            for key, value in zip(keys, values):
                index_hash = self.__hash_function(key)
                new_keys[index_hash].append(key)
                new_values[index_hash].append(value)
        self.__keys = new_keys
        self.__values = new_values
    

    Demo: https://replit.com/@blhsing/NewEnchantingPerimeter
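    As a sanity check of the rehash-on-resize idea, here is a simplified, self-contained sketch (a stand-in for MyDictionary that stores (key, value) pairs per bucket rather than parallel lists): keys remain retrievable across several resizes once every existing entry is re-inserted at its new index.

    ```python
    class TinyDict:
        """Minimal separate-chaining table that rehashes on every resize."""

        def __init__(self, capacity=4):
            self._capacity = capacity
            self._buckets = [[] for _ in range(capacity)]

        def _index(self, key):
            return hash(key) % self._capacity

        def __setitem__(self, key, value):
            # Grow (and rehash) once the table is half full.
            if sum(len(b) for b in self._buckets) >= self._capacity // 2:
                self._resize()
            bucket = self._buckets[self._index(key)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)  # overwrite existing key
                    return
            bucket.append((key, value))

        def __getitem__(self, key):
            for k, v in self._buckets[self._index(key)]:
                if k == key:
                    return v
            raise KeyError(key)

        def _resize(self):
            self._capacity *= 2
            old_buckets = self._buckets
            self._buckets = [[] for _ in range(self._capacity)]
            for bucket in old_buckets:
                for k, v in bucket:
                    # Re-insert at the index derived from the NEW capacity.
                    self._buckets[self._index(k)].append((k, v))


    d = TinyDict()
    for n in range(20):
        d[n] = str(n)
    print(d[4])  # prints: 4 (still found after several resizes)
    ```

    Without the re-insertion loop in _resize, the lookup of d[4] would fail exactly as described in the question.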