From the run time of programs that call hash() on variable-length strings, I got the impression that hash() takes constant time regardless of string length. To verify this assumption I came up with the following strategy: time hash() on strings of increasing length and plot the results.

If my guess that hash() is a constant-time operation on strings is correct, can you please explain in layman's terms why that is? A conceptual or theoretical explanation would be preferable to a reference to the source code: how do even very large strings produce a hash almost instantaneously, regardless of character length?

The following is the code implementation of the above-mentioned strategy:
import random
import time

import pylab


def form_str(k, n):
    """
    Generate n strings, each made of k arbitrary characters.
    """
    for i in range(n):
        random_str = ''
        for j in range(k):
            nextchar = random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
            random_str += nextchar
        yield random_str


def time_hash(k, n):
    """
    Find the total time to hash n strings of k characters each.
    """
    total_time = 0.0
    for element in form_str(k, n):
        start_time = time.time()
        # Because hash() works almost instantaneously we repeat
        # it 100,000 times to get a non-zero run time.
        for i in range(100000):
            hash(element)
        end_time = time.time()
        total_time += (end_time - start_time)
    return round(total_time, 2)

# print(time_hash(100, 100))


def plot_time():
    """
    Plot the time taken to hash against string length (k) over a range of k-values.
    """
    x_values, y_values = [], []
    for k in range(0, 100000, 5000):
        x_values.append(k)
        y_values.append(time_hash(k, 100))
    # print(y_values)
    pylab.figure(1)
    pylab.title('Hash_Time_Complexity')
    pylab.xlabel('Length of string')
    pylab.ylabel('Time taken to hash string (100 x 100000 times)')
    pylab.plot(x_values, y_values, 'r:', label='Hash time')
    pylab.legend()
    pylab.show()

plot_time()
# From the plot it appears that hashing takes the same amount
# of time regardless of the string length.
Since strings are immutable, the hash code of a string is computed only once and cached thereafter, so all but the first of your 100,000 hash() calls per string simply return the stored value. That is why the plot looks flat.
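A quick way to see this caching in action (a minimal sketch, assuming CPython, which stores the computed hash on the string object) is to time the first hash() call on a freshly built string against a second call on the same object:

import time

# A fresh 100-million-character string that has never been hashed.
s = 'a' * 100_000_000

start = time.perf_counter()
hash(s)  # first call: the hash is actually computed over all characters, O(n)
middle = time.perf_counter()
hash(s)  # second call: the cached value is simply returned, O(1)
end = time.perf_counter()

print('first  hash():', middle - start, 'seconds')
print('second hash():', end - middle, 'seconds')

If the caching explanation is right, the first call should take noticeably longer than the second.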
A better way to benchmark would be to generate distinct (unique) strings of length k and average the time of a single hash() call on each, instead of calling hash() on the same string multiple times.
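Here is a minimal sketch of that improved benchmark (time_fresh_hashes is a hypothetical helper name). It times exactly one hash() call per freshly built string, so no cached value can be reused:

import random
import string
import time

def time_fresh_hashes(k, n):
    """Average the time of the first hash() call over n distinct k-length strings."""
    total = 0.0
    for _ in range(n):
        # Build a fresh random string so no cached hash exists yet.
        s = ''.join(random.choice(string.ascii_letters) for _ in range(k))
        start = time.perf_counter()
        hash(s)  # the one and only hash of this string
        total += time.perf_counter() - start
    return total / n

for k in (1000, 10000, 100000, 1000000):
    print(k, time_fresh_hashes(k, 100))

Measured this way, the average time should grow roughly linearly with k, showing that computing a string's hash is linear in its length; only repeated hashes of the same string are constant time.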