Search code examples
pythonruntimetimeithamming-distance

Interpreting Hamming Distance speed in python


I've been working on making my python more pythonic and toying with runtimes of short snippets of code. My goal to improve the readability, but additionally, to speed execution.

This example conflicts with the best practices I've been reading about and I'm interested to find the where the flaw in my thought process is.

The problem is to compute the hamming distance on two equal length strings. For example the hamming distance of strings 'aaab' and 'aaaa' is 1.

The most straightforward implementation I could think of is as follows:

def hamming_distance_1(s_1, s_2):
    dist = 0
    for x in range(len(s_1)):
        if s_1[x] != s_2[x]:  dist += 1
    return dist

Next I wrote two "pythonic" implementations:

def hamming_distance_2(s_1, s_2): 
    return sum(i.imap(operator.countOf, s_1, s_2))

and

def hamming_distance_3(s_1, s_2): 
    return sum(i.imap(lambda s: int(s[0]!=s[1]), i.izip(s_1, s_2)))  

In execution:

s_1 = (''.join(random.choice('ABCDEFG') for i in range(10000)))
s_2 = (''.join(random.choice('ABCDEFG') for i in range(10000)))
print 'ham_1  ',  timeit.timeit('hamming_distance_1(s_1, s_2)',  "from __main__ import s_1,s_2, hamming_distance_1",number=1000)
print 'ham_2  ',  timeit.timeit('hamming_distance_2(s_1, s_2)',  "from __main__ import s_1,s_2, hamming_distance_2",number=1000)
print 'ham_3  ',  timeit.timeit('hamming_distance_3(s_1, s_2)',  "from __main__ import s_1,s_2, hamming_distance_3",number=1000)

returning:

ham_1   1.84980392456
ham_2   3.26420593262
ham_3   3.98718094826

I expected that ham_3 would run slower then ham_2, due to the fact that calling a lambda is treated as a function call, which is slower then calling the built in operator.countOf.

I was surprised I couldn't find a way to get a more pythonic version to run faster then ham_1 however. I have trouble believing that ham_1 is the lower bound for pure python.

Thoughts anyone?


Solution

  • The key is making less method lookups and function calls:

    def hamming_distance_4(s_1, s_2):
        return sum(i != j for i, j in i.izip(s_1, s_2))
    

    runs at ham_4 1.10134792328 in my system.

    ham_2 and ham_3 makes lookups inside the loops, so they are slower.