Since this must be quite a common scenario I'm wondering whether there are any pre-existing solutions to the following.
Lets say I have a set of N strings, and I am computing distances the distances between them. In this case it's a Hamming distance, but that's not really important.
If I wanted to make this as quick as possible, I would avoid self comparisons like so:
def hamming_distance(string1, string2):
"""Return the Hamming distance between equal-length sequences"""
if len(string1) != len(string2):
raise ValueError("Undefined for sequences of unequal length")
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
ratios=[]
for a, i in enumerate(string_list):
for b, j in enumerate(string_list):
if a == b: # Avoid self comparisons for speed
break
ratios.append(hamming_distance(string_list[i], string_list[j]))
return ratios
But since this is 'symmetric', I could also throw away any reciprocal comparisons which would increase the speed if the strings were numerous and/or large.
Is there a generally accepted/elegant way of doing this in the above set up?
I'm also aware that in general it's advised to avoid nested loops as they can be slow - so if there is a better way of achieving this pairwise iteration over lists (maybe something in collections
?) and incorporating the avoidance of self and reciprocal comparisons, I'm all ears.
You can limit the nested for to start from the next item to the current item in the outer loop. In that way, you only run through each unique once:
for i, s1 in enumerate(string_list):
for s2 in string_list[i+1:]:
ratios.append(hamming_distance(s1, s2))
return ratios
You could put this in a list comp.:
ratios = [(s1, s2, hamming_distance(s1, s2)) for i, s1 in enumerate(string_list)
for s2 in string_list[i+1:]]
You could put the strings in a tuple as part of the result like I've done in the list comp.