Search code examples
pythonalgorithmbigdatahamming-distance

Finding Minimum hamming distance of a set of strings in python


I have a set of n (~1000000) strings (DNA sequences) stored in a list trans. I have to find the minimum hamming distance of all sequences in the list. I implemented a naive brute force algorithm, which has been running for more than a day and has not yet given a solution. My code is

dmin=len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i+1,len(trans)):
            dist=hamdist(trans[i][:-1], trans[j][:-1])
            if dist < dmin:
                    dmin = dist

Is there a more efficient method to do this? Here hamdist is a function I wrote to find hamming distances. It is

def hamdist(str1, str2):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1),len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
          diffs += 1
    return diffs

Solution

  • You could optimize your hamdist function by adding an optional parameter containing the minimum distance you have got so far, this way if diffs reaches that value you stop calculating the distance because this comparison will give you a greater distance than the minimum:

    def hamdist(str1, str2,prevMin=None):
        diffs = 0
        if len(str1) != len(str2):
            return max(len(str1),len(str2))
        for ch1, ch2 in zip(str1, str2):
            if ch1 != ch2:
                diffs += 1
                if prevMin is not None and diffs>prevMin:
                    return None
        return diffs 
    

    You will need to adapt your main loop to work with None return value from hamdist:

    dmin=len(trans[0])
    for i in xrange(len(trans)):
        for j in xrange(i+1,len(trans)):
                dist=hamdist(trans[i][:-1], trans[j][:-1])
                if dist is not None and dist < dmin:
                        dmin = dist