Search code examples
pythonhamming-distance

Finding the k-mer that results in minimum hamming distance in a list of dna


I am writing a program that given an input k=some number and DNA=a list of fragments of dna, the output should give a k-mer of size k that has the minimum hamming distance in the array of strings. I have three functions, 1. one that calculates the hamming distance between the k-mer and different windows of a fragment dna and returns the hamming distance of the window with the lowest score, 2. one that generates all possible k-mers of size k, and 3. one that iterates through all of the windows of size k and for every possible k-mer. Unfortunately, my program is giving me the output AAA, which is incorrect. I know that my logic error is not in combination(k) nor in hammingDistance, since I've used them before to get the correct result.

import itertools
def combination(k):
    bases=['A','T','G','C']
    combo=[''.join(p) for p in itertools.product(bases, repeat=k)]
    return combo

def hammingDistance(pattern, seq):
        if pattern == seq:
               return 0
        else:
                dist=0
                for i in range(len(seq)):
                        if pattern[i] != seq[i]:
                                dist += 1
        return dist

def median_string(k, DNA):
    k_mers = combination(k)
    distance = 0 
    temp = 1000000000000000000
    for string in DNA:
        hamming = 1000000000000000000
        c = 0
        for k_mer in k_mers:
            for subset in string[c: len(string) - k]:
                if hamming > hammingDistance(k_mer, string[c : c+k]):
                    hamming = hammingDistance(k_mer, string[c : c+k])
                c += 1
            distance += hamming
            if distance < temp:
                temp = distance
                best_pattern = k_mer
            distance = 0
    return best_pattern

Solution

  • It turned out to just be an indentation error in the last conditional.

    def median_string(k, DNA):
        k_mers = combination(k)
        distance = 0
        temp = 1000000000000000000
        for k_mer in k_mers:
            for string in DNA:
                hamming = 1000000000000000000
                c = 0
                for subset in string[c: len(string) - k]:
                    if hamming > hammingDistance(k_mer, string[c : c+k]):
                        hamming = hammingDistance(k_mer, string[c : c+k])
                    c += 1
                distance += hamming
            if distance < temp:
                    temp = distance
                    best_pattern=k_mer
            distance=0
        return best_pattern