I've been struggling with a particular bioinformatics problem for the last few days, and I was wondering if somebody could either find the fault in my logic or fault in my code (or both). The function is supposed to find all k-mers with a Hamming distance of at most d from all of the strings of DNA. This is what I'm trying to do:
Start with an iteration of all possible k-mers and compare them to each string
This implies I need another loop that goes through the strings of DNA as well
I make a while loop for c+k <= len(DNA[0])-1
. c+k
is my window of size k
, and I want to find at least one window in each string of DNA where my combo has a Hamming distance from that string equal to or less than an arbitrary d
. If the Hamming distance meets the criterion, then the while loop breaks, allowing the next string to be compared. If it doesn't, then the window is changed, and if c+k==len(DNA[0])-1
, and the Hamming distance still doesn't meet the criterion, I create a name error int(a)
and the exception kills the inner_loop
.
However, my function returns nothing except set()
, which I don't understand.
import itertools
def combination(k):
bases=['A','T','G','C']
combo=[''.join(p) for p in itertools.product(bases, repeat=k)]
return combo
def hammingDistance(Pattern, seq):
if Pattern == seq:
return 0
else:
dist=0
for i in range(len(seq)):
if Pattern[i] != seq[i]:
dist += 1
return dist
def motif_enumeration(k, d, DNA):
combos = combination(k)
global pattern
for combo in combos:
try:
inner_loop(k, d, DNA, combo)
except:
continue
return set(pattern)
def inner_loop(k, d, DNA, combo):
global pattern
for strings in DNA:
inner_loop_two(k, d, DNA, combo, strings)
def inner_loop_two(k, d, DNA, combo, strings):
global pattern
c=0
while c+k < len(DNA[0]):
print(combo, strings[c:c+k], hammingDistance(combo, strings[c:c+k]))
if d >= hammingDistance(combo, strings[c:c+k]) and strings == DNA[len(DNA)-1]:
#if we've reached the last string and the condition is met,
#that means that the combo is suitable for each string of DNA
pattern += [combo]
elif d >= hammingDistance(combo, strings[c:c+k]):
#condition is met for one string, now move onto next
break
elif d < hammingDistance(combo, strings[c:c+k]) and c+k == len(DNA[0])-1:
#Name error causes this inner loop two to crash, thus causing the first inner loop
#to pass
int(a)
elif d < hammingDistance(combo, strings[c:c+k]):
#change the window to see if the combo is valid later in the string
c += 1
pattern = []
DNA=['ATTTGGC', 'TGCCTTA', 'CGGTATC', 'GAAAATT']
print(motif_enumeration(3,1,DNA))
print(pattern)
I thought that since my second inner loop crashed, this would cause my first inner loop to pass, and then another combo in motif_enumeration
would be tested, but the first conditional in my inner_loop_two
never prints anything. I noticed also that when the inner loop crashes and motif_enumeration
continues, it continues for both the outer and the inner loop. Here's an example of what I mean...
AAA ATT 2
AAA TTT 3
AAA TTG 3
AAA TGG 3
AAT ATT 1
AAT TGC 3
AAT GCC 3
AAT CCT 2
AAT CTT 2
AAG ATT 2
AAG TTT 3
AAG TTG 2
AAG TGG 2
AAC ATT 2
AAC TTT 3
AAC TTG 3
AAC TGG 3
ATA ATT 1 etc...
My expected output is pattern=[ATA, ATT, GTT, TTT]
The core component of the logic is that we want to collect a combo into the pattern set if the combo matches at any position on all of the target strings. We can use Python's all
and any
functions to do this. Those functions work efficiently because they stop testing as soon as the result is decided.
import itertools
def combination(k):
return (''.join(p) for p in itertools.product('ATCG', repeat=k))
def hamming_distance(pattern, seq):
return sum(c1 != c2 for c1, c2 in zip(pattern, seq))
def window(s, k):
for i in range(1 + len(s) - k):
yield s[i:i+k]
def motif_enumeration(k, d, DNA):
pattern = set()
for combo in combination(k):
if all(any(hamming_distance(combo, pat) <= d
for pat in window(string, k)) for string in DNA):
pattern.add(combo)
return pattern
DNA = ['ATTTGGC', 'TGCCTTA', 'CGGTATC', 'GAAAATT']
print(motif_enumeration(3, 1, DNA))
output
{'GTT', 'ATA', 'TTT', 'ATT'}
I've made a few other changes to your code. We can calculate the Hamming distance efficiently by passing a generator to the sum
function. And we can save time & RAM by converting the combo tuples to strings with a generator rather than putting them into a list.
The motif_enumeration
function can be condensed even further into a set comprehension, but I must admit that it is rather dense, and even harder to read than the previous version. It may be slightly more efficient, though.
def motif_enumeration(k, d, DNA):
return {combo for combo in combination(k)
if all(any(hamming_distance(combo, pat) <= d
for pat in window(string, k)) for string in DNA)}
And here's a slightly more readable version, where I've given motif_enumeration
a helper function in_window
to perform the inner test.
# Return True if combo is within d in any window of string
def in_window(combo, string, k, d):
return any(hamming_distance(combo, pat) <= d for pat in window(string, k))
def motif_enumeration(k, d, DNA):
pattern = set()
for combo in combination(k):
if all(in_window(combo, string, k, d) for string in DNA):
pattern.add(combo)
return pattern