python dna-sequence

Trying to find substrings in a large string


Given this string

dna3 = "CATGTAATAGATGAATGACTGATAGATATGCTTGTATGCTATGAAAATGTGAAATGACCC"

the code below should print these 4 substrings:

ATGTAA
ATGAATGACTGATAG
ATGCTATGA
ATGTGA

However, it is printing the following:

ATGTAA
ATGAATGACTGATAG
ATGACTGATAGATATGCTTGTATGCTATGAAAATGTGAAATGACCC
ATGCTTGTATGCTATGAAAATGTGAAATGACCC
ATGCTATGA
ATGAAAATGTGA
ATGTGA
ATGACCC
None

Could someone please help me figure this out? Thank you.

def findStopIndex(dna,index):

    stop1 = dna.find("tga",index)
    if(stop1 == -1 or (stop1-index) % 3 != 0):
        stop1 = len(dna)

    stop2 = dna.find("taa",index)
    if(stop2 == -1 or (stop2-index) % 3 != 0):
        stop2 = len(dna)

    stop3 = dna.find("tag",index)
    if(stop3 == -1 or (stop3-index) % 3 != 0):
        stop3 = len(dna)

    return min(stop1, min(stop2,stop3))  

def printAll(dna):
    gene = None
    start = 0
    while(True):
        loc = dna.find("atg", start)
        if(loc == -1):break
        stop = findStopIndex(dna,loc+3)
        gene = dna[loc:stop+3]
        print gene.upper()
        start = loc + 3


print printAll(dna3.lower())

Solution

  • We may need some additional information about the DNA structure. From what you described, it seems the substrings can't overlap each other. In that case, you need to replace start = loc + 3 with start = stop + 3 (the characters seem to be grouped 3 by 3, also based on what you described).

    Finally, you don't need the print in print printAll(dna3.lower()): printAll has no return value, so the outer print is what shows the None at the end of your output.

    With those modifications, my output is:

    ATGTAA
    ATGAATGACTGATAG
    ATGCTTGTATGCTATGAAAATGTGAAATGACCC
    

    Hope it'll be helpful.
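
The third line of the output above still runs to the end of the string, because findStopIndex returns len(dna) when it finds no usable stop codon. Below is a minimal sketch that goes one step further, under two assumptions that the question does not confirm: an "atg" with no valid stop codon should be skipped entirely, and the search should then resume at loc + 3. It keeps the question's rule of checking only the first occurrence of each stop codon. The names find_stop_index and find_genes are illustrative, and the code is written for Python 3 (print as a function).

```python
STOP_CODONS = ("tga", "taa", "tag")


def find_stop_index(dna, index):
    """Return the smallest usable stop position at or after index, or -1.

    As in the question's code, only the FIRST occurrence of each stop
    codon is considered, and it counts only if it lies a multiple of 3
    characters away from index.
    """
    candidates = []
    for codon in STOP_CODONS:
        pos = dna.find(codon, index)
        if pos != -1 and (pos - index) % 3 == 0:
            candidates.append(pos)
    return min(candidates) if candidates else -1


def find_genes(dna):
    """Collect non-overlapping substrings from "atg" through a stop codon."""
    dna = dna.lower()
    genes = []
    start = 0
    while True:
        loc = dna.find("atg", start)
        if loc == -1:
            break
        stop = find_stop_index(dna, loc + 3)
        if stop == -1:
            # Assumption: no valid stop means no gene, so skip this "atg".
            start = loc + 3
            continue
        genes.append(dna[loc:stop + 3].upper())
        # Resume after the stop codon so that genes cannot overlap.
        start = stop + 3
    return genes


dna3 = "CATGTAATAGATGAATGACTGATAGATATGCTTGTATGCTATGAAAATGTGAAATGACCC"
for gene in find_genes(dna3):
    print(gene)
```

On dna3 this prints the four substrings listed in the question. Returning a list instead of printing inside the function also avoids the stray None. Note that scanning codon by codon for the stop, rather than checking only the first occurrence of each stop codon, would accept more start positions and print different substrings.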