Generating DNA sequence excluding specific sequence


I've just started learning to program in Python. In class we were asked to generate a random DNA sequence that does NOT contain a specific 6-letter sequence (AACGTT). The point is to write a function that always returns a legal sequence. Currently my function generates a legal sequence only about 78% of the time. How can I make it return a legal sequence 100% of the time? Any help is appreciated.

Here is what my code looks like for now:

from random import choice
def generate_seq(length, enzyme):
    list_dna = []
    nucleotides = ["A", "C", "T", "G"]
    i = 0
    while i < 1000:
        nucleotide = choice(nucleotides)
        list_dna.append(nucleotide)
        i = i + 1

    dna = ''.join(str(nucleotide) for nucleotide in list_dna)
    return(dna) 


seq = generate_seq(1000, "AACGTT")
if len(seq) == 1000 and seq.count("AACGTT") == 0:
    print(seq)

Solution

  • One option is to check the last few entries inside your loop and redraw the newest letter whenever it would complete the 'bad' sequence. Note that this skews the distribution slightly: "AACGT" followed by a letter other than the final "T" will show up somewhat more often than it would in a truly random sequence.

    from random import choice

    def generate_seq(length, enzyme):
        list_dna = []
        nucleotides = ["A", "C", "T", "G"]
        i = 0
        while i < length:
            nucleotide = choice(nucleotides)
            list_dna.append(nucleotide)
            # Check for the forbidden site. If the last letters spell it,
            # remove the last element and redraw.
            if ''.join(list_dna[-len(enzyme):]) == enzyme:
                list_dna.pop()
            else:
                i = i + 1

        dna = ''.join(list_dna)
        return dna


    seq = generate_seq(1000, "AACGTT")
    if len(seq) == 1000 and seq.count("AACGTT") == 0:
        print(seq)
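
If the slight bias mentioned above matters, another option is to regenerate the whole sequence until it is legal (rejection sampling). Since a fresh 1000-letter sequence is already legal about 78% of the time, as noted in the question, this should need only one or two attempts on average, and every legal sequence stays equally likely. The following is a minimal sketch, not part of the original answer; the name `generate_seq_rejection` and the `max_tries` safety limit are assumptions made for illustration.

```python
from random import choice


def generate_seq_rejection(length, enzyme, max_tries=10_000):
    """Return a random DNA string of `length` letters that never contains `enzyme`.

    Hypothetical helper: draws complete sequences and discards any that
    contain the forbidden site. `max_tries` is an assumed safety limit.
    """
    nucleotides = "ACGT"
    for _ in range(max_tries):
        # Draw a full candidate sequence, one random letter at a time.
        dna = "".join(choice(nucleotides) for _ in range(length))
        # Keep the candidate only if the forbidden site is absent.
        if enzyme not in dna:
            return dna
    # Reached only if every attempt contained the site,
    # e.g. when the site is very short relative to the length.
    raise RuntimeError("no legal sequence found within max_tries attempts")


seq = generate_seq_rejection(1000, "AACGTT")
print(len(seq), "AACGTT" in seq)
```

This approach is practical only while legal sequences are common. For a much shorter forbidden site or a much longer sequence, most candidates would be rejected, and the pop-and-redraw loop above would be the better choice.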