So recently I've been trying to write a program that detects and cuts out the coding part of a DNA sequence based on start and stop codons.
The eventual goal is to compare 2 sequences of both 240 nucleotides long, however one causes sickle-cell disease, so you want to see the difference between the coding parts that both result.
This is the code I've written so far which actually works with the sequence that comes with it.
sequence = "CCATGCTTGATCA"
sequence_list = list(sequence)
codon_list = ["ATG", "TAA", "TAG", "TGA"]
position_list = []
length_sequence = len(sequence)
length_codon = len(codon_list)
length_position = len(position_list)
n = length_sequence-1
while n >= 0:
for i in range(0, len(codon_list)):
codon_sub_list = list(codon_list[i])
if sequence_list[n] == codon_sub_list[2] and sequence_list[n-1] == codon_sub_list[1] and sequence_list[n-2] == codon_sub_list[0]:
position_list.append(n-2)
print(sequence_list[n], "@", n)
print(sequence_list[n-1], "@", n-1)
print(sequence_list[n-2], "@", n-2)
n-=1
print(len(position_list))
print(sequence[position_list[length_position-1]:(position_list[0]+3)])
Now, the results of that when I made it two days ago were very promising. It gave the following results, as expected:
A at location 9
G at location 8
T at location 7
G at location 4
T at location 3
A at location 2
[7,2]
ATGCTTGA
However, today I tried continuing on the work by trying a different sequence, this time one of the 240 nucleotides long sequences. Below are the two sequences and which one it is.
Sickle-cell disease sequence:GAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCT
Normal sequence: GAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCT
This however is the results I get from executing it, I will quickly list the nucleotides and their location as most of them are irrelevant and it's mostly the last ones that matter.
[G,211] [T,210] [A,209] [G,199] [A,198] [T,197] [A,187] [A,186] [T,185] [A,145] [G,144] [T,143] [A,133] [G,132] [T,131] [G,132] [T,131] [A,130] [A,123] [G,122] [T,121] [A,78] [G,77] [T,76] [G,68] [T,67] [A,66] [G,47] [A,46] [T,45] [A,29] [G,28] [T,27] [A,1] [G,0] [T,-1]
[209, 197, 185, 143, 131, 130, 121, 76, 66, 45, 27, -1]
No sequence, just an empty line
Now clearly the problem arises when at the last codon, TGA, it notes the location of the T as -1, I however have no clue what causes this and have tried adjusting several values to somehow get it working, which it didn't do in any of the cases.
I'm wondering what can be causing this and what to do about it ? Also, I made this two days ago mostly as a draft to get started with and there are likely other things that can be better, so excuses if anything seems somewhat sloppy, in my opinion the whole while-loop could be done better but at that moment I chose it because a different approach to loop through didn't work, can't really remember what exactly it was anymore.
Note: I made a screenshot of my IDLE output to give you an idea:
I found your original code a bit unclear (starting at right edge with searching is counterintuitive to me), so I tried to code an alternative. The most important changes are that I now traverse the sequence left-to-right, and search for the codons by comparing them to subsequences in one go, not nucleotide-by-nucleotide. Here is my code, with some hopefully helpful comments. Does this do what you need? If not, let me know.
sequence = "GAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCT"
codon_list = ["ATG", "TAA", "TAG", "TGA"]
# store the starting positions of the codons
found_codon_positions = []
# note that we can use len() and [] with strings as well, no need to
# convert to list first
n = len(sequence)
k = 0
while k < n-2:
# extract a three-nucleotide subsequence
possible_codon = sequence[k:k+3]
if possible_codon in codon_list:
found_codon_positions.append(k)
k += 1
print('found codons at indices {}'.format(found_codon_positions))
print('extracted sequence:')
print(sequence[found_codon_positions[0]:found_codon_positions[-1]+3])
Output:
found codons at indices [27, 45, 66, 76, 121, 130, 131, 143, 185, 197, 209]
extracted sequence:
TGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATG