Hy Py-guys :). Since I am new in the coding world and as well in Python, I don’t have much experience with coding and thus any help would be appreciated. I am working with short tandem repeats in DNA sequences and I would like to have a code that reads and counts the repeated nucleotides based on the tandem motif of specified loci.
Here is an example what I need:
tandem motif:
AGAT,AGAC,[AGAT],gat,[AGAT]
input:
TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT
analyzed input:
TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT
output:
AGAT AGAC (AGAT)2 GAT (AGAT)12
allele: 16
Description
That tandem motif is different for each locus so I need to manually specify it for one and every locus (about 130 loci in total).
So in this case whole motif begins with AGAT
and end with the last copy of AGAT
There is no unknown nucleotide (A/C/T/G) between those specified in tandem motif and everything what is before and after this defined motif should be ignored
As you can see, when in tandem motif there are nucleotides written in lower case (gat), they are not included in the final allele value
Those motifs that are in brackets, these can repeat multiple times
Those that are not in brackets – these have only one copy in the sequence
There can also be this case:
tandem motif:
[CTAT],CTAA,[CTAT],N30,[TATC]
input:
TTTGCATGATCTCTTCTTGATCATTTTCTTCCCCCTTTCCTAAAAAATTCTGGTCCTTTGAGGTAACTGCCATTACCATATGAGTTAGTCTGGGTTCTCCAGAGAAACAGAACCAATAGGCTATCTATCTAACTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTACTATCTCTATATTATCTATCTATCTATTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTATCTATATCTTCTACCAAGTGATTTACTGTAATAAATTAGCTCATGCTATTATGGAGGATGAGTTCAAGATTTGTGGTCAGCAAGTTGCAGACTCA
analyzed input:
TTTGCATGATCTCTTCTTGATCATTTTCTTCCCCCTTTCCTAAAAAATTCTGGTCCTTTGAGGTAACTGCCATTACCATATGAGTTAGTCTGGGTTCTCCAGAGAAACAGAACCAATAGGCTATCTATCTAACTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTACTATCTCTATATTATCTATCTATCTATTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTATCTATATCTTCTACCAAGTGATTTACTGTAATAAATTAGCTCATGCTATTATGGAGGATGAGTTCAAGATTTGTGGTCAGCAAGTTGCAGACTCA
output:
(CTAT)2 CTAA (CTAT)12 (TATC)13
allele: 28
Description
N30 means, that there are 30 unspecified nucleotides before final tandem repeat
Summary
There can be these types in motifs, which need to be defined, and each locus would have different combination of motifs:
Brackets: example [CTAT] – multiple copies of CTAT
No brackets: example CTAT – only one copy of CTAT
N#: example N30 - means 30 unspecified nucleotides (A/C/G/T)
Lower case: example ctat - means that these are not included in final allele number
Examples of real motifs:
[CTTT],TT,CT,[CTTT]
[TCTA],[TCTG],[TCTA],ta,[TCTA],tca,[TCTA],tccata,[TCTA],TA,[TCTA]
[TAGA],[CAGA],N48,[TAGA],[CAGA]
[AAGA],[AAGG],[AAGA]
and many more…
Thank you all in advance. Any help and ideas would be appreciated! :)
A good way to solve your problem is to use regex. REGular EXpression are a common way in programming to parse strings
.
With regex you can define which pattern you want to search in a string (almost like you did), which is the core of your problem.
This means that regex have their own formatting, similar, but not identical, to yours.
You can also write some code to convert your formatting to regex formatting, but you probably should write another question, avoiding all the DNA stuff.
Lets see how regex works:
Here is how your summary looks like in regex pattern:
Summary
There can be these types in motifs, which need to be defined, and each locus would have different combination of motifs:
Brackets: example [CTAT] – multiple copies of CTAT - RegEx:
(CTAT)+
(one or more) or(CTAT)*
(zero or more)No brackets: example CTAT – only one copy of CTAT - RegEx:
(CTAT){1}
N#: example N30 - means 30 unspecified nucleotides (A/C/G/T) - RegEx:
.{30}
Lower case: example ctat - means that these are not included in final allele number - RegEx:
(?:CTAT)
With this knowledge, we can apply our regex to our inputs:
example 1:
import re # import regex module
tandem = r"((AGAT){1}(AGAC){1}(AGAT)+(?:GAT){1}(AGAT)+)"
mystring = "TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT" #input string
analyzed_input = re.findall(tandem, mystring)[0]
print(analyzed_input) #see the match found
tot = 0
max_len = max((len(al) for al in analyzed_input[1:] if len(al) <= 4)) # longest allele, maximum 4
remaining_string = analyzed_input[0] #string to analyzed. will be cutted in for loop
for allele in analyzed_input[1:]: #for each allele
section = re.findall(r"((" + re.escape(allele) + ")+)", remaining_string)[0][0] # section where the allele is repeated
value = section.count(allele) if len(allele) == max_len else section.count(allele)*(len(allele)/10.0) # get the value of the alleles. /10.0 if allele is shorter than the longest allele found
remaining_string = remaining_string[remaining_string.index(section)+len(section):] # cut away from remaining string the current section
print("The value of allele {0} is {1}\n".format(allele, value))
if len(allele) <= 4: #add the allele value if his length is less than 5
tot += value
print("total allele number is: {0}".format(tot))
OUTPUT: total allele number is: 16
for the next examples, I'm only showing the regex tandem
, the rest of the code is the same
example 2:
tandem2 = r"((TCTA)+(TCTG)+(TCTA)+(?:TA){1}(TCTA)+(?:TCA){1}(TCTA)+(?:TCCATA){1}(TCTA)+(TA)+(TCTA)+)"
OUTPUT: total allele number is: 32.2
example 3:
tandem3 = r"((TCTA)+(TCTG)+(TCTA)+(?:TA){1}(TCTA)+(?:TCA){1}(TCTA)+(?:TCCATA){1}(TCTA)+(TA)*(TCTA)*)"
OUTPUT: total allele number is: 31.0
example 4:
tandem4 = r"((CTAT)+(CTAA){1}(CTAT)+(.{30})(TATC)+)"
OUTPUT: total allele number is: 28.0
your other example will be written as:
[CTTT],TT,CT,[CTTT] r"((CTTT)+(TT){1}(CT){1}(CTTT)+)"
[TCTA],[TCTG],[TCTA],ta,[TCTA],tca,[TCTA],tccata,[TCTA],TA,[TCTA] r"((TCTA)+(TCTG)+(TCTA)+(?:TA){1}(TCTA)+(?:TCA){1}(TCTA)+(?:TCCATA){1}(TCTA)+(TA){1}(TCTA)+)"
[TAGA],[CAGA],N48,[TAGA],[CAGA] r"((TAGA)+(CAGA)+(.{48})(TAGA)+(CAGA)+)"
[AAGA],[AAGG],[AAGA] r"((AAGA)+(AAGG)+(AAGA)+)"
Developing a full working framework will take a little bit of time, depending on what level of flexibility you want to achieve, input type, automation...