Search code examples
pythonregexdna-sequence

Analyze tandem repeat motifs in DNA sequences


Hy Py-guys :). Since I am new in the coding world and as well in Python, I don’t have much experience with coding and thus any help would be appreciated. I am working with short tandem repeats in DNA sequences and I would like to have a code that reads and counts the repeated nucleotides based on the tandem motif of specified loci.

Here is an example what I need:


tandem motif:

AGAT,AGAC,[AGAT],gat,[AGAT]

input:

TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT

analyzed input:

TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT

output:

AGAT AGAC (AGAT)2 GAT (AGAT)12 
  • number of copies. (In output GAT is in upper case even if it doesn’t count viz Description)

allele: 16

  • total number of copies of each motif (1 + 1 + 2 + 12)

Description

That tandem motif is different for each locus so I need to manually specify it for one and every locus (about 130 loci in total).

So in this case whole motif begins with AGAT and end with the last copy of AGAT

There is no unknown nucleotide (A/C/T/G) between those specified in tandem motif and everything what is before and after this defined motif should be ignored

As you can see, when in tandem motif there are nucleotides written in lower case (gat), they are not included in the final allele value

Those motifs that are in brackets, these can repeat multiple times

Those that are not in brackets – these have only one copy in the sequence


There can also be this case:


tandem motif:

[CTAT],CTAA,[CTAT],N30,[TATC]

input:

TTTGCATGATCTCTTCTTGATCATTTTCTTCCCCCTTTCCTAAAAAATTCTGGTCCTTTGAGGTAACTGCCATTACCATATGAGTTAGTCTGGGTTCTCCAGAGAAACAGAACCAATAGGCTATCTATCTAACTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTACTATCTCTATATTATCTATCTATCTATTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTATCTATATCTTCTACCAAGTGATTTACTGTAATAAATTAGCTCATGCTATTATGGAGGATGAGTTCAAGATTTGTGGTCAGCAAGTTGCAGACTCA

analyzed input:

TTTGCATGATCTCTTCTTGATCATTTTCTTCCCCCTTTCCTAAAAAATTCTGGTCCTTTGAGGTAACTGCCATTACCATATGAGTTAGTCTGGGTTCTCCAGAGAAACAGAACCAATAGGCTATCTATCTAACTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTACTATCTCTATATTATCTATCTATCTATTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTATCTATATCTTCTACCAAGTGATTTACTGTAATAAATTAGCTCATGCTATTATGGAGGATGAGTTCAAGATTTGTGGTCAGCAAGTTGCAGACTCA

output:

(CTAT)2 CTAA (CTAT)12 (TATC)13

allele: 28

  • (2+1+12+13)

Description

N30 means, that there are 30 unspecified nucleotides before final tandem repeat



Summary

There can be these types in motifs, which need to be defined, and each locus would have different combination of motifs:

Brackets: example [CTAT] – multiple copies of CTAT

No brackets: example CTAT – only one copy of CTAT

N#: example N30 - means 30 unspecified nucleotides (A/C/G/T)

Lower case: example ctat - means that these are not included in final allele number


Examples of real motifs:

[CTTT],TT,CT,[CTTT]

[TCTA],[TCTG],[TCTA],ta,[TCTA],tca,[TCTA],tccata,[TCTA],TA,[TCTA]

[TAGA],[CAGA],N48,[TAGA],[CAGA]

[AAGA],[AAGG],[AAGA]

and many more…


Thank you all in advance. Any help and ideas would be appreciated! :)


Solution

  • A good way to solve your problem is to use regex. REGular EXpression are a common way in programming to parse strings.
    With regex you can define which pattern you want to search in a string (almost like you did), which is the core of your problem.
    This means that regex have their own formatting, similar, but not identical, to yours.
    You can also write some code to convert your formatting to regex formatting, but you probably should write another question, avoiding all the DNA stuff.

    Lets see how regex works:
    Here is how your summary looks like in regex pattern:

    Summary

    There can be these types in motifs, which need to be defined, and each locus would have different combination of motifs:

    Brackets: example [CTAT] – multiple copies of CTAT - RegEx: (CTAT)+ (one or more) or (CTAT)* (zero or more)

    No brackets: example CTAT – only one copy of CTAT - RegEx: (CTAT){1}

    N#: example N30 - means 30 unspecified nucleotides (A/C/G/T) - RegEx: .{30}

    Lower case: example ctat - means that these are not included in final allele number - RegEx: (?:CTAT)

    With this knowledge, we can apply our regex to our inputs:
    example 1:

    import re # import regex module
    
    tandem = r"((AGAT){1}(AGAC){1}(AGAT)+(?:GAT){1}(AGAT)+)"
    
    mystring = "TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT" #input string
    
    analyzed_input = re.findall(tandem, mystring)[0]
    
    print(analyzed_input) #see the match found
    
    tot = 0
    max_len = max((len(al) for al in analyzed_input[1:] if len(al) <= 4)) # longest allele, maximum 4
    remaining_string = analyzed_input[0] #string to analyzed. will be cutted in for loop
    for allele in analyzed_input[1:]: #for each allele
        section = re.findall(r"((" + re.escape(allele) + ")+)", remaining_string)[0][0] # section where the allele is repeated
        value = section.count(allele) if len(allele) == max_len else section.count(allele)*(len(allele)/10.0) # get the value of the alleles. /10.0 if allele is shorter than the longest allele found
        remaining_string = remaining_string[remaining_string.index(section)+len(section):] # cut away from remaining string the current section
        print("The value of allele {0} is {1}\n".format(allele, value))
        if len(allele) <= 4: #add the allele value if his length is less than 5
            tot += value
    
    print("total allele number is: {0}".format(tot))
    

    OUTPUT: total allele number is: 16

    for the next examples, I'm only showing the regex tandem, the rest of the code is the same

    example 2:

    tandem2 = r"((TCTA)+(TCTG)+(TCTA)+(?:TA){1}(TCTA)+(?:TCA){1}(TCTA)+(?:TCCATA){1}(TCTA)+(TA)+(TCTA)+)"
    

    OUTPUT: total allele number is: 32.2

    example 3:

    tandem3 = r"((TCTA)+(TCTG)+(TCTA)+(?:TA){1}(TCTA)+(?:TCA){1}(TCTA)+(?:TCCATA){1}(TCTA)+(TA)*(TCTA)*)"
    

    OUTPUT: total allele number is: 31.0

    example 4:

    tandem4 = r"((CTAT)+(CTAA){1}(CTAT)+(.{30})(TATC)+)"
    

    OUTPUT: total allele number is: 28.0

    your other example will be written as:

    [CTTT],TT,CT,[CTTT] r"((CTTT)+(TT){1}(CT){1}(CTTT)+)"
    
    [TCTA],[TCTG],[TCTA],ta,[TCTA],tca,[TCTA],tccata,[TCTA],TA,[TCTA] r"((TCTA)+(TCTG)+(TCTA)+(?:TA){1}(TCTA)+(?:TCA){1}(TCTA)+(?:TCCATA){1}(TCTA)+(TA){1}(TCTA)+)"
    
    [TAGA],[CAGA],N48,[TAGA],[CAGA] r"((TAGA)+(CAGA)+(.{48})(TAGA)+(CAGA)+)"
    
    [AAGA],[AAGG],[AAGA] r"((AAGA)+(AAGG)+(AAGA)+)"
    

    Developing a full working framework will take a little bit of time, depending on what level of flexibility you want to achieve, input type, automation...