Search code examples
pythoncountbioinformaticsbiopython

Counting triplets in a DNA-sequence


I want to make a code which counts all triplets in a sequence. I've read a plenty of posts so far, but none of them could help me.

This is my code:

def cnt(seq):
    mydict = {}
    if len(seq) % 3 == 0:
        a = [x for x in seq]
        for i in range(len(seq)//3):
            b = ''.join(a[(0+3*i):(3+3*i)])
            for base1 in ['A', 'T', 'G', 'C']:
                for base2 in ['A', 'T', 'G', 'C']:
                    for base3 in ['A', 'T', 'G', 'C']:
                        triplet = base1 + base2 + base3
                        if b == triplet:
                            mydict[b] = 1
        for key in sorted(mydict):
            print("%s: %s" % (key, mydict[key]))
    else:
        print("Error")

Does Biopython provide a function to solve this problem?

EDIT:

Note that, for instance, in the sequence 'ATGAAG', 'TGA' or 'GAA' are not "valid" triplets, only 'ATG' and 'AAG', because in biology and bioinformatics, we read it 'ATG' and 'AAG', thats the information we need to translate it or whatever else.

You can imagine it as a sequence of words, for example "Hello world". The way we read it is "Hello" and "world", not "Hello", "ello ", "llo w",...


Solution

  • You can use clever techniques, as suggested in the other answers, but I will build a solution starting from your code, which is almost working: Your problem is that every time you do mydict[b] = 1, you reset the count of b to 1.

    A minimal fix

    You could solve this by testing if the key is present, if not, create the entry in the dict, then increment the value, but there are more convenient tools in python.

    A minimal change to your code would be to use a defaultdict(int) instead of a dict. Whenever a new key is encountered, it is assumed to have the associated default value for an int: 0. So you can increment the value instead of resetting:

    from collections import defaultdict
    
    def cnt(seq):
         # instanciate a defaultdict that creates ints when necessary
         mydict = defaultdict(int)
         if len(seq) % 3 == 0:
             a = [x for x in seq]
             for i in range(len(seq)//3):
                 b = ''.join(a[(0+3*i):(3+3*i)])
                 for base1 in ['A', 'T', 'G', 'C']:
                     for base2 in ['A', 'T', 'G', 'C']:
                         for base3 in ['A', 'T', 'G', 'C']:
                             triplet = base1 + base2 + base3
                             if b == triplet:
                                 # increment the existing count (or the default 0 value)
                                 mydict[b] += 1
             for key in sorted(mydict):
                 print("%s: %s" % (key, mydict[key]))
         else:
             print("Error")
    

    It works as desired:

    cnt("ACTGGCACT")
    ACT: 2
    GGC: 1
    

    Some possible improvements

    Now let's try to improve your code a bit.

    First, as I wrote in the comments, let's avoid the un-necessary conversion of your sequence to a list, and use a better variable name for the currently counted codon:

    from collections import defaultdict
    
    def cnt(seq):
         mydict = defaultdict(int)
         if len(seq) % 3 == 0:
             a = [x for x in seq]
             for i in range(len(seq)//3):
                 codon = seq[(0+3*i):(3+3*i)]
                 for base1 in ['A', 'T', 'G', 'C']:
                     for base2 in ['A', 'T', 'G', 'C']:
                         for base3 in ['A', 'T', 'G', 'C']:
                             triplet = base1 + base2 + base3
                             if codon == triplet:
                                 mydict[codon] += 1
             for key in sorted(mydict):
                 print("%s: %s" % (key, mydict[key]))
         else:
             print("Error")
    

    Now lets simplify the nested loop part, trying all possible codons, by generating in advance the set of possible codons:

    from collections import defaultdict
    from itertools import product
    
    codons = {
        "".join((base1, base2, base3))
            for (base1, base2, base3) in product("ACGT", "ACGT", "ACGT")}
    
    def cnt(seq):
         mydict = defaultdict(int)
         if len(seq) % 3 == 0:
             a = [x for x in seq]
             for i in range(len(seq)//3):
                 codon = seq[(0+3*i):(3+3*i)]
                 if codon in codons:
                     mydict[codon] += 1
             for key in sorted(mydict):
                 print("%s: %s" % (key, mydict[key]))
         else:
             print("Error")
    

    Now, your code simply ignores the triplets that are not valid codons. Maybe you should instead issue a warning:

    from collections import defaultdict
    from itertools import product
    
    codons = {
        "".join((base1, base2, base3))
            for (base1, base2, base3) in product("ACGT", "ACGT", "ACGT")}
    
    def cnt(seq):
         mydict = defaultdict(int)
         if len(seq) % 3 == 0:
             a = [x for x in seq]
             for i in range(len(seq)//3):
                 codon = seq[(0+3*i):(3+3*i)]
                 # We count even invalid triplets
                 mydict[codon] += 1
             # We display counts only for valid triplets
             for codon in sorted(codons):
                 print("%s: %s" % (codon, mydict[codon]))
             # We compute the set of invalid triplets:
             # the keys that are not codons.
             invalid = mydict.keys() - codons
             # An empty set has value False in a test.
             # We issue a warning if the set is not empty.
             if invalid:
                 print("Warning! There are invalid triplets:")
                 print(", ".join(sorted(invalid)))
         else:
             print("Error")
    

    A more fancy solution

    Now a more fancy solution, using cytoolz (probably needs to be installed because it is not part of usual python distributions: pip3 install cytoolz, if you are using pip):

    from collections import Counter
    from itertools import product, repeat
    from cytoolz import groupby, keymap, partition 
    
    # To make strings out of lists of strings
    CAT = "".join
    
    # The star "extracts" the elements from the result of repeat,
    # so that product has 3 arguments, and not a single one
    codons = {CAT(bases) for bases in product(*repeat("ACGT", 3))}
    
    def cnt(seq):
        # keymap(CAT, ...) transforms the keys (that are tuples of letters)
        # into strings
        # if len(seq) is not a multiple of 3, pad="-" will append "-"
        # to complete the last triplet (which will be an invalid one)
        codon_counts = keymap(CAT, Counter(partition(3, seq, pad="-")))
    
        # separate encountered codons into valids and invalids
        codons_by_validity = groupby(codons.__contains__, codon_counts.keys())
        # get allows to provide a default value,
        # in case one of the categories is not present
        valids = codons_by_validity.get(True, [])
        invalids = codons_by_validity.get(False, [])
    
        # We display counts only for valid triplets
        for codon in sorted(valids):
            print("%s: %s" % (codon, codon_counts[codon]))
    
        # We issue a warning if there are invalid codons.
        if invalids:
            print("Warning! There are invalid triplets:")
            print(", ".join(sorted(invalids)))
    

    Hope this helps.