Tags: python, bioinformatics, biopython

How to deal with gaps during translation with biopython


I need to translate aligned DNA sequences with Biopython:

from Bio.Seq import Seq

seq = Seq("tt-aaaatg")
seq.translate()  # fails: the first codon 'tt-' contains a gap

Running this script raises an error:

Bio.Data.CodonTable.TranslationError: Codon 'TT-' is invalid.

Is there a way to translate 'tt-' as X, so that the whole translated sequence becomes 'XKM'?

This would be very useful for translating aligned sequences. For example, suppose a set of aligned sequences is stored in a pandas DataFrame named "df":

import pandas as pd

df = pd.DataFrame([['A',Seq("tt-aaaatg")],['B',Seq("tttaaaatg")],['C',Seq("tttaaaatg")]],columns=['seqName','seq'])

print(df)

The df is shown as:

seqName                seq
      A                 Seq("tt-aaaatg")
      B                 Seq("tttaaaatg")
      C                 Seq("tttaaaatg")

If 'tt-' could be translated as "X", then the code:

df['prot'] = pd.Series([x.translate() for x in df.seq])

would give:

  seqName                          seq           prot
0       A           (t, t, -, a, a, a, a, t, g)  (X, K, M)
1       B           (t, t, t, a, a, a, a, t, g)  (F, K, M)
2       C           (t, t, t, a, a, a, a, t, g)  (F, K, M)

However, the current Biopython cannot translate "tt-" as "X"; it just raises an error. It seems that I would have to remove all gaps from the aligned sequences, translate them, and then realign the translated protein sequences.

How do you deal with such a problem? Thank you in advance.


Solution

  • Note: this assumes that the gaps are real 1 bp deletions in an in-frame (frame 1) protein-coding alignment.

    To do this, you can use a custom translation function like this:

    def translate_dna(sequence):
        """
        :param sequence: (str) a DNA sequence string
        :return: (str) a protein string from the forward reading frame 1
        """

        codontable = {'ATA': 'I', 'ATC': 'I', 'ATT': 'I', 'ATG': 'M',
                      'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T',
                      'AAC': 'N', 'AAT': 'N', 'AAA': 'K', 'AAG': 'K',
                      'AGC': 'S', 'AGT': 'S', 'AGA': 'R', 'AGG': 'R',
                      'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L',
                      'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P',
                      'CAC': 'H', 'CAT': 'H', 'CAA': 'Q', 'CAG': 'Q',
                      'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R',
                      'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V',
                      'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A',
                      'GAC': 'D', 'GAT': 'D', 'GAA': 'E', 'GAG': 'E',
                      'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G',
                      'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S',
                      'TTC': 'F', 'TTT': 'F', 'TTA': 'L', 'TTG': 'L',
                      'TAC': 'Y', 'TAT': 'Y', 'TAA': '*', 'TAG': '*',
                      'TGC': 'C', 'TGT': 'C', 'TGA': '*', 'TGG': 'W',
                      '---': '-',  # a whole-codon gap stays a gap
                      }

        seq = sequence.upper()
        prot = []

        # Walk the sequence one codon (3 bases) at a time
        for n in range(0, len(seq), 3):
            if seq[n:n + 3] in codontable:
                residue = codontable[seq[n:n + 3]]
            else:
                # Partial gaps, degenerate bases, or a trailing incomplete codon
                residue = "X"

            prot.append(residue)

        return "".join(prot)
    

    The else branch causes any unrecognised codon (including codons with a partial gap or with degenerate bases) to be translated as an X. Just pass the sequence string into this function. If your sequences are stored as Seq objects, convert them to plain strings first by changing this line in the function (for SeqRecord objects, use str(sequence.seq) instead):

    seq = str(sequence).upper()
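As a quick check of this approach, here is a minimal, standard-library-only sketch that builds the same standard codon table more compactly and translates the sequences from the question. The names CODON_TABLE, translate_gapped and the extra sequence "D" are made up for illustration; they are not part of Biopython or of the original post.

```python
from itertools import product

# Standard genetic code. Codons are generated in the order TTT, TTC, TTA, TTG,
# TCT, ... (bases taken in T, C, A, G order), matching the amino acid string.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
CODON_TABLE["---"] = "-"  # a whole-codon gap stays a gap in the protein


def translate_gapped(sequence):
    """Translate reading frame 1; any codon not in the table becomes 'X'."""
    seq = str(sequence).upper()  # str() also accepts objects such as Bio.Seq.Seq
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq), 3)
    )


# Sequences A-C come from the question; D is an added example with a full-codon gap.
aligned = {
    "A": "tt-aaaatg",
    "B": "tttaaaatg",
    "C": "tttaaaatg",
    "D": "---aaaatg",
}
proteins = {name: translate_gapped(seq) for name, seq in aligned.items()}
print(proteins)  # {'A': 'XKM', 'B': 'FKM', 'C': 'FKM', 'D': '-KM'}
```

Because every codon maps to exactly one output character, the translated proteins stay aligned with each other and no realignment is needed. Note that a trailing incomplete codon is also reported as X, so it may be worth checking that the alignment length is a multiple of three first.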