
Is there a way to translate a list of DNA sequences into amino acids if a DNA sequence's length is not divisible by three?


I have been struggling to turn a list of DNA sequences into amino acid sequences. The function I wrote should read each DNA sequence three nucleotides at a time: it loops over the sequences in the list and translates each one using a codon dictionary. I know this problem isn't exactly new and that Biopython has a translation module made for this kind of thing. The difficulty is that I later want to use a degenerate codon dictionary with an NNK codon scheme (K being G or T), and as far as my research went, I found no way to define custom codon dicts with Biopython. Also, the DNA sequences that I use aren't uniform in length.
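For reference, an NNK dictionary can be derived from the full codon table in plain Python. The following is only a minimal sketch, assuming the standard genetic code; the names `BASES`, `AMINO_ACIDS`, and `nnk_table` are illustrative and do not come from any library.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons enumerated
# in TCAG order (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Full NNN table: all 64 codons mapped to their amino acid.
nnn_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# NNK table: keep only codons whose third base is K (G or T).
nnk_table = {codon: aa for codon, aa in nnn_table.items() if codon[2] in "GT"}

print(len(nnn_table), len(nnk_table))
```

Restricting the third position to G or T leaves 32 codons, which still cover all 20 amino acids but only one stop codon (TAG).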

Now I think it's time to go into a little more depth and explain where my data, i.e. the list of DNA sequences, comes from. The sequences (ranging from a couple of thousand to more than 1 million) are random nucleotides between two markers, which I isolated with a function using a regex search and wrote to a text file. The structure of this file looks like this:

CACCAGAGTGAGAATAGAAA CCAAAAAAAAGGCTCCAAAAGGAGCCTTTAATTGTATC TAAACAGCTTGATACCGATAGTTGCGCCGACAATGACAACAACCATCGCCCACGCATAACCGATATATTC CCAAAAAAAAGGCTCCAAAAGGAGCCTTTAATTGTATC TAAACAGCTTGATACCGATAGTTGCGCCGACAATGACAACAACCATCGCCCACGCATAACCGATATATTC CCAAAAAAAAGGCTCCAAAAGGAGTCTTTAATTGTATC TAAACAGCTTGATACCGATAGTTGCGCCGACAATGACAACAACCATCGCCCACGCATAACCGATATATTC CCAAAAAAAGGCTCCAAAAGGAGCCTTTAATTGTATC TAAACAGCTTGATACCGATAGTTGCGCCGACAATGACAACAACCATCGCCCACGCATAACCGATATATTC CCAAAAAAAAGGCTCCAAAAGGAGCCTTTAATTGTATC TAAACAGCTTGATACCGATAGTTGCGCCGACAATGACAACAACCATCGCCCACGCATAACCGATATATTC CCAAAAAAAAGGCTCCAAAAGGAGCCTTTAATTGTATC TAAACAGCTTGATACCGATAGATGCGCCGACAATGACAACAACCATCGCCCACGCATAACCGATATATTC CAGCATTAGGAGCCGGCTGATGAGAGTGAGAATAGAAA CCAAAAAAAAGGCTCCAAAAGGAGCCTTTAATTGTATC TAAACAGCTTGATACCGATAGTTGTGCCGACAATGACAACAACCATCGCCCACGCATAACCGATATATTC

What I tried is to read in the file and get a list of all sequences as strings, stripping whitespace, newlines, and similar characters. Then I start a function in which the codon usage is defined and loop over the list of sequences, stepping through each sequence three letters at a time and translating each codon to the amino acid defined in the dict.

Code I got so far:

input_file = 'inserts.txt'
with open(input_file, 'r') as f:
    seq = f.readlines()

seq = [s.replace(" ", "").replace(",", "").replace("'", "").replace("\n", "") for s in seq]
print("\n".join(seq[:99]))
print("\nType lookup", type(seq))


# translation function and NNN codon table as a dict object
def translate(seq):
    nnn_table = {'TTT': 'F', 'TCT': 'S', 'TAT': 'Y', 'TGT': 'C', 'TTC': 'F', 'TCC': 'S', 'TAC': 'Y', 'TGC': 'C',
                 'TTA': 'L',
                 'TCA': 'S', 'TAA': '*', 'TGA': '*', 'TTG': 'L', 'TCG': 'S', 'TAG': '*', 'TGG': 'W', 'CTT': 'L',
                 'CCT': 'P',
                 'CAT': 'H', 'CGT': 'R', 'CTC': 'L', 'CCC': 'P', 'CAC': 'H', 'CGC': 'R', 'CTA': 'L', 'CCA': 'P',
                 'CAA': 'Q',
                 'CGA': 'R', 'CTG': 'L', 'CCG': 'P', 'CAG': 'Q', 'CGG': 'R', 'ATT': 'I', 'ACT': 'T', 'AAT': 'N',
                 'AGT': 'S',
                 'ATC': 'I', 'ACC': 'T', 'AAC': 'N', 'AGC': 'S', 'ATA': 'I', 'ACA': 'T', 'AAA': 'K', 'AGA': 'R',
                 'ATG': 'M',
                 'ACG': 'T', 'AAG': 'K', 'AGG': 'R', 'GTT': 'V', 'GCT': 'A', 'GAT': 'D', 'GGT': 'G', 'GTC': 'V',
                 'GCC': 'A',
                 'GAC': 'D', 'GGC': 'G', 'GTA': 'V', 'GCA': 'A', 'GAA': 'E', 'GGA': 'G', 'GTG': 'V', 'GCG': 'A',
                 'GAG': 'E',
                 'GGG': 'G'}
    # two loops, outer one to loop over the list of string sequences
    # inner one loops over each sequence
    nnn_aa_seq = []
    # generate amino acid sequence
    # add option for sequence or codon not divisible by three
    print("\nStarting to translate:")
    for dna in seq:
        protein_seq = ""
        for i in range(0, len(dna), 3):
            if len(dna) % 3 == 0:
                nnn_codon = nnn_table[dna[i:i + 3]]
                protein_seq += nnn_codon
            nnn_aa_seq.append(protein_seq)

    return "".join(nnn_aa_seq)


translate_nnn = translate(seq)
print(translate_nnn)
# do other stuff

Now my desired output would be a list with one amino acid sequence for each DNA sequence in the original text file.

What I get as "output" is this:

Starting to translate
**T*TA*TA**TA*Y*TA*YR*TA*YR**TA*YR*L*TA*YR*LR*TA*YR*LRR*TA*YR*LRRQ*TA*YR*LRRQ**TA*YR*LRRQ*Q*TA*YR*LRRQ*QQ*TA*YR*LRRQ*QQP*TA*YR*LRRQ*QQPS*TA*YR*LRRQ*QQPSP*TA*YR*LRRQ*QQPSPT*TA*YR*LRRQ*QQPSPTH*TA*YR*LRRQ*QQPSPTHN*TA*YR*

My guess is that the problem comes from some sequences not being divisible by three. For those sequences I think it would be best to remove the overhang or replace it with a placeholder. What do you guys think?
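To make the two options concrete, here is a minimal sketch of both. The helper `split_codons` and its `placeholder` parameter are hypothetical names used only for illustration; they are not part of the code above.

```python
def split_codons(dna, placeholder=None):
    """Split a DNA string into codons, handling a 1- or 2-base overhang.

    With placeholder=None the overhang is dropped. Otherwise the
    overhang is padded with the placeholder character (e.g. 'N')
    until it forms a full three-letter codon.
    """
    overhang = len(dna) % 3
    if overhang and placeholder is not None:
        dna += placeholder * (3 - overhang)  # pad up to a multiple of three
    elif overhang:
        dna = dna[:-overhang]                # trim the trailing bases
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


print(split_codons("ATGAAAT"))       # overhang removed
print(split_codons("ATGAAAT", "N"))  # overhang padded with N
```

A padded codon such as `TNN` will not be a key in the codon dict, so the lookup would then need a fallback, for example `nnn_table.get(codon, 'X')`, instead of `nnn_table[codon]`.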

Edit:

All right, I forgot to actually print the result, and it looks nothing like I thought it would. It is one indistinguishable line of amino acids, not a list with one amino acid sequence per DNA sequence. Anyhow, my problem still exists. Help and any criticism are welcome!


Solution

  • You are doing

    for dna in seq:
        protein_seq = ""
        for i in range(0, len(dna), 3):
            if len(dna) % 3 == 0:
                nnn_codon = nnn_table[dna[i:i + 3]]
                protein_seq += nnn_codon
            nnn_aa_seq.append(protein_seq)
    

    which means you are checking whether len(dna) is divisible by 3 on every iteration of the inner loop, which is unnecessary. The length of dna is constant within each run of the outer for loop, so you can check it once before starting the inner for loop and also report clearly when the check fails, like so

    for dna in seq:
        protein_seq = ""
        if len(dna) % 3 != 0:
            print('DNA length not divisible by 3')
            continue  # go to next element of seq
        for i in range(0, len(dna), 3):
            nnn_codon = nnn_table[dna[i:i + 3]]
            protein_seq += nnn_codon
        nnn_aa_seq.append(protein_seq)  # append once per sequence, after the inner loop
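As a self-contained illustration of the same check, the sketch below wraps the loop in a function that returns a list with one protein string per translated sequence, as the question asks. It is a sketch under assumptions, not a definitive implementation: `translate_all`, the `codon_table` parameter, the `'X'` fallback for unknown codons, and the toy `demo_table` are all hypothetical and chosen only to keep the example short.

```python
def translate_all(sequences, codon_table):
    """Translate each DNA string whose length is a multiple of three.

    Returns a list with one amino acid string per translated sequence.
    Sequences with an overhang are reported and skipped.
    """
    proteins = []
    for dna in sequences:
        if len(dna) % 3 != 0:
            print("DNA length not divisible by 3, skipping:", dna)
            continue  # go to the next sequence
        protein_seq = ""
        for i in range(0, len(dna), 3):
            # Assumption: codons missing from the table become 'X'.
            protein_seq += codon_table.get(dna[i:i + 3], "X")
        # Append the finished protein once per sequence.
        proteins.append(protein_seq)
    return proteins


# Toy table for illustration only; a real run would pass the full dict.
demo_table = {"ATG": "M", "AAA": "K", "GGC": "G", "TAA": "*"}
result = translate_all(["ATGAAATAA", "ATGGG", "GGCAAA"], demo_table)
print(result)
```

Because the function returns the list itself rather than `"".join(...)`, each DNA sequence keeps its own entry in the result.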