r bioinformatics dna-sequence

How to find specific frequency of a codon?


I am trying to write a function in R that calculates the frequency of each codon. Methionine is encoded by only one codon, ATG, so its proportion in any sequence is 1. Glycine, on the other hand, can be encoded by GGT, GGC, GGA, or GGG, so the proportion for each of those codons is 0.25. The input would be a DNA sequence such as ATGGGTGGCGGAGGG, and with the help of a codon table the function should calculate the proportion for each codon that occurs in the input.

Please help me by suggesting ways to write this function.

For example, if my argument is ATGTGTTGCTGG, then my result would be:

ATG=1
TGT=0.5
TGC=0.5
TGG=1

Data for R:

codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T", 
    ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K", 
    AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L", 
    CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P", 
    CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R", 
    CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V", 
    GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D", 
    GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G", 
    GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F", 
    TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop", 
    TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")

Solution

  • First, I get my lookup list and sequence.

    codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T", 
                  ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K", 
                  AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L", 
                  CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P", 
                  CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R", 
                  CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V", 
                  GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D", 
                  GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G", 
                  GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F", 
                  TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop", 
                  TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")
    
    MySeq <- "ATGTGTTGCTGG"
    

    Next, I load the stringi library and break the sequence into chunks of three characters.

    # Load library
    library(stringi)
    
    # Break into 3 bases
    seq_split <- stri_sub(MySeq, seq(1, stri_length(MySeq), by=3), length=3)
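
If the input length is not a multiple of three, the last chunk is shorter than a codon and has no entry in the lookup list. A minimal base R sketch of the same split with a length check is shown here; `split_codons` is a hypothetical helper name, not part of the original answer or of stringi.

```r
# Hypothetical helper (illustrative name): split a DNA string into codons
# using base R only, refusing input that cannot be cut into whole codons.
split_codons <- function(dna) {
  dna <- toupper(dna)
  n <- nchar(dna)
  if (n == 0 || n %% 3 != 0) {
    stop("sequence length must be a positive multiple of 3")
  }
  starts <- seq(1, n, by = 3)          # 1, 4, 7, ...
  substring(dna, starts, starts + 2)   # one three-base chunk per start
}

split_codons("ATGTGTTGCTGG")
# "ATG" "TGT" "TGC" "TGG"
```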
    

    Then, I use table to count how often each amino acid letter occurs among the codons in the sequence.

    # Get associated letters
    letter_count <- table(unlist(codon[seq_split]))
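
Note that this count is taken over the input sequence, not over the codon table. If what is wanted is instead 1 divided by the number of codons that encode each amino acid (the Glycine = 0.25 reading in the question), a sketch along the following lines might work. The reduced lookup `codon_small` and the names `n_synonymous` and `table_fraction` are assumptions made for illustration; in practice the full `codon` list would be used.

```r
# Reduced lookup for illustration only; swap in the full `codon` list.
codon_small <- list(ATG = "M", TGT = "C", TGC = "C", TGG = "W",
                    GGA = "G", GGC = "G", GGG = "G", GGT = "G")

aa <- unlist(codon_small)       # named character vector: codon -> letter
n_synonymous <- table(aa)       # number of codons per letter in the lookup

# as.vector() drops the table class so the result is a plain numeric vector
table_fraction <- 1 / as.vector(n_synonymous[unname(aa)])
names(table_fraction) <- names(aa)

table_fraction[c("ATG", "GGT", "TGT")]
# ATG = 1, GGT = 0.25, TGT = 0.5
```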
    

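    Finally, I bind the codons together with the reciprocal of each letter's count. Because the reciprocal is still a table, data.frame expands it into two columns (the letter and the value), so the result has three columns to rename.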

    # Bind into a data frame
    res <- data.frame(seq_split,
                      1/letter_count[match(unlist(codon[seq_split]), names(letter_count))])
    
    # Rename columns
    colnames(res) <- c("Sequence", "Letter", "Percentage")
    
    #  Sequence Letter Percentage
    #1      ATG      M        1.0
    #2      TGT      C        0.5
    #3      TGC      C        0.5
    #4      TGG      W        1.0
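
Since the question asks for a function, the steps above could be wrapped roughly as follows. This is a sketch under assumptions, not the original answer's code: `codon_fraction` and `lookup_small` are hypothetical names, and the calculation is changed slightly. It divides each codon's count by its amino acid's count, so a codon that occurs more than once gets a single row instead of one row per occurrence. For the example sequence the result is the same as above.

```r
# Reduced lookup for illustration only; pass the full `codon` list in practice.
lookup_small <- list(ATG = "M", TGT = "C", TGC = "C", TGG = "W")

# Hypothetical wrapper: proportion of each codon among the codons in the
# input that encode the same amino acid.
codon_fraction <- function(dna, lookup) {
  starts <- seq(1, nchar(dna), by = 3)
  codons <- substring(dna, starts, starts + 2)
  aa <- unlist(lookup[codons], use.names = FALSE)
  # Unknown or partial codons come back as NULL and are dropped by unlist()
  if (length(aa) != length(codons)) {
    stop("input contains chunks that are not in the lookup list")
  }
  codon_n <- table(codons)       # occurrences of each codon
  aa_n <- table(aa)              # occurrences of each amino acid letter
  uniq <- !duplicated(codons)    # keep one row per distinct codon
  data.frame(
    Sequence = codons[uniq],
    Letter = aa[uniq],
    Percentage = as.vector(codon_n[codons[uniq]]) / as.vector(aa_n[aa[uniq]]),
    stringsAsFactors = FALSE
  )
}

codon_fraction("ATGTGTTGCTGG", lookup_small)
```

With a repeated codon, for instance `codon_fraction("ATGTGTTGTTGCTGG", lookup_small)`, TGT gets 2/3 and TGC gets 1/3.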