Search code examples
rclassrandomfrequency-table

Write an R function that takes an amino acid sequence and stochastically returns a codon string


Using R, I'd like to write a "reverse translation" function that takes as input a string of amino acids and for each one returns a triplet codon that encodes that amino acid. The codon would be randomly chosen from a frequency table that I will initially hard-code the values for, but I would also like to eventually be able to adjust the frequencies so that "rarer" codons will be more likely to be chosen.

So far I have tried to define a "codon" class using the 61 codons that encode an amino acid in the standard genetic code.

# Define codon class
codon <- function(sequence, aminoacid, frequency) {
  new_codon <- list(sequence = sequence, aminoacid = aminoacid, frequency = frequency)
  class(new_codon) <- "codon"
  new_codon
}
# Define the 61 codon frequency table
# structure: nucleotide sequence, amino acid it encodes, frequency codon is used
codons <- list(
  # Alanine (Ala) codons
  codon("GCT", "A", 0.26),
  codon("GCC", "A", 0.40),
  codon("GCA", "A", 0.23),
  codon("GCG", "A", 0.11),
  # Cysteine (Cys) codons
  codon("TGT", "C", 0.45),
  codon("TGC", "C", 0.55),
  # Aspartic acid (Asp) codons
  codon("GAT", "D", 0.46),
  codon("GAC", "D", 0.54),
  # Glutamic acid (Glu) codons
  codon("GAA", "E", 0.42),
  codon("GAG", "E", 0.58),
  # Phenylalanine (Phe) codons
  codon("TTT", "F", 0.45),
  codon("TTC", "F", 0.55),
  # Glycine (Gly) codons
  codon("GGT", "G", 0.16),
  codon("GGC", "G", 0.34),
  codon("GGA", "G", 0.25),
  codon("GGG", "G", 0.25),
  # Histidine (His) codons
  codon("CAT", "H", 0.41),
  codon("CAC", "H", 0.59),
  # Isoleucine (Ile) codons
  codon("ATT", "I", 0.36),
  codon("ATC", "I", 0.48),
  codon("ATA", "I", 0.16),
  # Lysine (Lys) codons
  codon("AAA", "K", 0.42),
  codon("AAG", "K", 0.58),
  # Leucine (Leu) codons
  codon("TTA", "L", 0.07),
  codon("TTG", "L", 0.13),
  codon("CTT", "L", 0.13),
  codon("CTC", "L", 0.20),
  codon("CTA", "L", 0.07),
  codon("CTG", "L", 0.41),
  # Methionine (Met) codon
  codon("ATG", "M", 1.0),
  # Asparagine (Asn) codons
  codon("AAT", "N", 0.46),
  codon("AAC", "N", 0.54),
  # Proline (Pro) codons
  codon("CCT", "P", 0.28),
  codon("CCC", "P", 0.33),
  codon("CCA", "P", 0.27),
  codon("CCG", "P", 0.11),
  # Glutamine (Gln) codons
  codon("CAA", "Q", 0.25),
  codon("CAG", "Q", 0.75),
  # Arginine (Arg) codons
  codon("CGT", "R", 0.08),
  codon("CGC", "R", 0.19),
  codon("CGA", "R", 0.11),
  codon("CGG", "R", 0.21),
  codon("AGA", "R", 0.20),
  codon("AGG", "R", 0.20),
  # Serine (Ser) codons
  codon("TCT", "S", 0.18),
  codon("TCC", "S", 0.22),
  codon("TCA", "S", 0.15),
  codon("TCG", "S", 0.06),
  codon("AGT", "S", 0.15),
  codon("AGC", "S", 0.24),
  # Threonine (Thr) codons
  codon("ACT", "T", 0.24),
  codon("ACC", "T", 0.36),
  codon("ACA", "T", 0.28),
  codon("ACG", "T", 0.12),
  # Valine (Val) codons
  codon("GTT", "V", 0.18),
  codon("GTC", "V", 0.24),
  codon("GTA", "V", 0.11),
  codon("GTG", "V", 0.47),
  # Tryptophan (Trp) codon
  codon("TGG", "W", 1.0),
  # Tyrosine (Tyr) codons
  codon("TAT", "Y", 0.43),
  codon("TAC", "Y", 0.57)
)

The function I tried below:

# Define function to generate new sequence
random_sequence <- function(protein_sequence) {
  # Initialize empty list to store new sequence
  new_sequence <- list()
  # Iterate over each amino acid in the protein sequence
  for (aa in protein_sequence) {
    # Get all codons that encode the current amino acid
    aa_codons <- codons[sapply(codons, function(x) x$aminoacid == aa)]
    # Get the frequencies of the codons
    aa_freqs <- sapply(aa_codons, function(x) x$frequency)
    # Normalize the frequencies to sum to 1
    aa_freqs <- aa_freqs / sum(aa_freqs)
    # Select a random codon based on the frequencies
    new_codon <- aa_codons[sample(1:length(aa_codons), 1, prob = aa_freqs)]
    # Add the new codon to the new sequence
    new_sequence <- c(new_sequence, new_codon[[1]][1]$sequence)
  }
  # Return the new sequence
  new_sequence <- paste(new_sequence, collapse = "")
  return(new_sequence)
}

# Example usage
protein_sequence <- "MAVVDLKECEILHTWVFPKKKHGEARNDDCQQGKPST"
random_sequence(protein_sequence)

When I run each line of this function individually I get the results I expect. Also, if I set the protein_sequence to be one letter long, it also works. However, when I try to run the function as written, I get the error message:

Error in sum(aa_freqs) : invalid 'type' (list) of argument

I tried to fix this by removing the normalize frequencies to sum to 1 safety check (# aa_freqs <- aa_freqs / sum(aa_freqs). But now I get a different error:

Error in sample.int(length(x), size, replace, prob) : incorrect number of probabilities

I am not sure how to troubleshoot the bug given that when I run through the function line by line with a one letter test amino acid, it all seems to work fine.

Any assistance or insight would be greatly appreciated! Thank you very much for your help!


Solution

  • It looks like your function is trying to parse the whole amino acid sequence at once, rather than letter-by-letter. You can use strsplit or equivalent to split the passed string into component parts and then more or less do as you have indicated:

    ReverseTranscribe <- function(protein_sequence) {
      paste(sapply(unlist(strsplit(protein_sequence, "")), function(cur_amino) {
        # Extract possible codons
        potential_matches <- codons[sapply(codons, function(x) {x$aminoacid == cur_amino})]
        # Convert to a weights table
        weights_table <- data.frame(matrix(sapply(potential_matches, unlist), ncol = 3, byrow = TRUE))
        # Weighted sample
        chosen_codon <- sample(weights_table[[1]], 1, prob = weights_table[[3]])
        return(chosen_codon)
      }), collapse = "-")
    }
    
    example <- "MAVVDLKECEILHTWVFPKKKHGEARNDDCQQGKPST"
    
    > for (i in 1:3) {
    +   print(ReverseTranscribe(example))
    + }
    [1] "ATG-GCT-GTC-GTG-GAT-CTC-AAG-GAA-TGT-GAG-ATC-CTG-CAT-ACC-TGG-GTT-TTC-CCC-AAG-AAA-AAG-CAC-GGT-GAG-GCT-CGG-AAT-GAC-GAT-TGC-CAG-CAG-GGC-AAG-CCT-AGC-ACC"
    [1] "ATG-GCC-GTG-GTG-GAT-CTG-AAG-GAG-TGC-GAG-ATC-CTG-CAC-ACC-TGG-GTC-TTT-CCT-AAG-AAG-AAA-CAT-GGC-GAG-GCC-CGT-AAT-GAC-GAC-TGC-CAA-CAA-GGG-AAG-CCA-TCA-ACA"
    [1] "ATG-GCC-GTG-GTG-GAT-TTG-AAA-GAG-TGC-GAG-ATC-CTG-CAC-ACC-TGG-GTT-TTC-CCC-AAG-AAG-AAG-CAC-GGA-GAG-GCT-CGC-AAC-GAT-GAT-TGT-CAG-CAG-GGC-AAG-CCC-TCT-ACC"