Tags: javascript, dna-sequence, genetics

Converting nucleotides to amino acids using JavaScript


I'm creating a Chrome Extension that converts a string of nucleotides of length nlen into the corresponding amino acids.

I've done something similar before in Python, but as I'm still very new to JavaScript I'm struggling to translate the same logic. The code I have so far is below:

function translateInput(n_seq) {
  // code to translate goes here

  // length of input nucleotide sequence
  var nlen = n_seq.length

  // declare initially empty amino acids string
  var aa_seq = ""

  // iterate over each chunk of three characters/nucleotides
  // to match it with the correct codon
  for (var i = 0; i < nlen; i++) {




      aa_seq.concat(codon)
  }

  // return final string of amino acids   
  return aa_seq
}

I know that I want to iterate over characters three at a time, match them to the correct amino acid, and then continuously concatenate that amino acid to the output string of amino acids (aa_seq), returning that string once the loop is complete.

I also tried creating a dictionary of the codon to amino acid relationships and was wondering if there was a way to use something like that as a tool to match the three character codons to their respective amino acids:

codon_dictionary = { 
 "A": ["GCA","GCC","GCG","GCT"], 
 "C": ["TGC","TGT"], 
 "D": ["GAC", "GAT"],
 "E": ["GAA","GAG"],
 "F": ["TTC","TTT"],
 "G": ["GGA","GGC","GGG","GGT"],
 "H": ["CAC","CAT"],
 "I": ["ATA","ATC","ATT"],
 "K": ["AAA","AAG"],
 "L": ["CTA","CTC","CTG","CTT","TTA","TTG"],
 "M": ["ATG"],
 "N": ["AAC","AAT"],
 "P": ["CCA","CCC","CCG","CCT"],
 "Q": ["CAA","CAG"],
 "R": ["AGA","AGG","CGA","CGC","CGG","CGT"],
 "S": ["AGC","AGT","TCA","TCC","TCG","TCT"],
 "T": ["ACA","ACC","ACG","ACT"],
 "V": ["GTA","GTC","GTG","GTT"],
 "W": ["TGG"],
 "Y": ["TAC","TAT"],
};

EDIT: An example of an input string of nucleotides would be "AAGCATAGAAATCGAGGG", with the corresponding output string "KHRNRG". Hope this helps!


Solution

  • Opinion

    The first thing I would personally recommend is to build a dictionary that maps each 3-character codon to its amino acid. This lets your program convert any number of codon strings to amino acid strings without searching through every array in the original dictionary on each lookup. The dictionary will work something like this:

    codonDict['GCA'] // 'A'
    codonDict['TGC'] // 'C'
    // etc
    

    From there, I implemented two utility functions: slide and slideStr. These aren't particularly important, so I'll just cover them with a couple of examples of input and output.

    slide (2,1) ([1,2,3,4])
    // [[1,2], [2,3], [3,4]]
    
    slide (2,2) ([1,2,3,4])
    // [[1,2], [3,4]]
    
    slideStr (2,1) ('abcd')
    // ['ab', 'bc', 'cd']
    
    slideStr (2,2) ('abcd')
    // ['ab', 'cd']
    

    With the reverse dictionary and generic utility functions at our disposal, writing codon2amino is a breeze

    // codon2amino :: String -> String
    const codon2amino = str =>
      slideStr(3,3)(str)
        .map(c => codonDict[c])
        .join('')
    

    Runnable demo

    To clarify, we build codonDict based on aminoDict once, and re-use it for every codon-to-amino computation.

    // your original data renamed to aminoDict
    const aminoDict = { 'A': ['GCA','GCC','GCG','GCT'], 'C': ['TGC','TGT'], 'D': ['GAC', 'GAT'], 'E': ['GAA','GAG'], 'F': ['TTC','TTT'], 'G': ['GGA','GGC','GGG','GGT'], 'H': ['CAC','CAT'], 'I': ['ATA','ATC','ATT'], 'K': ['AAA','AAG'], 'L': ['CTA','CTC','CTG','CTT','TTA','TTG'], 'M': ['ATG'], 'N': ['AAC','AAT'], 'P': ['CCA','CCC','CCG','CCT'], 'Q': ['CAA','CAG'], 'R': ['AGA','AGG','CGA','CGC','CGG','CGT'], 'S': ['AGC','AGT','TCA','TCC','TCG','TCT'], 'T': ['ACA','ACC','ACG','ACT'], 'V': ['GTA','GTC','GTG','GTT'], 'W': ['TGG'], 'Y': ['TAC','TAT'] };
    
    // codon dictionary derived from aminoDict
    const codonDict =
     Object.keys(aminoDict).reduce((dict, a) =>
       Object.assign(dict, ...aminoDict[a].map(c => ({[c]: a}))), {})
    
    // slide :: (Int, Int) -> [a] -> [[a]]
    const slide = (n,m) => xs => {
      if (n > xs.length)
        return []
      else
        return [xs.slice(0,n), ...slide(n,m) (xs.slice(m))]
    }
    
    // slideStr :: (Int, Int) -> String -> [String]
    const slideStr = (n,m) => str =>
      slide(n,m) (Array.from(str)) .map(s => s.join(''))
    
    // codon2amino :: String -> String
    const codon2amino = str =>
      slideStr(3,3)(str)
        .map(c => codonDict[c])
        .join('')
    
    console.log(codon2amino('AAGCATAGAAATCGAGGG'))
    // KHRNRG
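
    One caveat about the demo above: the asker's table has no entries for the stop codons (TAA, TAG and TGA in the standard genetic code), so any codon missing from codonDict maps to undefined, and .join('') silently renders undefined as an empty string. The following is a minimal sketch of one way to make such gaps visible. The helper names toCodons and translateWithFallback, the abbreviated codonTable, the '*' marker for stops and the 'X' placeholder are all assumptions for illustration, not part of the original answer.

```javascript
// Abbreviated table for illustration only; a full table would list all 64 codons.
const codonTable = {
  AAG: 'K', CAT: 'H', AGA: 'R', AAT: 'N', CGA: 'R', GGG: 'G',
  // Stop codons of the standard genetic code, marked '*' here (an assumed convention).
  TAA: '*', TAG: '*', TGA: '*'
}

// Split a sequence into complete, non-overlapping codons; trailing bases are dropped.
const toCodons = seq =>
  Array.from({ length: Math.floor(seq.length / 3) }, (_, i) => seq.slice(3 * i, 3 * i + 3))

// translateWithFallback :: (Object, String) -> String -> String
// Hypothetical helper: codons missing from the table become `unknown`.
const translateWithFallback = (table, unknown = 'X') => seq =>
  toCodons(seq)
    .map(c => Object.prototype.hasOwnProperty.call(table, c) ? table[c] : unknown)
    .join('')

console.log(translateWithFallback(codonTable)('AAGCATAGAAATCGAGGG')) // KHRNRG
console.log(translateWithFallback(codonTable)('AAGTAANNN'))          // K*X
```

    Whether an unknown codon should become a placeholder, be skipped, or raise an error depends on what the extension is meant to show, so treat the placeholder as one option among several.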


    Further explanation

    can you clarify what some of these variables are supposed to represent? (n, m, xs, c, etc)

    Our slide function gives us a sliding window over an array. It expects two parameters for the window (n, the window size, and m, the step size) and one parameter that is the array of items to iterate through: xs, which can be read as x's, or plural x, as in a collection of x items.

    slide is purposefully generic in that it does not care what the elements of xs are. As written it relies only on xs.length and xs.slice, so it works with an Array or a String directly; any other iterable (anything that implements Symbol.iterator) can be converted first with Array.from, which is what slideStr does. That's also why we use a generic name like xs: naming it something specific pigeonholes us into thinking it can only work with a specific type.
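
    To make that concrete, here is a small check of which inputs slide accepts as written. It repeats the slide definition from the demo so the snippet runs on its own; the Set example is an assumption added for illustration. Because slide only touches xs.length and xs.slice, arrays and strings work directly, while other iterables need Array.from first.

```javascript
// slide :: (Int, Int) -> [a] -> [[a]]  (same definition as in the demo)
const slide = (n, m) => xs => {
  if (n > xs.length)
    return []
  else
    return [xs.slice(0, n), ...slide(n, m)(xs.slice(m))]
}

// Arrays and strings both provide .length and .slice, so both work directly.
console.log(slide(2, 2)([1, 2, 3, 4, 5])) // [ [ 1, 2 ], [ 3, 4 ] ]
console.log(slide(2, 1)('abcd'))          // [ 'ab', 'bc', 'cd' ]

// A Set is iterable but has no .slice, so convert it with Array.from first.
const fromSet = slide(2, 1)(Array.from(new Set([1, 2, 3])))
console.log(fromSet)                      // [ [ 1, 2 ], [ 2, 3 ] ]
```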

    Other things like the variable c in .map(c => codonDict[c]) are not particularly important – I named it c for codon, but we could've named it x or foo, it doesn't matter. The "trick" to understanding c is to understand .map.

    [1,2,3,4,5].map(c => f(c))
    // [f(1), f(2), f(3), f(4), f(5)]
    

    So really all we're doing here is taking an array ([1,2,3,4,5]) and making a new array where we call f for each element in the original array.

    Now when we look at .map(c => codonDict[c]) we understand that all we're doing is looking up c in codonDict for each element.
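
    As a concrete sketch of the same idea with made-up values (double and tinyDict are hypothetical names used only for this illustration):

```javascript
// f can be any function of one argument; here it doubles its input.
const double = x => x * 2
console.log([1, 2, 3, 4, 5].map(c => double(c))) // [ 2, 4, 6, 8, 10 ]

// Looking up each element in an object works the same way.
const tinyDict = { AAG: 'K', CAT: 'H' }
const looked = ['AAG', 'CAT', 'AAG'].map(c => tinyDict[c])
console.log(looked)          // [ 'K', 'H', 'K' ]
console.log(looked.join('')) // KHK
```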

    const codon2amino = str =>
      slideStr(3,3)(str)          // [ 'AAG', 'CAT', 'AGA', 'AAT', ...]
        .map(c => codonDict[c])   // [ codonDict['AAG'], codonDict['CAT'], codonDict['AGA'], codonDict['AAT'], ...]
        .join('')                 // 'KHRN...'
    

    Also, are these 'const' items able to essentially replace my original translateInput() function?

    If you're not familiar with ES6 (ES2015), some of the syntax used above might seem foreign to you.

    // foo using traditional function syntax
    function foo (x) { return x + 1 }
    
    // foo as an arrow function
    const foo = x => x + 1
    

    So in short, yes, codon2amino is the exact replacement for your translateInput, just defined using a const binding and an arrow function. I chose codon2amino as a name because it better describes the operation of the function – translateInput doesn't say which way it's translating (A to B, or B to A?), and "input" is sort of a senseless descriptor here because all functions can take input.
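
    The definitions above also return functions from functions, which is why slide is called as slide (2,1) ([1,2,3,4]). Here is a minimal sketch of how that curried arrow form corresponds to traditional syntax, using a hypothetical add3 function that is not part of the answer's code:

```javascript
// Curried arrow function, in the same style as slide and slideStr
const add3 = (a, b) => c => a + b + c

// The equivalent written with traditional function syntax
function add3Traditional (a, b) {
  return function (c) {
    return a + b + c
  }
}

console.log(add3(1, 2)(3))            // 6
console.log(add3Traditional(1, 2)(3)) // 6

// Supplying only the first argument list gives back a reusable function
const addToThree = add3(1, 2)
console.log(addToThree(10))           // 13
```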

    The reason you're seeing other const declarations is because we're splitting up the work of your function into multiple functions. The reasons for this are mostly beyond the scope of this answer, but the brief explanation is that one specialized function that takes on the responsibility of several tasks is less useful to us than multiple generic functions that can be combined/re-used in sensible ways.

    Sure, codon2amino needs to look at each 3-letter sequence in the input string, but that doesn't mean we have to write the string-splitting code inside the codon2amino function. We can write a generic string-splitting function like slideStr, which is useful to any function that wants to iterate through string sequences, and then have codon2amino use it. If we encapsulated all of that string-splitting code inside codon2amino, the next time we needed to iterate through string sequences we'd have to duplicate that portion of the code.
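
    As a rough illustration of that re-use, the sketch below builds two different helpers from the same slideStr. slide and slideStr are repeated (in condensed form) so the snippet runs on its own; the names codons and kmers are hypothetical and not part of the answer's code.

```javascript
// slide and slideStr repeated from the demo so this snippet runs on its own
const slide = (n, m) => xs =>
  n > xs.length ? [] : [xs.slice(0, n), ...slide(n, m)(xs.slice(m))]

const slideStr = (n, m) => str =>
  slide(n, m)(Array.from(str)).map(s => s.join(''))

// Hypothetical helpers built from the same generic pieces
const codons = slideStr(3, 3)       // non-overlapping triplets
const kmers = k => slideStr(k, 1)   // overlapping windows of length k

console.log(codons('AAGCATAGA')) // [ 'AAG', 'CAT', 'AGA' ]
console.log(kmers(2)('AAGC'))    // [ 'AA', 'AG', 'GC' ]
```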


    All that said..

    Is there any way I can do this while keeping my original for loop structure?

    I really think you should spend some time stepping through the code above to see how it works. There are a lot of valuable lessons to learn there if you haven't yet seen program concerns separated in this way.

    Of course, that's not the only way to solve your problem. We can use a primitive for loop. For me it's more mental overhead to think about creating an iterator i, manually incrementing it with i++ or i += 3, making sure to check i < str.length, reassigning the return value with result += something, and so on. Add a couple more variables and your brain quickly turns to soup.

    function makeCodonDict (aminoDict) {
      let result = {}
      for (let k of Object.keys(aminoDict))
        for (let a of aminoDict[k])
          result[a] = k
      return result
    }
    
    function translateInput (dict, str) {
      let result = ''
      for (let i = 0; i < str.length; i += 3)
        result += dict[str.slice(i, i + 3)]
      return result
    }
    
    const aminoDict = { 'A': ['GCA','GCC','GCG','GCT'], 'C': ['TGC','TGT'], 'D': ['GAC', 'GAT'], 'E': ['GAA','GAG'], 'F': ['TTC','TTT'], 'G': ['GGA','GGC','GGG','GGT'], 'H': ['CAC','CAT'], 'I': ['ATA','ATC','ATT'], 'K': ['AAA','AAG'], 'L': ['CTA','CTC','CTG','CTT','TTA','TTG'], 'M': ['ATG'], 'N': ['AAC','AAT'], 'P': ['CCA','CCC','CCG','CCT'], 'Q': ['CAA','CAG'], 'R': ['AGA','AGG','CGA','CGC','CGG','CGT'], 'S': ['AGC','AGT','TCA','TCC','TCG','TCT'], 'T': ['ACA','ACC','ACG','ACT'], 'V': ['GTA','GTC','GTG','GTT'], 'W': ['TGG'], 'Y': ['TAC','TAT'] };
    const codonDict = makeCodonDict(aminoDict)
    
    const codons = 'AAGCATAGAAATCGAGGG'
    const aminos = translateInput(codonDict, codons)
    console.log(aminos) // KHRNRG
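
    One thing to watch with the loop version: if the input length is not a multiple of three, the last lookup uses a partial codon, the result is undefined, and += appends the literal text 'undefined'. A minimal sketch of a guarded variant follows; translateComplete, the abbreviated dict, and the choice to throw on unknown codons are assumptions for illustration.

```javascript
// Abbreviated codon table for illustration only.
const dict = { AAG: 'K', CAT: 'H', AGA: 'R' }

function translateComplete (dict, str) {
  let result = ''
  // i + 3 <= str.length stops before a trailing partial codon
  for (let i = 0; i + 3 <= str.length; i += 3) {
    const codon = str.slice(i, i + 3)
    const amino = dict[codon]
    if (amino === undefined)
      throw new Error('Unknown codon: ' + codon)
    result += amino
  }
  return result
}

console.log(translateComplete(dict, 'AAGCATAG')) // KH
```

    Throwing is only one possible policy; a placeholder character or skipping the codon may suit the extension better.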