
One-hot encoding with ambiguity (nucleotide sequences)


Nucleotide sequences (or DNA sequences) are generally composed of 4 bases, ATGC, which lends itself to a very nice, easy and efficient encoding for machine learning purposes.

sequence = AAATGCC
ohe_sequence = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]]

But when you take into account RNA sequences and the mistakes that can sometimes occur in sequencing machines, the letters UYRWSKMDVHBXN are added. One-hot encoding this full alphabet leaves you with vectors of length 17, of which the last 13 positions are almost always 0.

This is very inefficient and does not capture the biological meaning that these extra (ambiguous) letters carry.

For example:

  • T and U are interchangeable
  • Y represents there to be a C or T
  • N and X represent there to be any of the 4 bases (ATGC)

So I have made a dictionary that captures this biological meaning:

nucleotide_dict = {
    'A': 'A', 'T': 'T', 'U': 'T', 'G': 'G', 'C': 'C',
    'Y': ['C', 'T'], 'R': ['A', 'G'], 'W': ['A', 'T'],
    'S': ['G', 'C'], 'K': ['T', 'G'], 'M': ['C', 'A'],
    'D': ['A', 'T', 'G'], 'V': ['A', 'G', 'C'], 'H': ['A', 'T', 'C'],
    'B': ['T', 'G', 'C'], 'X': ['A', 'T', 'G', 'C'], 'N': ['A', 'T', 'G', 'C'],
}

But I can't seem to figure out how to write an efficient one-hot encoding script (or whether there is a way to do this with the scikit-learn module) that uses this dictionary to get a result like this:

sequence = ANTUYCC
ohe_sequence = [[1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]]

# or even better:
ohe_sequence = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0.5, 0, 0.5], [0, 0, 0, 1], [0, 0, 0, 1]]
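To make the target concrete, the desired mapping can be written as a naive reference sketch in plain Python (`one_hot` is a placeholder name I chose; the question is about an efficient or scikit-learn-based version of exactly this behavior, with the ATGC column order taken from the examples above):

```python
# Naive reference sketch: multi-hot encode via the question's
# nucleotide_dict, optionally normalizing each row to sum to 1.
bases = "ATGC"  # column order used in the examples above

nucleotide_dict = {
    'A': 'A', 'T': 'T', 'U': 'T', 'G': 'G', 'C': 'C',
    'Y': ['C', 'T'], 'R': ['A', 'G'], 'W': ['A', 'T'],
    'S': ['G', 'C'], 'K': ['T', 'G'], 'M': ['C', 'A'],
    'D': ['A', 'T', 'G'], 'V': ['A', 'G', 'C'], 'H': ['A', 'T', 'C'],
    'B': ['T', 'G', 'C'], 'X': ['A', 'T', 'G', 'C'], 'N': ['A', 'T', 'G', 'C'],
}

def one_hot(sequence, normalize=True):
    encoded = []
    for letter in sequence:
        options = nucleotide_dict[letter]  # a str or a list of candidate bases
        row = [1.0 if base in options else 0.0 for base in bases]
        if normalize:
            # Spread the probability mass evenly, e.g. N -> four 0.25s.
            total = sum(row)
            row = [value / total for value in row]
        encoded.append(row)
    return encoded

one_hot("ANTUYCC")
# N becomes [0.25, 0.25, 0.25, 0.25]; Y becomes [0, 0.5, 0, 0.5]
```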

This figure illustrates the nucleotide ambiguity


Solution

  • This was fun! You can do this with a dictionary whose values are the appropriate multi-hot rows. I wrapped it in a scikit-learn transformer class because you mentioned you were using that module. See the transform below:

    from sklearn.base import BaseEstimator, TransformerMixin
    import numpy as np
    
    nucleotide_dict = {
        "A": [1, 0, 0, 0],
        "G": [0, 1, 0, 0],
        "C": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
        "U": [0, 0, 0, 1],
        "Y": [0, 0, 1, 1],
        "R": [1, 1, 0, 0],
        "W": [1, 0, 0, 1],
        "S": [0, 1, 1, 0],
        "K": [0, 1, 0, 1],
        "M": [1, 0, 1, 0],
        "D": [1, 1, 0, 1],
        "V": [1, 1, 1, 0],
        "H": [1, 0, 1, 1],
        "B": [0, 1, 1, 1],
        "X": [1, 1, 1, 1],
        "N": [1, 1, 1, 1],
        "-": [0, 0, 0, 0],
    }
    
    
    class NucleotideEncoder(BaseEstimator, TransformerMixin):
        def __init__(self, norm=True):
            self.norm = norm

        def fit(self, X, y=None):
            # Stateless transformer: nothing to learn from the data.
            return self

        def transform(self, X, y=None):
            def encode(row):
                # Look up the 4-element ambiguity vector for each letter.
                codes = np.array([nucleotide_dict[base] for base in row[0]])
                if self.norm:
                    # Spread the probability mass evenly over the candidate
                    # bases, e.g. N -> [0.25, 0.25, 0.25, 0.25]. (Note: the
                    # all-zero gap code "-" would divide by zero here.)
                    codes = codes / codes.sum(axis=1, keepdims=True)
                return codes

            return np.apply_along_axis(encode, 1, X)
    
    
    samples = np.array([["AAATGCC"], ["ANTUYCC"]])
    print(NucleotideEncoder().fit_transform(samples))
    

    Output:

    [[[1.   0.   0.   0.  ]
      [1.   0.   0.   0.  ]
      [1.   0.   0.   0.  ]
      [0.   0.   0.   1.  ]
      [0.   1.   0.   0.  ]
      [0.   0.   1.   0.  ]
      [0.   0.   1.   0.  ]]
    
     [[1.   0.   0.   0.  ]
      [0.25 0.25 0.25 0.25]
      [0.   0.   0.   1.  ]
      [0.   0.   0.   1.  ]
      [0.   0.   0.5  0.5 ]
      [0.   0.   1.   0.  ]
      [0.   0.   1.   0.  ]]]
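One caveat: normalization divides each row by its sum, so the all-zero gap code `"-"` from the dictionary only makes sense without normalization. A sketch of how that gap symbol could be used to right-pad unequal-length sequences into one rectangular batch (`encode_padded` is my own name, not part of the answer above, and it reuses the answer's A, G, C, T column order):

```python
import numpy as np

# Same lookup table as the answer above (A, G, C, T column order).
nucleotide_dict = {
    "A": [1, 0, 0, 0], "G": [0, 1, 0, 0], "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1], "U": [0, 0, 0, 1], "Y": [0, 0, 1, 1],
    "R": [1, 1, 0, 0], "W": [1, 0, 0, 1], "S": [0, 1, 1, 0],
    "K": [0, 1, 0, 1], "M": [1, 0, 1, 0], "D": [1, 1, 0, 1],
    "V": [1, 1, 1, 0], "H": [1, 0, 1, 1], "B": [0, 1, 1, 1],
    "X": [1, 1, 1, 1], "N": [1, 1, 1, 1], "-": [0, 0, 0, 0],
}

def encode_padded(sequences):
    # Right-pad every sequence with the gap symbol "-" so the batch
    # becomes one rectangular (n_sequences, max_len, 4) array.
    # Skip normalization: gap rows sum to 0 and would divide by zero.
    width = max(len(seq) for seq in sequences)
    padded = [seq.ljust(width, "-") for seq in sequences]
    return np.array([[nucleotide_dict[base] for base in seq] for seq in padded])

batch = encode_padded(["AAT", "ANTUY"])
print(batch.shape)  # (2, 5, 4)
```

Downstream code can then mask out the all-zero padding rows if the model should ignore them.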