Nucleotide (DNA) sequences are generally composed of four bases, A, T, G and C, which makes for a very simple and efficient encoding for machine learning purposes.
sequence = AAATGCC
ohe_sequence = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]]
But once you take into account RNA sequences and the ambiguities that can sometimes occur in sequencing machines, the letters UYRWSKMDVHBXN are added. One-hot encoding each of those as its own category gives vectors of length 17, of which the last 13 entries are almost always 0.
This is very inefficient, and it also loses the biological meaning that these extra (ambiguous) letters carry: Y, for example, stands for either C or T.
And so I have made a dictionary that represents this biological meaning:
nucleotide_dict = {
    'A': 'A', 'T': 'T', 'U': 'T', 'G': 'G', 'C': 'C',
    'Y': ['C', 'T'], 'R': ['A', 'G'], 'W': ['A', 'T'],
    'S': ['G', 'C'], 'K': ['T', 'G'], 'M': ['C', 'A'],
    'D': ['A', 'T', 'G'], 'V': ['A', 'G', 'C'], 'H': ['A', 'T', 'C'], 'B': ['T', 'G', 'C'],
    'X': ['A', 'T', 'G', 'C'], 'N': ['A', 'T', 'G', 'C'],
}
But I can't figure out how to write an efficient one-hot-encoding script (or whether there is a way to do this with the scikit-learn module) that uses this dictionary to get a result like this:
sequence = ANTUYCC
ohe_sequence = [[1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]]
# or even better:
ohe_sequence = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0.5, 0, 0.5], [0, 0, 0, 1], [0, 0, 0, 1]]
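For reference, the second (frequency) form can be produced with a small lookup table built once from a dictionary like the one above, using the same A, T, G, C column order as the examples; the `code_to_vec` and `encode` names below are just illustrative:

```python
import numpy as np

# IUPAC ambiguity codes mapped to the plain bases they can stand for.
iupac = {
    'A': 'A', 'T': 'T', 'U': 'T', 'G': 'G', 'C': 'C',
    'Y': 'CT', 'R': 'AG', 'W': 'AT', 'S': 'GC', 'K': 'TG', 'M': 'CA',
    'D': 'ATG', 'V': 'AGC', 'H': 'ATC', 'B': 'TGC', 'X': 'ATGC', 'N': 'ATGC',
}

order = 'ATGC'  # column order of the one-hot vectors

# Precompute one frequency vector per code, e.g. 'Y' -> [0, 0.5, 0, 0.5],
# so encoding a sequence is just a dictionary lookup per character.
code_to_vec = {
    code: np.array([1.0 if base in bases else 0.0 for base in order]) / len(bases)
    for code, bases in iupac.items()
}

def encode(seq):
    return np.stack([code_to_vec[c] for c in seq.upper()])

print(encode("ANTUYCC"))
```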
This was fun! You can do this with a dictionary that maps each letter directly to its row vector. I added a scikit-learn transformer class because you mentioned you were using that module; see the transform below:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# Column order here is A, G, C, T; each ambiguity code gets a 1 in every
# column it could represent, and '-' (a gap) maps to all zeros.
nucleotide_dict = {
    "A": [1, 0, 0, 0],
    "G": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
    "Y": [0, 0, 1, 1],
    "R": [1, 1, 0, 0],
    "W": [1, 0, 0, 1],
    "S": [0, 1, 1, 0],
    "K": [0, 1, 0, 1],
    "M": [1, 0, 1, 0],
    "D": [1, 1, 0, 1],
    "V": [1, 1, 1, 0],
    "H": [1, 0, 1, 1],
    "B": [0, 1, 1, 1],
    "X": [1, 1, 1, 1],
    "N": [1, 1, 1, 1],
    "-": [0, 0, 0, 0],
}
class NucleotideEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, norm=True):
        self.norm = norm

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data.
        return self

    def transform(self, X):
        def encode(row):
            # row is a length-1 slice holding one sequence string.
            mat = np.array([nucleotide_dict[base] for base in row[0]])
            if self.norm:
                # Divide each row by its sum so ambiguity codes become
                # equal probabilities, e.g. N -> [0.25, 0.25, 0.25, 0.25].
                mat = mat / mat.sum(axis=1)[:, np.newaxis]
            return mat

        return np.apply_along_axis(encode, 1, X)
samples = np.array([["AAATGCC"], ["ANTUYCC"]])
print(NucleotideEncoder().fit_transform(samples))
Output:
[[[1. 0. 0. 0. ]
[1. 0. 0. 0. ]
[1. 0. 0. 0. ]
[0. 0. 0. 1. ]
[0. 1. 0. 0. ]
[0. 0. 1. 0. ]
[0. 0. 1. 0. ]]
[[1. 0. 0. 0. ]
[0.25 0.25 0.25 0.25]
[0. 0. 0. 1. ]
[0. 0. 0. 1. ]
[0. 0. 0.5 0.5 ]
[0. 0. 1. 0. ]
[0. 0. 1. 0. ]]]
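One caveat with the `norm=True` option: the `-` gap vector is all zeros, so row-normalizing it divides by zero and produces NaNs. If your sequences can contain gaps, a guarded normalization (the `safe_normalize` helper name is just illustrative) could look like this:

```python
import numpy as np

def safe_normalize(rows):
    # Row-normalize, but leave all-zero rows (e.g. the '-' gap vector)
    # as zeros instead of producing NaNs from 0/0.
    rows = np.asarray(rows, dtype=float)
    sums = rows.sum(axis=1, keepdims=True)
    return np.divide(rows, sums, out=np.zeros_like(rows), where=sums != 0)

print(safe_normalize([[0, 1, 0, 1], [0, 0, 0, 0]]))
```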