Nucleotide (DNA) sequences are generally composed of four bases, A, T, G and C, which makes for a very simple and efficient encoding for machine learning purposes.
sequence = AAATGCC
ohe_sequence = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]]
But once you take into account RNA sequences and the ambiguities that can sometimes occur in sequencing machines, the letters UYRWSKMDVHBXN are added. One-hot encoding each of those as its own category gives vectors of length 17, of which the last 13 entries are almost always 0.
This is very inefficient, and it also loses the biological meaning that these extra (ambiguous) letters carry: Y, for example, stands for either C or T.
And so I have made a dictionary that represents this biological meaning:
nucleotide_dict = {
    'A': 'A', 'T': 'T', 'U': 'T', 'G': 'G', 'C': 'C',
    'Y': ['C', 'T'], 'R': ['A', 'G'], 'W': ['A', 'T'],
    'S': ['G', 'C'], 'K': ['T', 'G'], 'M': ['C', 'A'],
    'D': ['A', 'T', 'G'], 'V': ['A', 'G', 'C'], 'H': ['A', 'T', 'C'], 'B': ['T', 'G', 'C'],
    'X': ['A', 'T', 'G', 'C'], 'N': ['A', 'T', 'G', 'C'],
}
But I can't figure out how to write an efficient one-hot-encoding script (or whether there is a way to do this with the scikit-learn module) that uses this dictionary to get a result like this:
sequence = ANTUYCC
ohe_sequence = [[1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]]
# or even better:
ohe_sequence = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0.5, 0, 0.5], [0, 0, 0, 1], [0, 0, 0, 1]]
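For reference, the second (frequency) form can be produced with a small lookup table built once from a dictionary like the one above, using the same A, T, G, C column order as the examples; the `code_to_vec` and `encode` names below are just illustrative:

```python
import numpy as np

# IUPAC ambiguity codes mapped to the plain bases they can stand for.
iupac = {
    'A': 'A', 'T': 'T', 'U': 'T', 'G': 'G', 'C': 'C',
    'Y': 'CT', 'R': 'AG', 'W': 'AT', 'S': 'GC', 'K': 'TG', 'M': 'CA',
    'D': 'ATG', 'V': 'AGC', 'H': 'ATC', 'B': 'TGC', 'X': 'ATGC', 'N': 'ATGC',
}

order = 'ATGC'  # column order of the one-hot vectors

# Precompute one frequency vector per code, e.g. 'Y' -> [0, 0.5, 0, 0.5],
# so encoding a sequence is just a dictionary lookup per character.
code_to_vec = {
    code: np.array([1.0 if base in bases else 0.0 for base in order]) / len(bases)
    for code, bases in iupac.items()
}

def encode(seq):
    return np.stack([code_to_vec[c] for c in seq.upper()])

print(encode("ANTUYCC"))
```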
This was fun! You can do this with a dictionary that maps each letter directly to its row vector. I added a scikit-learn transformer class because you mentioned you were using that module; see the transform below:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# Column order here is A, G, C, T; each ambiguity code gets a 1 in every
# column it could represent, and '-' (a gap) maps to all zeros.
nucleotide_dict = {
    "A": [1, 0, 0, 0],
    "G": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
    "Y": [0, 0, 1, 1],
    "R": [1, 1, 0, 0],
    "W": [1, 0, 0, 1],
    "S": [0, 1, 1, 0],
    "K": [0, 1, 0, 1],
    "M": [1, 0, 1, 0],
    "D": [1, 1, 0, 1],
    "V": [1, 1, 1, 0],
    "H": [1, 0, 1, 1],
    "B": [0, 1, 1, 1],
    "X": [1, 1, 1, 1],
    "N": [1, 1, 1, 1],
    "-": [0, 0, 0, 0],
}
class NucleotideEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, norm=True):
        self.norm = norm

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data.
        return self

    def transform(self, X):
        def encode(row):
            # row is a length-1 slice holding one sequence string.
            mat = np.array([nucleotide_dict[base] for base in row[0]])
            if self.norm:
                # Divide each row by its sum so ambiguity codes become
                # equal probabilities, e.g. N -> [0.25, 0.25, 0.25, 0.25].
                mat = mat / mat.sum(axis=1)[:, np.newaxis]
            return mat

        return np.apply_along_axis(encode, 1, X)
samples = np.array([["AAATGCC"], ["ANTUYCC"]])
print(NucleotideEncoder().fit_transform(samples))
Output:
[[[1. 0. 0. 0. ]
[1. 0. 0. 0. ]
[1. 0. 0. 0. ]
[0. 0. 0. 1. ]
[0. 1. 0. 0. ]
[0. 0. 1. 0. ]
[0. 0. 1. 0. ]]
[[1. 0. 0. 0. ]
[0.25 0.25 0.25 0.25]
[0. 0. 0. 1. ]
[0. 0. 0. 1. ]
[0. 0. 0.5 0.5 ]
[0. 0. 1. 0. ]
[0. 0. 1. 0. ]]]
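One caveat with the `norm=True` option: the `-` gap vector is all zeros, so row-normalizing it divides by zero and produces NaNs. If your sequences can contain gaps, a guarded normalization (the `safe_normalize` helper name is just illustrative) could look like this:

```python
import numpy as np

def safe_normalize(rows):
    # Row-normalize, but leave all-zero rows (e.g. the '-' gap vector)
    # as zeros instead of producing NaNs from 0/0.
    rows = np.asarray(rows, dtype=float)
    sums = rows.sum(axis=1, keepdims=True)
    return np.divide(rows, sums, out=np.zeros_like(rows), where=sums != 0)

print(safe_normalize([[0, 1, 0, 1], [0, 0, 0, 0]]))
```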