python scikit-learn bioinformatics one-hot-encoding

OneHotEncoding Protein Sequences


I have a dataframe of protein sequences and want to one-hot encode them and store the result in a new dataframe. I tried the following code, but I am not able to store what it produces:

Code:

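import numpy as np
from sklearn.preprocessing import OneHotEncoder

# x_train is my dataframe; it has a 'sequence' column of strings
onehot_encoder = OneHotEncoder()
sequence = np.array(list(x_train['sequence'])).reshape(-1, 1)
encoded_sequence = onehot_encoder.fit_transform(sequence).toarray()
encoded_sequence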


When I try to store the result, I get this error:

ValueError: Wrong number of items passed 12755, placement implies 1

Solution

  • You get that strange array because OneHotEncoder treats every whole sequence as a single category and one-hot encodes that. A small example shows this:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    df = pd.DataFrame({'sequence':['AQAVPW','AMAVLT','LDTGIN']})
    
    enc = OneHotEncoder()
    seq = np.array(df['sequence']).reshape(-1,1)
    encoded = enc.fit(seq)
    encoded.transform(seq).toarray()
    
    array([[0., 1., 0.],
           [1., 0., 0.],
           [0., 0., 1.]])
    
    encoded.categories_
    
    [array(['AMAVLT', 'AQAVPW', 'LDTGIN'], dtype=object)]
    

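    Since your entries are all unique, every sequence becomes its own category. You end up with one column per sequence and a matrix that is almost entirely zeros, with a single 1 in each row. This is easier to see with pd.get_dummies: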

    pd.get_dummies(df['sequence'])
    
       AMAVLT  AQAVPW  LDTGIN
    0       0       1       0
    1       1       0       0
    2       0       0       1
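
    The 12755 in the question's error is presumably the number of distinct sequences in x_train, since the encoder creates one column per category. The sketch below shows that relationship on the toy data and keeps the 2-D result in its own dataframe rather than in a single column. The assignment that failed is not shown in the question, so this is an assumption about what was attempted.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"sequence": ["AQAVPW", "AMAVLT", "LDTGIN"]})

enc = OneHotEncoder()
# Double brackets keep the input 2-D: one row per sequence, one feature.
encoded = enc.fit_transform(df[["sequence"]]).toarray()

# One output column per distinct value of the feature.
print(encoded.shape[1], df["sequence"].nunique())

# The result is 2-D, so store it as its own dataframe, not as one column.
encoded_df = pd.DataFrame(encoded, index=df.index, columns=enc.categories_[0])
print(encoded_df)
```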
    

    There are two ways to handle this. One is to simply count how often each amino acid occurs in a sequence and use those counts as predictors (I hope I have the amino acids right, school was a long time ago):

    # Requires Biopython; count_amino_acids() returns a dict of residue counts
    from Bio.SeqUtils.ProtParam import ProteinAnalysis
    
    pd.DataFrame([ProteinAnalysis(i).count_amino_acids() for i in df['sequence']])
    
        A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
    0   2   0   0   0   0   0   0   0   0   0   0   0   1   1   0   0   0   1   1   0
    1   2   0   0   0   0   0   0   0   0   1   1   0   0   0   0   0   1   1   0   0
    2   0   0   1   0   0   1   0   1   0   1   0   1   0   0   0   0   1   0   0   0
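
    If Biopython is not installed, the same counts can be sketched with the standard library. The helper names (count_residues, residue_fractions) and the 20-letter alphabet are assumptions made for this illustration. Dividing by the sequence length is an optional extra step that makes sequences of different lengths comparable.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: the 20 standard residues


def count_residues(seq):
    """Map each standard amino acid to its number of occurrences in seq."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) for aa in AMINO_ACIDS}


def residue_fractions(seq):
    """Same as count_residues, but divided by the sequence length."""
    length = len(seq)
    if length == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: n / length for aa, n in count_residues(seq).items()}


sequences = ["AQAVPW", "AMAVLT", "LDTGIN"]
rows = [count_residues(s) for s in sequences]
print(rows[0]["A"], rows[0]["W"])  # 2 1
```

    The list of dicts in rows can be passed straight to pd.DataFrame(rows) to get the same layout as the table above.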
    

    The other is to split each sequence into its residues and encode by position. This requires all sequences to have the same length, and enough memory, since the number of columns grows with the sequence length:

    byposition = df['sequence'].apply(lambda x:pd.Series(list(x)))
    byposition
    
        0   1   2   3   4   5
    0   A   Q   A   V   P   W
    1   A   M   A   V   L   T
    2   L   D   T   G   I   N
    
    pd.get_dummies(byposition)
    
        0_A 0_L 1_D 1_M 1_Q 2_A 2_T 3_G 3_V 4_I 4_L 4_P 5_N 5_T 5_W
    0   1   0   0   0   1   1   0   0   1   0   0   1   0   0   1
    1   1   0   0   1   0   1   0   0   1   0   1   0   0   1   0
    2   0   1   1   0   0   0   1   1   0   1   0   0   1   0   0