python pandas numpy scikit-learn sklearn-pandas

Prediction is always the same when using scikit-learn SVM

I have a dataset where I'm trying to predict what kind of DNA a data entry is from its DNA makeup. For example, the string ATTAG...ACGAT might translate to EI. The possible outputs are EI, IE, or N. The dataset can be investigated further here. I tried switching kernels from linear to rbf, but the results are the same. The SVM classifier seems to output N every time. Any ideas why? I'm a beginner with scikit-learn.

import pandas as pd
# 3190 total
training_data = pd.read_csv('new_training.csv')
test_data = pd.read_csv('new_test.csv')
frames = [training_data, test_data]
data = pd.concat(frames)
x = data.iloc[:, 0:59]
y = data.iloc[:, 60]

x = pd.get_dummies(x)
train_x = x.iloc[0:3000, :]
train_y = y.iloc[0:3000]
test_x = x.iloc[3000:3190, :]
test_y = y.iloc[3000:3190]

from sklearn import svm
from sklearn import preprocessing

clf = svm.SVC(kernel="rbf")
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(y)

print(label_encoder.transform(train_y))
clf.fit(train_x, label_encoder.transform(train_y))

for u in train_y.unique():
    print(u)

predictions = clf.predict(test_x)

correct = 0
total = len(predictions)
for i in range(total):
    prediction = label_encoder.inverse_transform(predictions[i])
    print('predicted %s and actual %s' % (prediction, test_y[i]))
    print(len(prediction))
    if prediction == test_y[i]:
        correct += 1

print('correct %d out of %d' % (correct, total))

First I import the training and test data, combine them, and split the result into x (inputs) and y (output label). Then I convert x into dummy variables, going from the original 60 columns to roughly 300 columns, since each DNA position can be A, T, G, C and sometimes N. Basically, every possible value at every position gets its own 0/1 column. (Is there a better way to do this? scikit-learn doesn't support categorical encoding, and this was the best I could do.) Then I separate the data again (I had to merge so that I could generate dummies over the whole data space).

From here, I just fit the SVM on the training x and y and then predict on test_x. I also had to encode y from its string form to a numerical form. But it always produces N, which seems wrong to me. How do I fix this? Thank you!

Solution

I think the issue is the way the data is split into train and test sets. You took the first 3000 samples for training and the remaining 190 samples for testing. With that split, I found that the classifier yields the true class label for every test sample (score = 1.0). I also noticed that the last 190 samples of the dataset all have the same class label, namely 'N'. Therefore the result you obtained is correct.
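One way to spot this kind of problem before training is to count the labels in each slice. The following is a minimal sketch on a small made-up label column (not the real dataset) whose tail is entirely 'N'; the 8/4 positional split is a scaled-down, hypothetical stand-in for the 3000/190 split.

```python
import pandas as pd

# Toy stand-in for the label column: mixed classes first, then a run of 'N'.
# These values are invented for illustration and are not the real data.
y = pd.Series(['EI', 'IE', 'N', 'EI', 'IE', 'EI', 'N', 'IE',
               'N', 'N', 'N', 'N'])

# Same kind of positional slicing as in the question, scaled down.
train_y = y.iloc[:8]
test_y = y.iloc[8:]

# Count how often each class occurs in each slice.
train_counts = train_y.value_counts()
test_counts = test_y.value_counts()
print(train_counts)
print(test_counts)

# A test slice containing a single class cannot show whether the model
# tells the classes apart at all.
test_is_single_class = test_y.nunique() == 1
print('test slice has one class only:', test_is_single_class)
```

If the test slice turns out to contain only one class, a constant prediction is exactly what a correct classifier would produce on it.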

I would recommend splitting the dataset into train and test sets with ShuffleSplit and test_size=.06 (this corresponds approximately to 190/3190, although I used test_size=.01 in the sample run below to make the results easier to visualize). For the sake of simplicity, I would also suggest using OneHotEncoder to encode the categorical feature values.

Here’s the full code (I have taken the liberty of doing some refactoring):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import ShuffleSplit
from sklearn import svm

# dtype=str reads every field as text and works on both Python 2 and Python 3
data = np.loadtxt(r'splice.data', delimiter=',', dtype=str)

bases = {'A': 0, 'C': 1, 'D': 2, 'G': 3, 'N': 4, 'R': 5, 'S': 6, 'T': 7}

X_base = np.asarray([[bases[c] for c in seq.strip()] for seq in data[:, 2]])
y_class = data[:, 0]

# n_values is the argument from the 0.18 era; later releases replaced it with categories
enc = OneHotEncoder(n_values=len(bases))
lb = LabelEncoder()

enc.fit(X_base)  
lb.fit(y_class)

X = enc.transform(X_base).toarray()
y = lb.transform(y_class)

rs = ShuffleSplit(n_splits=1, test_size=.01, random_state=0)
# next(...) works on both Python 2 and Python 3; generators have no .next() method on Python 3
train_index, test_index = next(rs.split(X))
train_X, train_y = X[train_index], y[train_index]
test_X, test_y = X[test_index], y[test_index]

clf = svm.SVC(kernel="rbf")
clf.fit(train_X, train_y)

predictions = clf.predict(test_X)

Demo:

In [2]: lb.inverse_transform(predictions)
Out[2]: 
array(['IE', 'EI', 'EI', 'EI', 'EI', 'IE', 'N', 'N', 'EI', 'N', 'N', 'IE',
       'IE', 'N', 'N', 'IE', 'EI', 'N', 'N', 'EI', 'IE', 'EI', 'IE', 'N',
       'EI', 'N', 'IE', 'N', 'EI', 'N', 'N', 'EI'], 
      dtype='|S79')

In [3]: y_class[test_index]
Out[3]: 
array(['IE', 'EI', 'EI', 'EI', 'EI', 'IE', 'N', 'N', 'EI', 'N', 'N', 'IE',
       'IE', 'N', 'N', 'IE', 'EI', 'N', 'N', 'EI', 'IE', 'EI', 'IE', 'N',
       'IE', 'N', 'IE', 'N', 'EI', 'N', 'N', 'EI'], 
      dtype='|S79')

In [4]: clf.score(test_X, test_y)
Out[4]: 0.96875
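
Accuracy alone can hide the kind of problem described in the question, so it may help to look at the per-class breakdown too. The sketch below copies the predicted and actual labels from the demo output above and computes a confusion matrix, along with the accuracy of a trivial classifier that always answers 'N'. The variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels copied from Out[2] (predicted) and Out[3] (actual) in the demo.
predicted = np.array([
    'IE', 'EI', 'EI', 'EI', 'EI', 'IE', 'N', 'N', 'EI', 'N', 'N', 'IE',
    'IE', 'N', 'N', 'IE', 'EI', 'N', 'N', 'EI', 'IE', 'EI', 'IE', 'N',
    'EI', 'N', 'IE', 'N', 'EI', 'N', 'N', 'EI'])
actual = np.array([
    'IE', 'EI', 'EI', 'EI', 'EI', 'IE', 'N', 'N', 'EI', 'N', 'N', 'IE',
    'IE', 'N', 'N', 'IE', 'EI', 'N', 'N', 'EI', 'IE', 'EI', 'IE', 'N',
    'IE', 'N', 'IE', 'N', 'EI', 'N', 'N', 'EI'])

labels = ['EI', 'IE', 'N']

# Overall fraction of correct predictions.
acc = accuracy_score(actual, predicted)

# Rows are actual classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(actual, predicted, labels=labels)

# Accuracy of a classifier that always predicts 'N' on the same samples.
baseline = np.mean(actual == 'N')

print('accuracy:', acc)
print(cm)
print("always-'N' baseline:", baseline)
```

On these 32 samples the only error is one IE sample predicted as EI, while always answering 'N' would be right for just 13 of 32. A shuffled test set makes that gap visible, which the all-'N' tail of the file could not.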

Note: Make certain your sklearn version is 0.18.1, otherwise the code above might not work.
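
For readers on a newer scikit-learn, here is a rough sketch of the same idea. It assumes a release where OneHotEncoder accepts string columns and a categories argument (0.20 or later, to the best of my knowledge). It is not the code that produced the demo above. Because splice.data is not bundled here, the sequences are synthetic, with an invented two-letter marker per class. The stratified train_test_split is a swapped-in alternative to ShuffleSplit that keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

rng = np.random.RandomState(0)
seq_len, per_class = 60, 100

# Synthetic stand-in for the real file: random A/C/G/T characters.
chars = rng.choice(list('ACGT'), size=(3 * per_class, seq_len))

# Labels sorted by class, so the tail is all 'N' as in the original file.
labels = np.repeat(['EI', 'IE', 'N'], per_class)

# Invented markers so the classifier has something to learn (assumption).
chars[labels == 'EI', 30:32] = ['G', 'T']
chars[labels == 'IE', 28:30] = ['A', 'G']

# One-hot encode every position over the same alphabet as the `bases` dict.
alphabet = ['A', 'C', 'D', 'G', 'N', 'R', 'S', 'T']
enc = OneHotEncoder(categories=[alphabet] * seq_len)
X = enc.fit_transform(chars).toarray()

# Shuffle and stratify so every class appears in both parts.
train_X, test_X, train_y, test_y = train_test_split(
    X, labels, test_size=0.06, stratify=labels, random_state=0)

# SVC accepts string labels directly, so no LabelEncoder is needed here.
clf = SVC(kernel='rbf')
clf.fit(train_X, train_y)

print('classes in test set:', sorted(set(test_y)))
print('accuracy on synthetic data: %.3f' % clf.score(test_X, test_y))
```

The accuracy printed here says nothing about the real dataset, since the data is made up. The point is only the encoding and splitting pattern.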