I have a dataset where I'm trying to predict what kind of DNA a data entry is from its DNA makeup. For example, the string `ATTAG...ACGAT` might translate to `EI`. The possible outputs are `EI`, `IE`, or `N`. The dataset can be investigated further here. I tried switching kernels from `linear` to `rbf`, but the results are the same. The SVM classifier seems to output `N` every time. Any ideas why? I'm a beginner with scikit-learn.
import pandas as pd
# 3190 total
training_data = pd.read_csv('new_training.csv')
test_data = pd.read_csv('new_test.csv')
frames = [training_data, test_data]
data = pd.concat(frames)
x = data.iloc[:, 0:59]
y = data.iloc[:, 60]
x = pd.get_dummies(x)
train_x = x.iloc[0:3000, :]
train_y = y.iloc[0:3000]
test_x = x.iloc[3000:3190, :]
test_y = y.iloc[3000:3190]
from sklearn import svm
from sklearn import preprocessing
clf = svm.SVC(kernel="rbf")
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(y)
print(label_encoder.transform(train_y))
clf.fit(train_x, label_encoder.transform(train_y))
for u in train_y.unique():
    print(u)
predictions = clf.predict(test_x)
correct = 0
total = len(predictions)
for i in range(total):
    prediction = label_encoder.inverse_transform(predictions[i])
    print('predicted %s and actual %s' % (prediction, test_y[i]))
    print(len(prediction))
    if prediction == test_y[i]:
        correct += 1
print('correct %d out of %d' % (correct, total))
First I import the training and test data, combine them, and split the result into x (inputs) and y (output labels). Then I convert x into dummy variables, going from the original 60 columns to roughly 300 columns, since each DNA position can be `A`, `T`, `G`, `C`, and sometimes `N`. Basically, each possible value at each position gets a 0 or 1. (Is there a better way to do this? scikit-learn doesn't support categorical encoding, and I did the best I could with this.) Then I separate the data again (I had to merge so that I could generate dummies over the whole data space).
From here, I just run the SVM code to fit the `x` and `y` labels and then predict on `test_x`. I also had to encode `y` from the string version to a numerical version. But it always produces `N`, which I feel is wrong. How do I fix this? Thank you!
I think the issue is the way the data is split into train and test sets. You have taken the first 3000 samples for training and the remaining 190 samples for testing. I found that with such training the classifier yields the true class label for all the test samples (score = 1.0). I also noticed that the last 190 samples of the dataset all have the same class label, namely `'N'`. Therefore the result you obtained is correct.
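A quick way to spot this kind of ordering problem is to count the class labels in each slice before training. The sketch below uses a small synthetic label column (a hypothetical stand-in, not the real dataset) whose last rows all belong to one class, and applies the same kind of positional split:

```python
import pandas as pd

# Hypothetical stand-in for the label column: the tail is all 'N',
# mimicking the situation described above on a much smaller scale.
y = pd.Series(['EI'] * 5 + ['IE'] * 5 + ['N'] * 10)

train_y = y.iloc[:15]   # positional split, analogous to y.iloc[0:3000]
test_y = y.iloc[15:]    # analogous to y.iloc[3000:3190]

# Count how often each class appears in each slice.
train_counts = train_y.value_counts().to_dict()
test_counts = test_y.value_counts().to_dict()

print('train:', train_counts)
print('test: ', test_counts)
```

If the test slice contains only one class, a classifier that predicts that class for every test sample is not necessarily broken.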
I would recommend splitting the dataset into train and test sets with `ShuffleSplit` and `test_size=.06` (this corresponds approximately to 190/3190, although to make the results easier to inspect I used `test_size=.01` in the sample run below). For the sake of simplicity, I would also suggest using `OneHotEncoder` to encode the categorical values of the features.
Here’s the full code (I have taken the liberty to perform some refactoring):
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import ShuffleSplit
from sklearn import svm
# dtype=str loads every field as a string (dtype='string' is not accepted by current NumPy)
data = np.loadtxt(r'splice.data', delimiter=',', dtype=str)
bases = {'A': 0, 'C': 1, 'D': 2, 'G': 3, 'N': 4, 'R': 5, 'S': 6, 'T': 7}
X_base = np.asarray([[bases[c] for c in seq.strip()] for seq in data[:, 2]])
y_class = data[:, 0]
enc = OneHotEncoder(n_values=len(bases))
lb = LabelEncoder()
enc.fit(X_base)
lb.fit(y_class)
X = enc.transform(X_base).toarray()
y = lb.transform(y_class)
rs = ShuffleSplit(n_splits=1, test_size=.01, random_state=0)
# next(...) works on both Python 2 and Python 3, unlike the .next() method
train_index, test_index = next(rs.split(X))
train_X, train_y = X[train_index], y[train_index]
test_X, test_y = X[test_index], y[test_index]
clf = svm.SVC(kernel="rbf")
clf.fit(train_X, train_y)
predictions = clf.predict(test_X)
Demo:
In [2]: lb.inverse_transform(predictions)
Out[2]:
array(['IE', 'EI', 'EI', 'EI', 'EI', 'IE', 'N', 'N', 'EI', 'N', 'N', 'IE',
       'IE', 'N', 'N', 'IE', 'EI', 'N', 'N', 'EI', 'IE', 'EI', 'IE', 'N',
       'EI', 'N', 'IE', 'N', 'EI', 'N', 'N', 'EI'],
      dtype='|S79')
In [3]: y_class[test_index]
Out[3]:
array(['IE', 'EI', 'EI', 'EI', 'EI', 'IE', 'N', 'N', 'EI', 'N', 'N', 'IE',
       'IE', 'N', 'N', 'IE', 'EI', 'N', 'N', 'EI', 'IE', 'EI', 'IE', 'N',
       'IE', 'N', 'IE', 'N', 'EI', 'N', 'N', 'EI'],
      dtype='|S79')
In [4]: clf.score(test_X, test_y)
Out[4]: 0.96875
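The score is simply the fraction of test samples whose predicted label matches the true one. As a sanity check, the two label arrays from the demo can be compared directly. The variable names `predicted` and `actual` are only illustrative:

```python
import numpy as np

# Labels copied from the demo output: predictions first, true classes second.
predicted = np.array(
    ['IE', 'EI', 'EI', 'EI', 'EI', 'IE', 'N', 'N', 'EI', 'N', 'N', 'IE',
     'IE', 'N', 'N', 'IE', 'EI', 'N', 'N', 'EI', 'IE', 'EI', 'IE', 'N',
     'EI', 'N', 'IE', 'N', 'EI', 'N', 'N', 'EI'])
actual = np.array(
    ['IE', 'EI', 'EI', 'EI', 'EI', 'IE', 'N', 'N', 'EI', 'N', 'N', 'IE',
     'IE', 'N', 'N', 'IE', 'EI', 'N', 'N', 'EI', 'IE', 'EI', 'IE', 'N',
     'IE', 'N', 'IE', 'N', 'EI', 'N', 'N', 'EI'])

matches = predicted == actual   # element-wise comparison
accuracy = matches.mean()       # fraction of correct predictions

print('%d correct out of %d -> %s' % (matches.sum(), len(matches), accuracy))
```

The arrays differ in a single position (an `IE` sample predicted as `EI`), so 31 of 32 predictions are correct, which is the 0.96875 reported by `clf.score`.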
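Note: The code above was written against sklearn 0.18.1 and might not work on other versions. In particular, the `n_values` parameter of `OneHotEncoder` was deprecated in 0.20 and removed in 0.22, so newer versions need the `categories` parameter instead.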
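For newer scikit-learn releases, the same idea can be sketched without `n_values`. This is a minimal illustration, not a drop-in replacement: it assumes scikit-learn 0.20 or later (where `OneHotEncoder` accepts string input and a `categories` argument), runs on synthetic sequences rather than the real file, and swaps `ShuffleSplit` for `train_test_split` with `stratify` so that every class appears in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

rng = np.random.RandomState(0)
alphabet = list('ACGT')

# Synthetic stand-in data (hypothetical): 300 sequences of length 10 whose
# class depends only on the first base.
seqs = rng.choice(alphabet, size=(300, 10))
labels = np.where(seqs[:, 0] == 'A', 'EI',
                  np.where(seqs[:, 0] == 'C', 'IE', 'N'))

# One category list per sequence position; symbols outside the alphabet
# are encoded as all zeros instead of raising an error.
enc = OneHotEncoder(categories=[alphabet] * seqs.shape[1],
                    handle_unknown='ignore')
X = enc.fit_transform(seqs)  # sparse matrix, which SVC accepts directly

# Shuffled, stratified split: class proportions are preserved in both parts.
train_X, test_X, train_y, test_y = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0)

clf = SVC(kernel='rbf')
clf.fit(train_X, train_y)  # SVC handles string class labels itself
score = clf.score(test_X, test_y)
print(score)
```

Passing the string labels straight to `SVC` avoids the separate `LabelEncoder` step, since the classifier encodes and decodes class labels internally.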