machine-learning scikit-learn scientific-computing

Python (scikit-learn) LDA collapsing to a single dimension


I'm very new to scikit-learn and machine learning in general.

I am currently designing an SVM to predict whether a specific amino acid sequence will be cut by a protease. So far the SVM method seems to be working quite well (image: sensitivity and specificity of one of my SVM models).

I'd like to visualize the distance between the two categories (cut and uncut), so I'm trying to use linear discriminant analysis (LDA), which is similar to principal component analysis (PCA), with the following code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
targs = np.array([1 if _ else 0 for _ in XOR_list])  # 1 = cleaved, 0 = not cleaved
DATA = np.array(data_list)
X_r2 = lda.fit(DATA, targs).transform(DATA)
plt.figure()
for c, i, target_name in zip("rg", [1, 0], ["Cleaved", "Not Cleaved"]):
    plt.scatter(X_r2[targs == i], X_r2[targs == i], c=c, label=target_name)
plt.legend()
plt.title('LDA of cleavage_site dataset')

However, the LDA only gives a 1D result:

In: print X_r2[:5]
Out: [[ 6.74369996]
 [ 4.14254941]
 [ 5.19537896]
 [ 7.00884032]
 [ 3.54707676]]


However, PCA gives 2 dimensions with the same input data:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_r = pca.fit(DATA).transform(DATA)
print X_r[:5]
Out: [[ 0.05474151  0.38401203]
 [ 0.39244191  0.74113729]
 [-0.56785236 -0.30109694]
 [-0.55633116 -0.30267444]
 [ 0.41311866 -0.25501662]]

Edit: here are links to two Google Docs with the input data. I am not using the sequence information, just the numerical information that follows it. The files are split between positive and negative control data. Input data: file1 file2


Solution

  • LDA is first and foremost a classifier, not a general-purpose dimensionality reduction technique like PCA. The projection people visualize is a side effect of its decision function and, unfortunately for your use case, the decision function of a binary (2-class) problem is 1-dimensional. There is nothing wrong with your code; this is what the decision function of every linear binary classifier looks like.

    In general, for 2 classes you get at most a 1-dimensional projection, and for K > 2 classes the LDA projection has at most K-1 dimensions (the decision function itself has K outputs). With other decomposition techniques (like 1 vs 1) you can go up to K(K-1)/2 decision values, but again, only for more than 2 classes.
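To make the point above concrete, here is a minimal sketch on synthetic Gaussian data (an assumption; it is not the cleavage dataset from the question, and the names `X2`, `X3`, `centers` and `plot_1d_projection` are made up for illustration). It checks the shape of the LDA projection for 2 and 3 classes, and that the 2-class projection carries the same information as the decision function. As far as I know, recent scikit-learn releases reject `n_components=2` for a 2-class problem with a ValueError instead of silently returning one column, so the 2-class fit below leaves `n_components` at its default.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in data (an assumption; not the asker's cleavage dataset):
# three Gaussian classes in 8 features, 100 samples each.
rng = np.random.default_rng(0)
centers = np.zeros((3, 8))
centers[1, 0] = 2.0  # class 1 is shifted along feature 0
centers[2, 1] = 2.0  # class 2 is shifted along feature 1
X3 = np.vstack([rng.normal(c, 1.0, size=(100, 8)) for c in centers])
y3 = np.repeat([0, 1, 2], 100)

# Two-class subset, analogous to "not cleaved" (0) vs "cleaved" (1).
X2, y2 = X3[y3 < 2], y3[y3 < 2]

# With the default n_components=None, LDA keeps
# min(n_features, n_classes - 1) axes, i.e. a single axis for 2 classes.
lda2 = LinearDiscriminantAnalysis().fit(X2, y2)
Z2 = lda2.transform(X2)
print(Z2.shape)  # (200, 1)

# The binary decision function is one value per sample, and it is an
# affine function of the single projected coordinate.
scores = lda2.decision_function(X2)
corr = np.corrcoef(Z2[:, 0], scores)[0, 1]
print(scores.shape, abs(corr))

# With three classes, two discriminant axes are available.
lda3 = LinearDiscriminantAnalysis(n_components=2).fit(X3, y3)
Z3 = lda3.transform(X3)
print(Z3.shape)  # (300, 2)


def plot_1d_projection(z, y, names=("Not Cleaved", "Cleaved")):
    """Plot a 1D LDA projection as overlapping per-class histograms."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    plt.figure()
    for label, color in zip([0, 1], "gr"):
        plt.hist(z[y == label, 0], bins=20, alpha=0.5, color=color,
                 label=names[label])
    plt.xlabel("LDA axis")
    plt.legend()
    plt.title("LDA projection (1D)")
```

Since a 2-class LDA only has one axis to show, calling `plot_1d_projection(Z2, y2)` followed by `plt.show()` is one way to visualize the separation between the two classes; a scatter plot of the single column against itself only draws points along a diagonal.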