Tags: python-3.x, k-means, scatter-plot, pca

Is it correct to view PCA scatter plot using k-means predicted labels


For the data given here:

feat_1        feat_2        Label
 4.818919448  -8.88997718   0
 2.239877125  -7.142062835  0
 2.715454379  -9.392740116  0
 1.457970779  -9.295304121  0
 3.396769719  -4.696564243  0
-0.251264375  -3.11639814   0
 1.553138885  -2.56360423   0
 2.556077961  -1.639727669  0
 3.264100784  -5.353501855  0
 5.54079929   -2.810777111  0
-2.063969924   0.127805678  1
-1.691797179   0.835738844  1
-1.350084344   0.469993022  1
-1.672611658   0.873301506  1
-1.956488821   0.804911876  1
-1.529121941   1.112561558  1
-2.091905556   0.72908025   1
-1.835806179   0.801126086  1
-1.963433251   0.558394092  1
-2.576833733  -0.148751731  1
 5.262121279  -0.291153029  2
 4.150999653   4.60229228   2
 2.538967939   5.642889255  2
 9.908816157   2.380103599  2
 9.876931469   2.29522071   2
 6.691577612  -2.214740473  2
11.75361142    9.650193692  2
 4.099660592   5.048216039  2
 8.49165607    2.47194124   2
 8.243607045   2.831411268  2

X is the feature matrix (the first two columns of the table) and y is the label vector (the third column).

I am using PCA then doing a k-means clustering.

CODE

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = df.drop(columns=['Label']).values
y = df['Label'].values

# fit PCA and transform the data in one step (no need to call fit twice)
pca = PCA()
x_pca = pca.fit_transform(X)

from sklearn.cluster import KMeans

k_means = KMeans(n_clusters=3, random_state=42)
kmeans_labels = k_means.fit_predict(x_pca)  # fit and assign cluster labels in one step
kmeans_labels

target_names = ['class_0', 'class_1', 'class_2']
plt.figure(figsize=(8,6))
plot = plt.scatter(x_pca[:,0],x_pca[:,1],c=y,s=20, cmap=plt.cm.jet, linewidths=0, alpha=0.5)
plt.scatter(k_means.cluster_centers_[:,0], k_means.cluster_centers_[:,1], marker="x", color='k', s=40)
plt.legend(handles=plot.legend_elements()[0], labels=list(target_names))
plt.xlabel('PC 1')  # the plotted axes are principal components, not the raw features
plt.ylabel('PC 2')
plt.title('KMeans')
plt.show()
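A side note on the PCA step above: with only two input features, PCA keeps at most two components, so it acts as a rotation of the data rather than a dimensionality reduction. A minimal sketch (on toy data, assuming scikit-learn is available) shows how to check this via explained_variance_ratio_:

```python
import numpy as np
from sklearn.decomposition import PCA

# toy data: 30 points in 2D, with more variance along the first axis
rng = np.random.RandomState(42)
X = rng.normal(size=(30, 2)) @ np.array([[3.0, 0.0], [0.0, 1.0]])

pca = PCA()
x_pca = pca.fit_transform(X)

# same dimensionality in and out: PCA here is just a rotation/reordering
print(x_pca.shape)
# fractions of total variance per component, sorted descending, summing to 1
print(pca.explained_variance_ratio_)
```

If the first ratio is close to 1.0, almost all of the spread lies along PC 1, which also tells you which axis of the scatter plot carries the structure.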

If I use c=y in the plt.scatter plot, I get this:

(image: PCA scatter plot colored by the original labels y)

If I use c=kmeans_labels in the plt.scatter plot, I then get this:

(image: PCA scatter plot colored by kmeans_labels)

The second plot separates the classes nicely.

Is this a correct view?

Also, can this data separation be used to train a model like this:

X_train, X_test, y_train, y_test = train_test_split(x_pca, kmeans_labels, test_size=0.3, random_state=42)

or do I have to stick with the original labels like this:

X_train, X_test, y_train, y_test = train_test_split(x_pca, y, test_size=0.3, random_state=42)

where: y = df['Label'].values?

Thanks for your help and time!


Solution

  • When you color the plot with k-means labels, you are visualizing how the algorithm clustered the data while ignoring the original labels. Since your data is already labeled, clustering adds nothing here: the first visualization (colored by y) is the correct one, and for the same reason you should train any model on the original labels only, not on kmeans_labels.
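Rather than judging agreement between the clusters and the true labels by eye, you can quantify it, for example with the adjusted Rand index. A minimal sketch on toy data (stand-in blobs, not your table; assumes scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# three well-separated blobs labeled 0/1/2
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in (0, 5, 10)])
y = np.repeat([0, 1, 2], 20)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# ARI is 1.0 when clusters match the labels perfectly (up to renaming)
# and close to 0.0 for random assignments
ari = adjusted_rand_score(y, kmeans_labels)
print(round(ari, 2))
```

A low ARI on your real data would confirm that the clean-looking second plot is an artifact of k-means inventing its own partition, not a recovery of your classes.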

    Based on the first visualization, your classes are heavily intertwined, so a simple model will probably struggle to separate them. If possible, I would recommend additional feature engineering before training any model; for more specific recommendations we would need more information about your data.
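The recommended workflow, then, is to split and train on the original labels y, exactly as in your second train_test_split call. A minimal sketch (toy blobs stand in for your table; assumes scikit-learn, and LogisticRegression is just one example of a simple baseline classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# stand-in data: three blobs along the diagonal, labeled 0/1/2
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc, 1.0, size=(30, 2)) for loc in (0, 4, 8)])
y = np.repeat([0, 1, 2], 30)

# split on the ORIGINAL labels, never on the k-means output
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

stratify=y keeps the class proportions the same in the train and test splits, which matters with only 10 samples per class in your actual data.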