For the data given here:
feat_1 feat_2 Label
4.818919448 -8.88997718 0
2.239877125 -7.142062835 0
2.715454379 -9.392740116 0
1.457970779 -9.295304121 0
3.396769719 -4.696564243 0
-0.251264375 -3.11639814 0
1.553138885 -2.56360423 0
2.556077961 -1.639727669 0
3.264100784 -5.353501855 0
5.54079929 -2.810777111 0
-2.063969924 0.127805678 1
-1.691797179 0.835738844 1
-1.350084344 0.469993022 1
-1.672611658 0.873301506 1
-1.956488821 0.804911876 1
-1.529121941 1.112561558 1
-2.091905556 0.72908025 1
-1.835806179 0.801126086 1
-1.963433251 0.558394092 1
-2.576833733 -0.148751731 1
5.262121279 -0.291153029 2
4.150999653 4.60229228 2
2.538967939 5.642889255 2
9.908816157 2.380103599 2
9.876931469 2.29522071 2
6.691577612 -2.214740473 2
11.75361142 9.650193692 2
4.099660592 5.048216039 2
8.49165607 2.47194124 2
8.243607045 2.831411268 2
where X is given by the features (the first two columns of the table) and the labels y are given by the third column.
I am applying PCA and then running k-means clustering.
CODE
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = df.drop(columns=['Label']).values
y = df['Label'].values

# Fit PCA once and project the features onto the principal components
pca = PCA()
x_pca = pca.fit_transform(X)

k_means = KMeans(n_clusters=3, random_state=42)
k_means.fit(x_pca)
kmeans_labels = k_means.predict(x_pca)
target_names = ['class_0', 'class_1', 'class_2']
plt.figure(figsize=(8,6))
plot = plt.scatter(x_pca[:,0],x_pca[:,1],c=y,s=20, cmap=plt.cm.jet, linewidths=0, alpha=0.5)
plt.scatter(k_means.cluster_centers_[:,0], k_means.cluster_centers_[:,1], marker="x", color='k', s=40)
plt.legend(handles=plot.legend_elements()[0], labels=list(target_names))
plt.xlabel('PC 1')  # axes are principal components, not the original features
plt.ylabel('PC 2')
plt.title('KMeans')
plt.show()
If I use c=y
in the plt.scatter plot, I get this:
If I use c=kmeans_labels
in the plt.scatter plot, I then get this:
The second plot separates the classes nicely.
Is this a correct view?
Also, can this separation be used to train a model, like this:
X_train, X_test, y_train, y_test = train_test_split(x_pca, kmeans_labels, test_size=0.3, random_state=42)
or do I have to stick with the original labels like this:
X_train, X_test, y_train, y_test = train_test_split(x_pca, y, test_size=0.3, random_state=42)
where: y = df['Label'].values
?
Thanks for your help and time!
When you use the k-means labels for visualization, you are showing how the data clusters while ignoring the original labels. Since your data already comes with labels, clustering is unnecessary here. So the first visualization (colored by y) is the correct one, and in the same way you should train any model on the original labels, not on the cluster assignments.
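To make the distinction concrete, here is a minimal sketch of the two roles the labels play. It uses `make_blobs` as a stand-in for your 30-row, 2-feature, 3-class table (an assumption, so the numbers will differ from yours): the cluster assignments can be *compared* against the true labels with the adjusted Rand index, but the classifier itself is fit on the original `y`.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Stand-in data shaped like the question's table (3 classes, 2 features)
X, y = make_blobs(n_samples=30, centers=3, n_features=2, random_state=42)

x_pca = PCA().fit_transform(X)
kmeans_labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(x_pca)

# Diagnostic only: how well do the clusters happen to agree with y?
# (1.0 = identical partition up to relabeling, ~0.0 = random)
print(adjusted_rand_score(y, kmeans_labels))

# Training uses the ORIGINAL labels y, never kmeans_labels
X_train, X_test, y_train, y_test = train_test_split(
    x_pca, y, test_size=0.3, random_state=42)
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

If you trained on `kmeans_labels` instead, the model would only learn to reproduce k-means' partition of the feature space, not the ground truth you actually care about.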
Based on the first visualization, your classes appear quite intertwined, so a simple model will probably struggle to separate them. If possible, I would recommend additional feature engineering before fitting any model. For more specific recommendations, we would need more information about your data.
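One quick way to quantify how separable the classes really are, before investing in feature engineering, is a cross-validated baseline on the original labels. A sketch (again on `make_blobs` stand-in data, since I don't have your full dataset; with your data you would pass your own `X` and `y`):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Stand-in for the question's 30-row, 2-feature, 3-class data
X, y = make_blobs(n_samples=30, centers=3, n_features=2, random_state=0)

# 5-fold cross-validated accuracy of a simple linear baseline;
# a score near chance (~0.33 for 3 balanced classes) suggests the
# raw features need engineering, a high score suggests they don't
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())
```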