python matplotlib plot k-means

K-means clusters are not plotted correctly


I have data that I want to cluster with k-means:

[image: the dataset]

from sklearn.cluster import KMeans

num_clusters = 5
km = KMeans(n_clusters=num_clusters, init="random", max_iter=100, n_init=1)
km.fit(X)
print(km.labels_)
Output:
 [3 0 1 ... 2 0 0]

Then I made a plot:

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set() 
plt.scatter(X[:,0],X[:,1], c=km.labels_, cmap='rainbow')

But I got this result:

[image: the resulting scatter plot]

What could be the reason I got this result?


Solution

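  • You're plotting only the first two columns of X (race and gender) and coloring each point by the cluster K-means assigned to it. K-means, however, built those clusters from all the features, so there is no reason for them to look separated along just these two. The result is therefore not surprising.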

    I believe what you are looking for is a way to check visually that the clustering done by K-means makes sense. For that, you would have to visualise all the features K-means used to build the clusters, but there are 41 of them, and a plot can show no more than about 4 dimensions at once.

    A useful approach here is dimensionality reduction: much of the information in the 41 features can often be summarized in fewer (e.g. 2). For example, with principal component analysis (PCA) you can compress X into two features. Try the following:

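    from sklearn.decomposition import PCA

    # Project the 41 features onto the first two principal components
    X_pca = PCA(n_components=2).fit_transform(X)
    # Color each projected point by its K-means cluster label
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=km.labels_, cmap='rainbow')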
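
If you want to try the idea end to end without the original data, here is a minimal, self-contained sketch. Several things in it are assumptions and not part of the question: the data is synthetic (generated with make_blobs to have 41 features), the features are standardized first (optional, but PCA and K-means are both sensitive to feature scale), and n_init and random_state are picked arbitrarily. It also prints how much of the total variance the two components retain; if that share is low, the 2-D picture may hide separation that exists in the full feature space.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 500 rows, 41 features (assumed shape).
X, _ = make_blobs(n_samples=500, n_features=41, centers=5, random_state=0)

# Standardize so that no single feature dominates distances or variance
# (an assumption; the question clusters the raw X).
X_scaled = StandardScaler().fit_transform(X)

# Cluster on all 41 features, as in the question.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

# Reduce to two principal components, for plotting only.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Share of the total variance kept by the 2-D projection.
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.1%}")

# Scatter plot of the projection, colored by K-means cluster label.
fig, ax = plt.subplots()
ax.scatter(X_pca[:, 0], X_pca[:, 1], c=km.labels_, cmap="rainbow", s=10)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
```

In a notebook with %matplotlib inline the figure renders automatically; in a script, call plt.show() at the end. Note that the clustering is still done on all the features, and PCA is used only to draw the result.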