Tags: python, k-means, pca

Am I interpreting K-means results correctly?


I used a k-means elbow plot to find the optimal K for my data (after applying PCA) and got the plot shown below. I think the optimal K is 3 in my case, since that is where the curve bends sharply (the "elbow"). But looking at my X_PCA_1 vs. X_PCA_2 scatter plot, it seems the data could be split into only 2 clusters. Am I mistaken?

Note: I am still a beginner.

K-elbow

X_PCA_1 VS. X_PCA_2


Solution

  • To see the clusters more clearly, you can plot them in 3D. First, run PCA with 3 components:

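    from sklearn.decomposition import PCA
    # scaled_df is the standardized feature matrix from the question
    pca = PCA(n_components=3)
    X_pca = pca.fit_transform(scaled_df)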
    

    Then split the projected points into one array per principal component:

    # Each column of X_pca holds one principal component.
    X = X_pca[:, 0]
    Y = X_pca[:, 1]
    Z = X_pca[:, 2]
    

    From here you can use any library that supports 3D plots. The example below uses matplotlib and colors each point by its k-means label:

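    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.cluster import KMeans

    # Cluster on the scaled features; the PCA components are used only for plotting.
    model = KMeans(n_clusters=3)
    cluster_kmeans = model.fit_predict(scaled_df)

    df_graph = pd.DataFrame({'X': X,
                             'Y': Y,
                             'Z': Z,
                             'labels': cluster_kmeans
                             })

    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(111, projection='3d')

    # Draw one scatter series per cluster so each gets its own color and legend entry.
    for s in df_graph.labels.unique():
        mask = df_graph.labels == s
        ax.scatter(df_graph.X[mask], df_graph.Y[mask], df_graph.Z[mask], label=s)

    ax.legend()
    plt.show()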
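
A visual check can be backed up with numbers. The sketch below is only an illustration, not the asker's actual pipeline: it builds synthetic data with `make_blobs` (an assumption, standing in for `scaled_df`), projects it to 2 PCA components, and prints the inertia (the quantity on an elbow plot) and the silhouette score for several values of K. The helper name `score_k_range` is hypothetical. K-means is fitted here on the PCA projection, which is also an assumption; you can fit on the scaled features instead.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: not the asker's dataset).
raw, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
scaled = StandardScaler().fit_transform(raw)
X_pca = PCA(n_components=2).fit_transform(scaled)


def score_k_range(data, k_values, seed=0):
    """Return {k: (inertia, silhouette)} for each candidate number of clusters."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(data)
        results[k] = (model.inertia_, silhouette_score(data, labels))
    return results


scores = score_k_range(X_pca, range(2, 7))
for k, (inertia, sil) in scores.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")

# The K with the highest silhouette score is one candidate to compare with the elbow.
best_k = max(scores, key=lambda k: scores[k][1])
print("highest silhouette at k =", best_k)
```

If the elbow and the silhouette score point to different values of K (for example 3 versus 2), that usually means the cluster structure is not clear-cut, and plotting the labels for both choices is a reasonable next step.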