
Count data points for each K-means cluster


I have a dataset for banknotes wavelet data of genuine and forged banknotes with 2 features which are:

  1. X axis: Variance of Wavelet Transformed image
  2. Y axis: Skewness of Wavelet Transformed image

I run on this dataset K-means to identify 2 clusters of the data which are basically genuine and forged banknotes.

Now I have 3 questions:

  1. How can I count the data points of each cluster?
  2. How can I set the color of each data point based on its cluster?
  3. How do I know whether a data point is genuine or forged without another feature in the data? I know the dataset has a "Class" column with values 1 and 2 for genuine and forged, but can I identify this without the "Class" feature?

My code:

import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as patches
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv('Banknote-authentication-dataset-all.csv')

V1 = data['V1']
V2 = data['V2']
bn_class = data['Class']


V1_min = np.min(V1)
V1_max = np.max(V1)

V2_min = np.min(V2)
V2_max = np.max(V2)

normed_V1 = (V1 - V1_min)/(V1_max - V1_min)
normed_V2 = (V2 - V2_min)/(V2_max - V2_min)

V1_mean = normed_V1.mean()
V2_mean = normed_V2.mean()

V1_std_dev = np.std(normed_V1)
V2_std_dev = np.std(normed_V2)

ellipse = patches.Ellipse([V1_mean, V2_mean], V1_std_dev*2, V2_std_dev*2, alpha=0.4)

V1_V2 = np.column_stack((normed_V1, normed_V2))

km_res = KMeans(n_clusters=2).fit(V1_V2)
clusters = km_res.cluster_centers_

plt.xlabel('Variance of Wavelet Transformed image')
plt.ylabel('Skewness of Wavelet Transformed image')
scatter = plt.scatter(normed_V1,normed_V2, s=10, c=bn_class, cmap='coolwarm')
#plt.scatter(V1_std_dev, V2_std_dev, s=400, alpha=0.5)
plt.scatter(V1_mean, V2_mean, s=400, alpha=0.8, c='lightblue')
plt.scatter(clusters[:,0], clusters[:,1], s=3000, c='orange', alpha=0.8)
unique = list(set(bn_class))

plt.text(1.1, 0, 'Kmeans cluster centers', bbox=dict(facecolor='orange'))
plt.text(1.1, 0.11, 'Arithmetic Mean', bbox=dict(facecolor='lightblue'))
plt.text(1.1, 0.33, 'Class 1 - Genuine Notes',color='white', bbox=dict(facecolor='blue'))
plt.text(1.1, 0.22, 'Class 2 - Forged Notes', bbox=dict(facecolor='red'))

plt.savefig('figure.png',bbox_inches='tight')

plt.show()

[Figure: resulting scatter plot of the normalized features, colored by class, with the arithmetic mean and K-means cluster centers highlighted]


Solution

    1. How to count the data points of each cluster

    You can do this easily by using fit_predict instead of fit, or by calling predict on your training data after fitting.

    Here's a working example:

    labels = KMeans(...).fit_predict(V1_V2)
    
    clusterCount = np.bincount(labels)
    

    clusterCount will now hold the number of points in each cluster. You can get the same result with fit followed by predict, although fit_predict is slightly more efficient because it reuses the labels computed during fitting:

    kM = KMeans(...).fit(V1_V2)
    labels = kM.predict(V1_V2)
    
    clusterCount = np.bincount(labels)
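
    As an alternative to np.bincount, np.unique with return_counts=True returns the cluster labels alongside the counts. A minimal self-contained sketch with made-up points (the array X here is a hypothetical stand-in for your stacked V1_V2 data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points standing in for the normalized (V1, V2) columns.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0], [0.95, 0.85]])

# fit_predict returns the cluster label of every training point directly.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Unique cluster labels alongside how many points fall in each.
clusters, counts = np.unique(labels, return_counts=True)
print(dict(zip(clusters, counts)))
```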
    
    2. To color each point by its cluster, pass kM.labels_ (or the output of kM.predict()) as the c argument:
    labels = kM.predict(V1_V2)
    
    plt.scatter(normed_V1, normed_V2, s=10, c=labels, cmap='coolwarm') # instead of c=bn_class
    
    3. For a new data point, notice how your KMeans quite nicely separates the majority of the two classes. This separability means you can use the fitted KMeans as a predictor. Simply call predict on the fitted model (note that predict expects a 2-D array, e.g. of shape (1, 2) for a single point):
    predictedClass = kM.predict(newDataPoint)
    

    Each cluster is then assigned the class that makes up the majority of its points; you could also report that majority fraction as a rough confidence estimate.
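
    A minimal sketch of that majority-vote mapping, using hypothetical points and labels in place of the real dataset (X stands in for V1_V2, y for the "Class" column):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical, well-separated 2-D points standing in for the normalized
# (V1, V2) pairs; y mimics the "Class" column (1 = genuine, 2 = forged).
X = np.array([[0.10, 0.20], [0.15, 0.25], [0.20, 0.10],
              [0.80, 0.90], [0.85, 0.80], [0.90, 0.95]])
y = np.array([1, 1, 1, 2, 2, 2])

kM = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Map each cluster label to the true class most common among its members.
cluster_to_class = {}
for c in np.unique(kM.labels_):
    classes, counts = np.unique(y[kM.labels_ == c], return_counts=True)
    cluster_to_class[c] = classes[np.argmax(counts)]

# Classify a new point: predict its cluster, then map it to the majority class.
# predict expects a 2-D array, hence the double brackets.
new_point = np.array([[0.12, 0.18]])
predicted_class = cluster_to_class[kM.predict(new_point)[0]]
```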